Thread: better atomics

better atomics

From
Robert Haas
Date:
On Wed, Oct 16, 2013 at 12:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Partially that only works because we sprinkle 'volatile's on lots of
> places. That can actually hurt performance... it'd usually be something
> like:
> #define pg_atomic_load(uint32) (*(volatile uint32 *)(atomic))
>
> Even if not needed in some places because a variable is already
> volatile, it's very helpful in pointing out what happens and where you
> need to be careful.

That's fine with me.

>> > * uint32 pg_atomic_exchange_u32(uint32 *ptr, uint32 val)
>> > * bool pg_atomic_compare_exchange_u32(uint32 *ptr, uint32 *expected, uint32 newval)
>> > * uint32 pg_atomic_fetch_add_u32(uint32 *ptr, uint32 add)
>> > * uint32 pg_atomic_fetch_sub_u32(uint32 *ptr, uint32 add)
>> > * uint32 pg_atomic_fetch_and_u32(uint32 *ptr, uint32 add)
>> > * uint32 pg_atomic_fetch_or_u32(uint32 *ptr, uint32 add)
>>
>> Do we really need all of those?  Compare-and-swap is clearly needed,
>> and fetch-and-add is also very useful.  I'm not sure about the rest.
>> Not that I necessarily object to having them, but it will be a lot
>> more work.
>
> Well, _sub can clearly be implemented with _add generically. I find code
> using _sub much easier to read than _add(-whatever), but that's maybe
> just me.

Yeah, I wouldn't bother with _sub.

> But _and, _or are really useful because they can be used to atomically
> clear and set flag bits.

Until we have some concrete application that requires this
functionality for adequate performance, I'd be inclined not to bother.I think we have bigger fish to fry, and (to
extendmy nautical
 
metaphor) trying to get this working everywhere might amount to trying
to boil the ocean.

>> > * u64 variants of the above
>> > * bool pg_atomic_test_set(void *ptr)
>> > * void pg_atomic_clear(void *ptr)
>>
>> What are the intended semantics here?
>
> It's basically TAS() without defining whether setting a value that needs
> to be set is a 1 or a 0. Gcc provides this and I think we should make
> our spinlock implementation use it if there is no special cased
> implementation available.

I'll reserve judgement until I see the patch.

>> More general question: how do we move the ball down the field in this
>> area?  I'm willing to do some of the legwork, but I doubt I can do all
>> of it, and I really think we need to make some progress here soon, as
>> it seems that you and I and Ants are all running into the same
>> problems in slightly different ways.  What's our strategy for getting
>> something done here?
>
> That's a pretty good question.
>
> I am willing to write the gcc implementation, plus the generic wrappers
> and provide the infrastructure to override it per-platform. I won't be
> able to do anything about non-linux, non-gcc platforms in that timeframe
> though.
>
> I was thinking of something like:
> include/storage/atomic.h
> include/storage/atomic-gcc.h
> include/storage/atomic-msvc.h
> include/storage/atomic-acc-ia64.h
> where atomic.h first has a list of #ifdefs including the specialized
> files and then lots of #ifndef providing generic variants if not
> already provided by the platorm specific file.

Seems like we might want to put some of this in one of the various
directories called "port", or maybe a new subdirectory of one of them.This basically sounds sane, but I'm unwilling to
committo dropping
 
support for all obscure platforms we currently support unless there
are about 500 people +1ing the idea, so I think we need to think about
what happens when we don't have platform support for these primitives.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics

From
Andres Freund
Date:
On 2013-10-16 14:30:34 -0400, Robert Haas wrote:
> > But _and, _or are really useful because they can be used to atomically
> > clear and set flag bits.
> 
> Until we have some concrete application that requires this
> functionality for adequate performance, I'd be inclined not to bother.
>  I think we have bigger fish to fry, and (to extend my nautical
> metaphor) trying to get this working everywhere might amount to trying
> to boil the ocean.

Well, I actually have a very, very preliminary patch using them in the
course of removing the spinlocks in PinBuffer() (via LockBufHdr()).

I think it's also going to be easier to add all required atomics at once
than having to go over several platforms in independent feature
patches adding atomics one by one.

> > I was thinking of something like:
> > include/storage/atomic.h
> > include/storage/atomic-gcc.h
> > include/storage/atomic-msvc.h
> > include/storage/atomic-acc-ia64.h
> > where atomic.h first has a list of #ifdefs including the specialized
> > files and then lots of #ifndef providing generic variants if not
> > already provided by the platorm specific file.

> Seems like we might want to put some of this in one of the various
> directories called "port", or maybe a new subdirectory of one of them.

Hm. I'd have to be src/include/port, right? Not sure if putting some
files there will be cleaner, but I don't mind it either.

>  This basically sounds sane, but I'm unwilling to commit to dropping
> support for all obscure platforms we currently support unless there
> are about 500 people +1ing the idea, so I think we need to think about
> what happens when we don't have platform support for these primitives.

I think the introduction of the atomics really isn't much dependent on
the removal. The only thing I'd touch around platforms in that patch is
adding a generic fallback pg_atomic_test_and_set() to s_lock.h and
remove the special case usages of __sync_lock_test_and_set() (Arm64,
some 32 bit arms).
It's the patches that would like to rely on atomics that will have the
problem :(

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-10-16 14:30:34 -0400, Robert Haas wrote:
>>> But _and, _or are really useful because they can be used to atomically
>>> clear and set flag bits.

>> Until we have some concrete application that requires this
>> functionality for adequate performance, I'd be inclined not to bother.
>> I think we have bigger fish to fry, and (to extend my nautical
>> metaphor) trying to get this working everywhere might amount to trying
>> to boil the ocean.

> Well, I actually have a very, very preliminary patch using them in the
> course of removing the spinlocks in PinBuffer() (via LockBufHdr()).

I think we need to be very, very wary of making our set of required
atomics any larger than it absolutely has to be.  The more stuff we
require, the closer we're going to be to making PG a program that only
runs well on Intel.  I am not comforted by the "gcc will provide good
implementations of the atomics everywhere" argument, because in fact
it won't.  See for example the comment associated with our only existing
use of gcc atomics:
* On ARM, we use __sync_lock_test_and_set(int *, int) if available, and if* not fall back on the SWPB instruction.
SWPBdoes not work on ARMv6 or* later, so the compiler builtin is preferred if available.  Note also that* the int-width
variantof the builtin works on more chips than other widths.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

That's not a track record that should inspire much faith that complete
sets of atomics will exist elsewhere.  What's more, we don't just need
atomics that *work*, we need atomics that are *fast*, else the whole
exercise turns into pessimization not improvement.  The more atomic ops
we depend on, the more likely it is that some of them will involve kernel
support on some chips, destroying whatever performance improvement we
hoped to get.

> ... The only thing I'd touch around platforms in that patch is
> adding a generic fallback pg_atomic_test_and_set() to s_lock.h and
> remove the special case usages of __sync_lock_test_and_set() (Arm64,
> some 32 bit arms).

Um, if we had a "generic" pg_atomic_test_and_set(), s_lock.h would be
about one line long.  The above quote seems to me to be exactly the kind
of optimism that is not warranted in this area.
        regards, tom lane



Re: better atomics

From
Andres Freund
Date:
On 2013-10-16 22:39:07 +0200, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-10-16 14:30:34 -0400, Robert Haas wrote:
> >>> But _and, _or are really useful because they can be used to atomically
> >>> clear and set flag bits.
> 
> >> Until we have some concrete application that requires this
> >> functionality for adequate performance, I'd be inclined not to bother.
> >> I think we have bigger fish to fry, and (to extend my nautical
> >> metaphor) trying to get this working everywhere might amount to trying
> >> to boil the ocean.
> 
> > Well, I actually have a very, very preliminary patch using them in the
> > course of removing the spinlocks in PinBuffer() (via LockBufHdr()).
> 
> I think we need to be very, very wary of making our set of required
> atomics any larger than it absolutely has to be.  The more stuff we
> require, the closer we're going to be to making PG a program that only
> runs well on Intel.

Well, essentially I am advocating to support basically three operations:
* compare and swap
* atomic add (and by that sub)
* test and set

The other operations are just porcelain around that.

With compare and swap both the others can be easily implemented if
neccessary.

Note that e.g. linux - running on all platforms we're talking about but
VAX - exensively and unconditionally uses atomic operations widely. So I
think we don't have to be too afraid about non-intel performance here.

>  I am not comforted by the "gcc will provide good
> implementations of the atomics everywhere" argument, because in fact
> it won't.  See for example the comment associated with our only existing
> use of gcc atomics:
> 
>  * On ARM, we use __sync_lock_test_and_set(int *, int) if available, and if
>  * not fall back on the SWPB instruction.  SWPB does not work on ARMv6 or
>  * later, so the compiler builtin is preferred if available.  Note also that
>  * the int-width variant of the builtin works on more chips than other widths.
>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> That's not a track record that should inspire much faith that complete
> sets of atomics will exist elsewhere.  What's more, we don't just need
> atomics that *work*, we need atomics that are *fast*, else the whole
> exercise turns into pessimization not improvement.  The more atomic ops
> we depend on, the more likely it is that some of them will involve kernel
> support on some chips, destroying whatever performance improvement we
> hoped to get.

Hm. I can't really follow. We *prefer* to use __sync_lock_test_and_set
in contrast to our own open-coded implementation, right? And the comment
about some hardware only supporting "int-width" matches that I only want to
require u32 atomics support, right?

I completely agree that we cannot rely on 8byte math or similar (16byte
cmpxchg) to be supported by all platforms. That indeed would require
kernel fallbacks on e.g. ARM.

> > ... The only thing I'd touch around platforms in that patch is
> > adding a generic fallback pg_atomic_test_and_set() to s_lock.h and
> > remove the special case usages of __sync_lock_test_and_set() (Arm64,
> > some 32 bit arms).
> 
> Um, if we had a "generic" pg_atomic_test_and_set(), s_lock.h would be
> about one line long.  The above quote seems to me to be exactly the kind
> of optimism that is not warranted in this area.

Well, I am somewhat hesitant to change s_lock for more platforms than
necessary. So I proposed to restructure it in a way that leaves all
existing implementations in place that do not already rely on
__sync_lock_test_and_set().

There's also SPIN_DELAY btw, which I don't see any widely provided
intrinsics for.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics

From
Robert Haas
Date:
On Wed, Oct 16, 2013 at 12:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I have a related problem, which is that some code I'm currently
>> working on vis-a-vis parallelism can run lock-free on platforms with
>> atomic 8 bit assignment but needs a spinlock or two elsewhere.  So I'd
>> want to use pg_atomic_store_u64(), but I'd also need a clean way to
>> test, at compile time, whether it's available.
>
> Yes, definitely. There should be a couple of #defines that declare
> whether non-prerequisite options are supported. I'd guess we want at least:
> * 8byte math
> * 16byte compare_and_swap

I'm not terribly excited about relying on 16-byte CAS, but I agree
that 8-byte math, at least, is important.  I've not been successful in
finding any evidence that gcc has preprocessor symbols to tell us
about the properties of 8-byte loads and stores.  The closest thing
that I found is:

http://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html

That page provides intrinsics for 8-byte atomic loads and stores,
among other things.  But it seems that the only method for determining
whether they work on a particular target is to compile a test program
and see whether it complains about __atomic_load_n and/or
__atomic_store_n being unresolved symbols.  I suppose we could add a
configure test for that.  Yuck.

Anyone have a better idea?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics

From
Andres Freund
Date:
On 2013-10-28 14:10:48 -0400, Robert Haas wrote:
> On Wed, Oct 16, 2013 at 12:52 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I have a related problem, which is that some code I'm currently
> >> working on vis-a-vis parallelism can run lock-free on platforms with
> >> atomic 8 bit assignment but needs a spinlock or two elsewhere.  So I'd
> >> want to use pg_atomic_store_u64(), but I'd also need a clean way to
> >> test, at compile time, whether it's available.
> >
> > Yes, definitely. There should be a couple of #defines that declare
> > whether non-prerequisite options are supported. I'd guess we want at least:
> > * 8byte math
> > * 16byte compare_and_swap
> 
> I'm not terribly excited about relying on 16-byte CAS, but I agree
> that 8-byte math, at least, is important.  I've not been successful in
> finding any evidence that gcc has preprocessor symbols to tell us
> about the properties of 8-byte loads and stores.  The closest thing
> that I found is:

I am talking about making 16byte CAS explicitly optional though? I think
if code wants to optionally make use of it (e.g. on x86 where it's been
available basically forever) that's fine. It just has to be optional.
Or are you saying you're simply not interested in 16byte CAS generally?

Same thing for 8byte math - there's no chance that that is going to be
supported over all platforms anytime soon, so it has to be optional.

> http://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
> 
> That page provides intrinsics for 8-byte atomic loads and stores,
> among other things.  But it seems that the only method for determining
> whether they work on a particular target is to compile a test program
> and see whether it complains about __atomic_load_n and/or
> __atomic_store_n being unresolved symbols.  I suppose we could add a
> configure test for that.  Yuck.

Well, that's pretty easy to integrate into configure - and works on
crossbuilds. So I think that'd be ok.

But I actually think this is going to be a manual thing, atomic 8byte
math will fall back to kernel emulation on several targets, so we
probably want some defines to explicitly declare it supported on targets
where that's not the case.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics

From
Robert Haas
Date:
On Mon, Oct 28, 2013 at 2:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I'm not terribly excited about relying on 16-byte CAS, but I agree
>> that 8-byte math, at least, is important.  I've not been successful in
>> finding any evidence that gcc has preprocessor symbols to tell us
>> about the properties of 8-byte loads and stores.  The closest thing
>> that I found is:
>
> I am talking about making 16byte CAS explicitly optional though? I think
> if code wants to optionally make use of it (e.g. on x86 where it's been
> available basically forever) that's fine. It just has to be optional.
> Or are you saying you're simply not interested in 16byte CAS generally?

I am just not interested in it generally.  Relying on too many OS or
architecture-specific primitives has several disadvantages.  It's
going to be a nuisance on more obscure platforms where they may or may
not be supported and may or may not work right even if supported.
Even once we get things working right everywhere, it'll mean that
performance may suffer on non-mainstream platforms.  And I think in
many cases it may suggest that we're using an architecture-specific
quirk to work around a fundamental problem with our algorithms.  I'm
more or less convinced of the need for 8-byte atomic reads and writes,
test-and-set, 8-byte compare-and-swap, and 8-byte fetch-and-add.  But
beyond that I'm less sold.  Most of the academic papers I've read on
implementing lock-free or highly-parallel constructs attempt to
confine themselves to 8-byte operations with 8-byte compare-and-swap,
and I'm a bit disposed to think we ought to try to hew to that as
well.  I'm not dead set against going further, but I lean against it,
for all of the reasons mentioned above.

> Same thing for 8byte math - there's no chance that that is going to be
> supported over all platforms anytime soon, so it has to be optional.

Agreed.

>> http://gcc.gnu.org/onlinedocs/gcc/_005f_005fatomic-Builtins.html
>>
>> That page provides intrinsics for 8-byte atomic loads and stores,
>> among other things.  But it seems that the only method for determining
>> whether they work on a particular target is to compile a test program
>> and see whether it complains about __atomic_load_n and/or
>> __atomic_store_n being unresolved symbols.  I suppose we could add a
>> configure test for that.  Yuck.
>
> Well, that's pretty easy to integrate into configure - and works on
> crossbuilds. So I think that'd be ok.
>
> But I actually think this is going to be a manual thing, atomic 8byte
> math will fall back to kernel emulation on several targets, so we
> probably want some defines to explicitly declare it supported on targets
> where that's not the case.

Hmm, OK.  I had not come across such cases.  There are architectures
where it Just Works (like x64_64), architectures where you can make it
work by using special instructions (like x86), and (presumably)
architectures where it doesn't work at all.  Places where it works
using really slow kernel emulation are yet another category to worry
about.  Unfortunately, I have not found any good source that describes
which architectures fall into which category.  Without that, pulling
this together seems intimidating, unless we just declare it working
for x86_64, disable it everywhere else, and wait for people running on
other architectures to complain.

I wonder whether it'd be safe to assume that any machine where
pointers are 8 bytes has 8-byte atomic loads and stores.  I bet there
is a counterexample somewhere.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics

From
Andres Freund
Date:
On 2013-10-28 15:02:41 -0400, Robert Haas wrote:
> On Mon, Oct 28, 2013 at 2:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> I'm not terribly excited about relying on 16-byte CAS, but I agree
> >> that 8-byte math, at least, is important.  I've not been successful in
> >> finding any evidence that gcc has preprocessor symbols to tell us
> >> about the properties of 8-byte loads and stores.  The closest thing
> >> that I found is:
> >
> > I am talking about making 16byte CAS explicitly optional though? I think
> > if code wants to optionally make use of it (e.g. on x86 where it's been
> > available basically forever) that's fine. It just has to be optional.
> > Or are you saying you're simply not interested in 16byte CAS generally?
> 
> I am just not interested in it generally.  Relying on too many OS or
> architecture-specific primitives has several disadvantages.  It's
> going to be a nuisance on more obscure platforms where they may or may
> not be supported and may or may not work right even if supported.
> Even once we get things working right everywhere, it'll mean that
> performance may suffer on non-mainstream platforms.

But it'll suffer against the *increased* performance on modern
platforms, it shouldn't suffer noticeably against the previous
performance on the older platform if we're doing our homework...

> Most of the academic papers I've read on
> implementing lock-free or highly-parallel constructs attempt to
> confine themselves to 8-byte operations with 8-byte compare-and-swap,
> and I'm a bit disposed to think we ought to try to hew to that as
> well.  I'm not dead set against going further, but I lean against it,
> for all of the reasons mentioned above.

I think there are quite some algorithms relying on 16byte CAS, that's
why I was thinking about it at all. I think it's easier to add support
for it in the easier trawl through the compilers, but I won't argue much
for it otherwise for now.

> > But I actually think this is going to be a manual thing, atomic 8byte
> > math will fall back to kernel emulation on several targets, so we
> > probably want some defines to explicitly declare it supported on targets
> > where that's not the case.
> 
> Hmm, OK.  I had not come across such cases.

E.g. ARM/linux which we probably cannot ignore.

> Places where it works
> using really slow kernel emulation are yet another category to worry
> about.  Unfortunately, I have not found any good source that describes
> which architectures fall into which category.  Without that, pulling
> this together seems intimidating, unless we just declare it working
> for x86_64, disable it everywhere else, and wait for people running on
> other architectures to complain.

Yes, I think whitelisting targs is a sensible approach here. I am pretty
sure we can do better than just x86-64, but that's easily doable in an
incremental fashion.

> I wonder whether it'd be safe to assume that any machine where
> pointers are 8 bytes has 8-byte atomic loads and stores.  I bet there
> is a counterexample somewhere.  :-(

Sparc64 :(.

Btw, could you quickly give some keywords what you're thinking about
making lockless?
I mainly am thinking about lwlocks and buffer pins so far. Nothing
really ambitious.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics

From
Robert Haas
Date:
On Mon, Oct 28, 2013 at 3:32 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I wonder whether it'd be safe to assume that any machine where
>> pointers are 8 bytes has 8-byte atomic loads and stores.  I bet there
>> is a counterexample somewhere.  :-(
>
> Sparc64 :(.
>
> Btw, could you quickly give some keywords what you're thinking about
> making lockless?
> I mainly am thinking about lwlocks and buffer pins so far. Nothing
> really ambitious.

Well, I was going to use it for some code I'm working on for
parallelism, but I just tested the overhead of a spinlock, and it was
zero, possibly negative.  So I may not have an immediate application.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics

From
Heikki Linnakangas
Date:
On 28.10.2013 21:32, Andres Freund wrote:
> On 2013-10-28 15:02:41 -0400, Robert Haas wrote:
>> Most of the academic papers I've read on
>> implementing lock-free or highly-parallel constructs attempt to
>> confine themselves to 8-byte operations with 8-byte compare-and-swap,
>> and I'm a bit disposed to think we ought to try to hew to that as
>> well.  I'm not dead set against going further, but I lean against it,
>> for all of the reasons mentioned above.
>
> I think there are quite some algorithms relying on 16byte CAS, that's
> why I was thinking about it at all. I think it's easier to add support
> for it in the easier trawl through the compilers, but I won't argue much
> for it otherwise for now.

Many algorithms require a 2*(pointer width) CAS instruction. On 64-bit 
platforms that's 16 bytes, but on 32-bit platforms an 8 byte version 
will suffice.

- Heikki



Re: better atomics

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 28.10.2013 21:32, Andres Freund wrote:
>> I think there are quite some algorithms relying on 16byte CAS, that's
>> why I was thinking about it at all. I think it's easier to add support
>> for it in the easier trawl through the compilers, but I won't argue much
>> for it otherwise for now.

> Many algorithms require a 2*(pointer width) CAS instruction. On 64-bit 
> platforms that's 16 bytes, but on 32-bit platforms an 8 byte version 
> will suffice.

You're both just handwaving.  How many is "many", and which ones might
we actually have enough use for to justify dealing with such a dependency?
I don't think we should buy into this without some pretty concrete
justification.
        regards, tom lane



Re: better atomics

From
Andres Freund
Date:
On 2013-10-28 16:06:47 -0400, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> > On 28.10.2013 21:32, Andres Freund wrote:
> >> I think there are quite some algorithms relying on 16byte CAS, that's
> >> why I was thinking about it at all. I think it's easier to add support
> >> for it in the easier trawl through the compilers, but I won't argue much
> >> for it otherwise for now.
> 
> > Many algorithms require a 2*(pointer width) CAS instruction. On 64-bit 
> > platforms that's 16 bytes, but on 32-bit platforms an 8 byte version 
> > will suffice.
> 
> You're both just handwaving.  How many is "many", and which ones might
> we actually have enough use for to justify dealing with such a dependency?
> I don't think we should buy into this without some pretty concrete
> justification.

Well, I don't think either of us is suggesting to make it required. But
it's going to be painful to go through the list of compilers repeatedly
instead of just adding all operations at one. I am willing to do that
for several compilers once, but I don't want to do it in each and every
feature submission using another atomic op.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-10-28 16:06:47 -0400, Tom Lane wrote:
>> You're both just handwaving.  How many is "many", and which ones might
>> we actually have enough use for to justify dealing with such a dependency?
>> I don't think we should buy into this without some pretty concrete
>> justification.

> Well, I don't think either of us is suggesting to make it required. But
> it's going to be painful to go through the list of compilers repeatedly
> instead of just adding all operations at one. I am willing to do that
> for several compilers once, but I don't want to do it in each and every
> feature submission using another atomic op.

Basically I'm arguing that that choice is backwards.  We should decide
first what the list of supported atomics is going to be, and each and
every element of that list has to have a convincing concrete argument
why we need to support it.  Not "we might want this someday".  Because
when someday comes, that's when we'd be paying the price of finding out
which platforms actually support the primitive correctly and with what
performance.  Or if someday never comes, we're not ahead either.

Or if you like: no matter what you do, the first feature submission
that makes use of a given atomic op is going to suffer pain.  I don't
want to still be suffering that pain two or three years out.  The shorter
the list of allowed atomic ops, the sooner we'll be done with debugging
them.
        regards, tom lane



Re: better atomics

From
Andres Freund
Date:
On 2013-10-28 16:29:35 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2013-10-28 16:06:47 -0400, Tom Lane wrote:
> >> You're both just handwaving.  How many is "many", and which ones might
> >> we actually have enough use for to justify dealing with such a dependency?
> >> I don't think we should buy into this without some pretty concrete
> >> justification.
> 
> > Well, I don't think either of us is suggesting to make it required. But
> > it's going to be painful to go through the list of compilers repeatedly
> > instead of just adding all operations at one. I am willing to do that
> > for several compilers once, but I don't want to do it in each and every
> > feature submission using another atomic op.
> 
> Basically I'm arguing that that choice is backwards.  We should decide
> first what the list of supported atomics is going to be, and each and
> every element of that list has to have a convincing concrete argument
> why we need to support it.

The list I have previously suggested was:

* pg_atomic_load_u32(uint32 *)
* uint32 pg_atomic_store_u32(uint32 *)

To be able to write code without spreading volatiles all over while
making it very clear that that part of the code is worthy of attention.

* uint32 pg_atomic_exchange_u32(uint32 *ptr, uint32 val)
* bool pg_atomic_compare_exchange_u32(uint32 *ptr, uint32 *expected, uint32 newval)

Used in most lockfree algorithms, can be used to provide fallbacks for
when any of the other 32bit operations is not implemented for a platform
(so it's easier to bring up a platform).
E.g. used in in the wait-free LW_SHARED patch.

* uint32 pg_atomic_fetch_add_u32(uint32 *ptr, uint32 add)

Plain math.
E.g. used in in the wait-free LW_SHARED patch.

* uint32 pg_atomic_fetch_and_u32(uint32 *ptr, uint32 add)
* uint32 pg_atomic_fetch_or_u32(uint32 *ptr, uint32 add)

Useful for (un-)setting flags atomically. Needed for (my approach to)
spinlock-less Pin/UnpinBuffer.

* u64 variants of the above

lockfree list manipulation need at least pointer width operations of at
least compare_exchange().

I *personally* don't have a case for 8byte math yet, but I am pretty
sure parallel sort might be interested in it.

* bool pg_atomic_test_set(void *ptr)
* void pg_atomic_clear(void *ptr)

Can be used as the implementation for spinlocks based on compiler
intrinsics if no native implementation is existing. Makes it easier to
bring up new platforms.

Wrappers around the above:
* uint32 pg_atomic_fetch_sub_u32(uint32 *ptr, uint32 add)
Generic wrapper that imo makes the code easier to read. No
per-platform/compiler overhead.

* pg_atomic_add_until_u32(uint32 *ptr, uint32 val, uint32 limit)
Useful for managing refcounts like pin counts.

* pg_atomic_(add|sub|and|or)_fetch_u32()
Generic wrappers for more legible code.  No
per-platform/compiler overhead.

* pg_atomic_compare_and_swap_128()

Many algorithms are faster (e.g. some lockless hashtables, which'd be
great for the buffer mapping) when a double-pointer-width CAS is
available, but also work with an pointer-width CAS in a less efficient
manner.

> Or if you like: no matter what you do, the first feature submission
> that makes use of a given atomic op is going to suffer pain.  I don't
> want to still be suffering that pain two or three years out.  The shorter
> the list of allowed atomic ops, the sooner we'll be done with debugging
> them.

Yea. As, partially, even evidenced by this discussion ;).

Which would you like to see go?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics

From
Andres Freund
Date:
Hi,

(Coming back to this now)

On 2013-10-28 21:55:22 +0100, Andres Freund wrote:
> The list I have previously suggested was:
> 
> * pg_atomic_load_u32(uint32 *)
> * uint32 pg_atomic_store_u32(uint32 *)
> 
> To be able to write code without spreading volatiles all over while
> making it very clear that that part of the code is worthy of attention.
> 
> * uint32 pg_atomic_exchange_u32(uint32 *ptr, uint32 val)
> * bool pg_atomic_compare_exchange_u32(uint32 *ptr, uint32 *expected, uint32 newval)

So what I've previously proposed did use plain types for the
to-be-manipulated atomic integers. I am not sure about that anymore
though, C11 includes support for atomic operations, but it assumes that
the to-be-manipulated variable is of a special "atomic" type. While I am
not propose that we try to emulate C11's API completely (basically
impossible as it uses type agnostic functions, it also exposes too
much), it seems to make sense to allow PG's atomics to be implemented by
C11's.

The API is described at: http://en.cppreference.com/w/c/atomic

The fundamental difference to what I've suggested so far is that the
atomic variable isn't a plain integer/type but a distinct datatype that
can *only* be manipulated via the atomic operators. That has the nice
property that it cannot be accidentally manipulated via plain operators.

E.g. it wouldn't be
uint32 pg_atomic_fetch_add_u32(uint32 *ptr, uint32 add_);
but
uint32 pg_atomic_fetch_add_u32(pg_atomic_uint32 *ptr, uint32 add_);

where, for now, atomic_uint32 is something like:
typedef struct pg_atomic_uint32
{volatile uint32 value;
} pg_atomic_uint32;
a future C11 implementation would just typedef C11's respective atomic
type to pg_atomic_uint32.

Comments?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics

From
Robert Haas
Date:
On Tue, Nov 5, 2013 at 10:51 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> Hi,
>
> (Coming back to this now)
>
> On 2013-10-28 21:55:22 +0100, Andres Freund wrote:
>> The list I have previously suggested was:
>>
>> * pg_atomic_load_u32(uint32 *)
>> * uint32 pg_atomic_store_u32(uint32 *)
>>
>> To be able to write code without spreading volatiles all over while
>> making it very clear that that part of the code is worthy of attention.
>>
>> * uint32 pg_atomic_exchange_u32(uint32 *ptr, uint32 val)
>> * bool pg_atomic_compare_exchange_u32(uint32 *ptr, uint32 *expected, uint32 newval)
>
> So what I've previously proposed did use plain types for the
> to-be-manipulated atomic integers. I am not sure about that anymore
> though, C11 includes support for atomic operations, but it assumes that
> the to-be-manipulated variable is of a special "atomic" type. While I am
> not propose that we try to emulate C11's API completely (basically
> impossible as it uses type agnostic functions, it also exposes too
> much), it seems to make sense to allow PG's atomics to be implemented by
> C11's.
>
> The API is described at: http://en.cppreference.com/w/c/atomic
>
> The fundamental difference to what I've suggested so far is that the
> atomic variable isn't a plain integer/type but a distinct datatype that
> can *only* be manipulated via the atomic operators. That has the nice
> property that it cannot be accidentally manipulated via plain operators.
>
> E.g. it wouldn't be
> uint32 pg_atomic_fetch_add_u32(uint32 *ptr, uint32 add_);
> but
> uint32 pg_atomic_fetch_add_u32(pg_atomic_uint32 *ptr, uint32 add_);
>
> where, for now, atomic_uint32 is something like:
> typedef struct pg_atomic_uint32
> {
>         volatile uint32 value;
> } pg_atomic_uint32;
> a future C11 implementation would just typedef C11's respective atomic
> type to pg_atomic_uint32.
>
> Comments?

Sadly, I don't have a clear feeling for how well or poorly that's
going to work out.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics

From
Ants Aasma
Date:
On Tue, Nov 5, 2013 at 5:51 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> So what I've previously proposed did use plain types for the
> to-be-manipulated atomic integers. I am not sure about that anymore
> though, C11 includes support for atomic operations, but it assumes that
> the to-be-manipulated variable is of a special "atomic" type. While I am
> not propose that we try to emulate C11's API completely (basically
> impossible as it uses type agnostic functions, it also exposes too
> much), it seems to make sense to allow PG's atomics to be implemented by
> C11's.
>
> The API is described at: http://en.cppreference.com/w/c/atomic
>
> The fundamental difference to what I've suggested so far is that the
> atomic variable isn't a plain integer/type but a distinct datatype that
> can *only* be manipulated via the atomic operators. That has the nice
> property that it cannot be accidentally manipulated via plain operators.
>
> E.g. it wouldn't be
> uint32 pg_atomic_fetch_add_u32(uint32 *ptr, uint32 add_);
> but
> uint32 pg_atomic_fetch_add_u32(pg_atomic_uint32 *ptr, uint32 add_);
>
> where, for now, atomic_uint32 is something like:
> typedef struct pg_atomic_uint32
> {
>         volatile uint32 value;
> } pg_atomic_uint32;
> a future C11 implementation would just typedef C11's respective atomic
> type to pg_atomic_uint32.
>
> Comments?

C11 atomics need to be initialized through atomic_init(), so a simple
typedef will not be correct here. Also, C11 atomics are designed to
have the compiler take care of memory barriers - so they might not
always be a perfect match to our API's, or any API implementable
without compiler support.

However I'm mildly supportive on having a separate type for variables
accessed by atomics. It can result in some unnecessary code churn, but
in general if an atomic access to a variable is added, then all other
accesses to it need to be audited for memory barriers, even if they
were previously correctly synchronized by a lock.

I guess the best approach for deciding would be to try to convert a
couple of the existing unlocked accesses to the API and see what the
patch looks like.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: better atomics

From
Andres Freund
Date:
On 2013-11-06 16:48:17 +0200, Ants Aasma wrote:
> C11 atomics need to be initialized through atomic_init(), so a simple
> typedef will not be correct here. Also, C11 atomics are designed to
> have the compiler take care of memory barriers - so they might not
> always be a perfect match to our API's, or any API implementable
> without compiler support.

Yea, I think we'll always need to have our layer above C11 atomics, but
it still will be useful to allow them to be used to implement our
atomics.

FWIW, I find the requirement for atomic_init() really, really
annoying. Not that that will change anything ;)

> However I'm mildly supportive on having a separate type for variables
> accessed by atomics. It can result in some unnecessary code churn, but
> in general if an atomic access to a variable is added, then all other
> accesses to it need to be audited for memory barriers, even if they
> were previously correctly synchronized by a lock.

Ok, that's what I am writing right now.

> I guess the best approach for deciding would be to try to convert a
> couple of the existing unlocked accesses to the API and see what the
> patch looks like.

I don't think there really exist any interesting ones? I am using my
lwlock reimplementation as a testbed so far.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics

From
Ants Aasma
Date:
On Wed, Nov 6, 2013 at 4:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> FWIW, I find the requirement for atomic_init() really, really
> annoying. Not that that will change anything ;)

Me too, but I guess this allows them to emulate atomics on platforms
that don't have native support. Whether that is a good idea is a
separate question.

>> However I'm mildly supportive on having a separate type for variables
>> accessed by atomics. It can result in some unnecessary code churn, but
>> in general if an atomic access to a variable is added, then all other
>> accesses to it need to be audited for memory barriers, even if they
>> were previously correctly synchronized by a lock.
>
> Ok, that's what I am writing right now.
>
>> I guess the best approach for deciding would be to try to convert a
>> couple of the existing unlocked accesses to the API and see what the
>> patch looks like.
>
> I don't think there really exist any interesting ones? I am using my
> lwlock reimplementation as a testbed so far.

BufferDesc management in src/backend/storage/buffer/bufmgr.c.
The seqlock like thing used to provide consistent reads from
PgBackendStatus src/backend/postmaster/pgstat.c (this code looks like
it's broken on weakly ordered machines)
The unlocked reads that are done from various procarray variables.
The XLogCtl accesses in xlog.c.

IMO the volatile keyword should be confined to the code actually
implementing atomics, locking. The "use volatile pointer to prevent
code rearrangement" comment is evil. If there are data races in the
code, they need comments pointing this out and probably memory
barriers too. If there are no data races, then the volatile
declaration is useless and a pessimization. Currently it's more of a
catch all "here be dragons" declaration for data structures.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de



Re: better atomics

From
Andres Freund
Date:
On 2013-11-06 08:15:59 -0500, Robert Haas wrote:
> > The API is described at: http://en.cppreference.com/w/c/atomic
> >
> > The fundamental difference to what I've suggested so far is that the
> > atomic variable isn't a plain integer/type but a distinct datatype that
> > can *only* be manipulated via the atomic operators. That has the nice
> > property that it cannot be accidentally manipulated via plain operators.
> >
> > E.g. it wouldn't be
> > uint32 pg_atomic_fetch_add_u32(uint32 *ptr, uint32 add_);
> > but
> > uint32 pg_atomic_fetch_add_u32(pg_atomic_uint32 *ptr, uint32 add_);
> >
> > where, for now, atomic_uint32 is something like:
> > typedef struct pg_atomic_uint32
> > {
> >         volatile uint32 value;
> > } pg_atomic_uint32;
> > a future C11 implementation would just typedef C11's respective atomic
> > type to pg_atomic_uint32.
> >
> > Comments?
>
> Sadly, I don't have a clear feeling for how well or poorly that's
> going to work out.

I've implemented it that way and it imo looks sensible. I dislike the
pg_atomic_init_(flag|u32|u64) calls that are forced on us by wanting to
be compatible with C11, but I think that doesn't end up being too bad.

So, attached is a very first draft of an atomics implementation. There's
lots to be done, but this is how I think it should roughly look like and
what things we should implement in the beginning.
The API should be understandable from looking at
src/include/storage/atomics.h

I haven't tested the IBM xlc, HPUX IA64 acc, Sunpro fallback at all, but
the important part is that those provide the necessary tools to
implement everything. Also, even the gcc implementation falls back to
compare_and_swap() based implementations for the operations, but that
shouldn't matter for now. I haven't looked for other compilers yet.

Questions:
* Fundamental issues with the API?
* should we split of compiler/arch specific parts of into include/ports
  or such? -0.5 from me.
* should we add src/include/storage/atomics subdirectory? +0.5 from me.
* Do we really want to continue to cater to compilers not supporting
  PG_USE_INLINE? I've tried to add support for them, but it does make
  things considerably more #ifdef-y.
  Afaik it's only HPUX's acc that has problems, and it seems to work but
  generate warnings, so maybe that's ok?
* Which additional compilers do we want to support? icc (might be gcc
  compatible)?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: better atomics

From
Andres Freund
Date:
On 2013-11-06 17:47:45 +0200, Ants Aasma wrote:
> On Wed, Nov 6, 2013 at 4:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > FWIW, I find the requirement for atomic_init() really, really
> > annoying. Not that that will change anything ;)
> 
> Me too, but I guess this allows them to emulate atomics on platforms
> that don't have native support. Whether that is a good idea is a
> separate question.

Since static initialization is supported that would require quite some
dirty hacks, but well, we're talking about compiler makers ;)

> > I don't think there really exist any interesting ones? I am using my
> > lwlock reimplementation as a testbed so far.
> 
> BufferDesc management in src/backend/storage/buffer/bufmgr.c.
> The unlocked reads that are done from various procarray variables.
> The XLogCtl accesses in xlog.c.

Those are actually only modified under spinlocks afair, so there isn't
too much interesting happening. But yes, it'd be better if we used more
explicit means to manipulate/query them.

> The seqlock like thing used to provide consistent reads from
> PgBackendStatus src/backend/postmaster/pgstat.c (this code looks like
> it's broken on weakly ordered machines)

Whoa. Never noticed that bit. The consequences of races are quite low,
but still...

> IMO the volatile keyword should be confined to the code actually
> implementing atomics, locking. The "use volatile pointer to prevent
> code rearrangement" comment is evil. If there are data races in the
> code, they need comments pointing this out and probably memory
> barriers too. If there are no data races, then the volatile
> declaration is useless and a pessimization. Currently it's more of a
> catch all "here be dragons" declaration for data structures.

Agreed. But I don't think we can replace them all at once. Or well, I
won't ;)

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - require PG_USE_INLINE support?

From
Andres Freund
Date:
On 2013-11-06 17:24:57 +0100, Andres Freund wrote:
> Questions:
> * Do we really want to continue to cater to compilers not supporting
>   PG_USE_INLINE? I've tried to add support for them, but it does make
>   things considerably more #ifdef-y.
>   Afaik it's only HPUX's acc that has problems, and it seems to work but
>   generate warnings, so maybe that's ok?

Maybe we can simply silence that specific warning for HPUX when using
aC++ like in the attached patch? It's not like somebody really looks at
those anyway given how bleaty it is.
Or we can just generally remove the "quiet inline" check.

I haven't tested the patch since I don't have a HPUX machine but that's the option to use according to
http://h21007.www2.hp.com/portal/download/files/unprot/aCxx/Online_Help/options.htm#opt+Wargs

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: better atomics - spinlock fallback?

From
Andres Freund
Date:
Hi,

Instead of de-supporting platforms that don't have CAS support or
providing parallel implementations we could relatively easily build a
spinlock based fallback using the already existing requirement for
tas().
Something like an array of 16 spinlocks, indexed by a more advanced
version of ((char *)(&atomics) >> sizeof(char *)) % 16. The platforms
that would fallback aren't that likely to be used under heavy
concurrency, so the price for that shouldn't be too high.

The only real problem with that would be that we'd need to remove the
spinnlock fallback for barriers, but that seems to be pretty much
disliked.

Thoughts?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - spinlock fallback?

From
Robert Haas
Date:
On Mon, Nov 11, 2013 at 1:57 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Instead of de-supporting platforms that don't have CAS support or
> providing parallel implementations we could relatively easily build a
> spinlock based fallback using the already existing requirement for
> tas().
> Something like an array of 16 spinlocks, indexed by a more advanced
> version of ((char *)(&atomics) >> sizeof(char *)) % 16. The platforms
> that would fallback aren't that likely to be used under heavy
> concurrency, so the price for that shouldn't be too high.
>
> The only real problem with that would be that we'd need to remove the
> spinnlock fallback for barriers, but that seems to be pretty much
> disliked.

I think this is worth considering.  I'm not too clear what to do about
the barriers problem, though.  I feel like we've dug ourselves into a
bit of a hole, there, and I'm not sure I understand the issues well
enough to dig us back out of it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - spinlock fallback?

From
Andres Freund
Date:
On 2013-11-12 13:21:30 -0500, Robert Haas wrote:
> > The only real problem with that would be that we'd need to remove the
> > spinnlock fallback for barriers, but that seems to be pretty much
> > disliked.
> 
> I think this is worth considering.

Ok, cool. The prototype patch I have for that is pretty small, so it doesn't
look too bad.

What currently scares me is the amount of code I have to write that I can't
test... I really can't see me being able to provide a patch that doesn't
require some buildfarm cycles to really work on all platforms.

>  I'm not too clear what to do about
> the barriers problem, though.  I feel like we've dug ourselves into a
> bit of a hole, there, and I'm not sure I understand the issues well
> enough to dig us back out of it.

I think any platform where we aren't able to provide a proper compiler/memory
barrier will also have broken spinlock relase semantics (as in missing release
memory barrier). So arguably removing the fallback is a good idea anyway.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.2

From
Andres Freund
Date:
Hi,

This is the first real submission of how I think our atomics
architecture should look like. It's definitely a good bit from complete,
please view it from that POV!
The most important thing I need at this point is feedback whether the
API in general looks ok and what should be changed, before I spend the
time to implement it nicely everywhere.

The API is loosely modeled to match C11's atomics
http://en.cppreference.com/w/c/atomic library, at least to the degree
that the latter could be used to implement pg's atomics.

Please note:
* Only gcc's implementation is tested. I'd be really surprised if it
  compiled on anything else.
* I've currently replaced nearly all of s_lock.h to use parts of the
  atomics API. We might rather use it's contents to implement atomics
  'flags' on some of the wierder platforms out there.
* I've integrated barrier.h into the atomics code. It's pretty much a
  required part, and compiler/arch specific so that seems to make sense.
* 'autoreconf' needs to be run after applying the patchset. I got some
  problems with running autoconf 2.63 here, using 2.69 makes the diff to
  big.

Questions:
* What do you like/dislike about the API (storage/atomics.h)
* decide whether it's ok to rely on inline functions or whether we need
  to provide out-of-line versions. I think we should just bite the
  bullet and require support. Some very old arches might need to live with
  warnings. Patch 01 tries to silence them on HP-UX.
* decide whether we want to keep the semaphore fallback for
  spinlocks. I strongly vote for removing it. The patchset provides
  compiler based default implementations for atomics, so it's unlikely
  to be helpful when bringing up a new platform.

TODO:
* actually test on other platforms that new gcc
* We should probably move most of the (documentation) content of
  s_lock.h somewhere else.
* Integrate the spinlocks based fallback for platforms that only provide
  something like TAS().

Comments?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: better atomics - v0.2

From
Robert Haas
Date:
On Fri, Nov 15, 2013 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Questions:
> * What do you like/dislike about the API (storage/atomics.h)
> * decide whether it's ok to rely on inline functions or whether we need
>   to provide out-of-line versions. I think we should just bite the
>   bullet and require support. Some very old arches might need to live with
>   warnings. Patch 01 tries to silence them on HP-UX.
> * decide whether we want to keep the semaphore fallback for
>   spinlocks. I strongly vote for removing it. The patchset provides
>   compiler based default implementations for atomics, so it's unlikely
>   to be helpful when bringing up a new platform.

On point #2, I don't personally know of any systems that I care about
where inlining isn't supported.  However, we've gone to quite a bit of
trouble relatively recently to keep things working for platforms where
that is the case, so I feel the need for an obligatory disclaimer: I
will not commit any patch that moves the wood in that area unless
there is massive consensus that this is an acceptable way forward.
Similarly for point #3: Heikki not long ago went to the trouble of
unbreaking --disable-spinlocks, which suggests that there is at least
some interest in continuing to be able to run in that mode.  I'm
perfectly OK with the decision that we don't care about that any more,
but I will not be the one to make that decision, and I think it
requires a greater-than-normal level of consensus.

On point #1, I dunno.  It looks like a lot of rearrangement to me, and
I'm not really sure what the final form of it is intended to be.  I
think it desperately needs a README explaining what the point of all
this is and how to add support for a new platform or compiler if yours
doesn't work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.2

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On point #2, I don't personally know of any systems that I care about
> where inlining isn't supported.  However, we've gone to quite a bit of
> trouble relatively recently to keep things working for platforms where
> that is the case, so I feel the need for an obligatory disclaimer: I
> will not commit any patch that moves the wood in that area unless
> there is massive consensus that this is an acceptable way forward.
> Similarly for point #3: Heikki not long ago went to the trouble of
> unbreaking --disable-spinlocks, which suggests that there is at least
> some interest in continuing to be able to run in that mode.  I'm
> perfectly OK with the decision that we don't care about that any more,
> but I will not be the one to make that decision, and I think it
> requires a greater-than-normal level of consensus.

Hm.  Now that I think about it, isn't Peter proposing to break systems
without working "inline" over here?
http://www.postgresql.org/message-id/1384257026.8059.5.camel@vanquo.pezone.net

Maybe that patch needs a bit more discussion.  I tend to agree that
significantly moving our portability goalposts shouldn't be undertaken
lightly.  On the other hand, it is 2013, and so it seems not unreasonable
to reconsider choices last made more than a dozen years ago.

Here's an idea that might serve as a useful touchstone for making choices:
compiler inadequacies are easier to fix than hardware/OS inadequacies.
Even a dozen years ago, people could get gcc running on any platform of
interest, and I seriously doubt that's less true today than back then.
So if anyone complains "my compiler doesn't support 'const'", we could
reasonably reply "so install gcc".  On the other hand, if the complaint is
"my hardware doesn't support 64-bit CAS", it's not reasonable to tell them
to buy a new server.
        regards, tom lane



Re: better atomics - v0.2

From
Peter Eisentraut
Date:
On 11/19/13, 9:57 AM, Tom Lane wrote:
> Hm.  Now that I think about it, isn't Peter proposing to break systems
> without working "inline" over here?
> http://www.postgresql.org/message-id/1384257026.8059.5.camel@vanquo.pezone.net

No, that's about const, volatile, #, and memcmp.

I don't have an informed opinion about requiring inline support
(although it would surely be nice).




Re: better atomics - v0.2

From
Andres Freund
Date:
On 2013-11-19 09:57:20 -0500, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On point #2, I don't personally know of any systems that I care about
> > where inlining isn't supported.  However, we've gone to quite a bit of
> > trouble relatively recently to keep things working for platforms where
> > that is the case, so I feel the need for an obligatory disclaimer: I
> > will not commit any patch that moves the wood in that area unless
> > there is massive consensus that this is an acceptable way forward.
> > Similarly for point #3: Heikki not long ago went to the trouble of
> > unbreaking --disable-spinlocks, which suggests that there is at least
> > some interest in continuing to be able to run in that mode.  I'm
> > perfectly OK with the decision that we don't care about that any more,
> > but I will not be the one to make that decision, and I think it
> > requires a greater-than-normal level of consensus.
> 
> Hm.  Now that I think about it, isn't Peter proposing to break systems
> without working "inline" over here?
> http://www.postgresql.org/message-id/1384257026.8059.5.camel@vanquo.pezone.net

Hm. Those don't directly seem to affect our inline tests. Our test is
PGAC_C_INLINE which itself uses AC_C_INLINE and then tests that
unreferenced inlines don't generate warnings.

> Maybe that patch needs a bit more discussion.  I tend to agree that
> significantly moving our portability goalposts shouldn't be undertaken
> lightly.  On the other hand, it is 2013, and so it seems not unreasonable
> to reconsider choices last made more than a dozen years ago.

I don't think there's even platforms out there that don't support const
inlines, just some that unnecessarily generate warnings. But since
builds on those platforms are already littered with warnings, that
doesn't seem something we need to majorly be concerned with.

The only animal we have that doesn't support quiet inlines today is
HP-UX/ac++, and I think - as in patch 1 in the series - we might be able
to simply suppress the warning there.

> Here's an idea that might serve as a useful touchstone for making choices:
> compiler inadequacies are easier to fix than hardware/OS inadequacies.
> Even a dozen years ago, people could get gcc running on any platform of
> interest, and I seriously doubt that's less true today than back then.
> So if anyone complains "my compiler doesn't support 'const'", we could
> reasonably reply "so install gcc".

Agreed.

> On the other hand, if the complaint is
> "my hardware doesn't support 64-bit CAS", it's not reasonable to tell them
> to buy a new server.

Agreed. I've am even wondering about 32bit CAS since it's not actually
that hard to emulate it using spinlocks. Certainly less work than
arguing about removing stone-age platforms ;)

There's no fundamental issue with continuing to support semaphore based
spinlocks I am just wondering about the point of supporting that
configuration since it will always yield horrible performance.

The only fundamental thing that I don't immediately see how we can
support is the spinlock based memory barrier since that introduces a
circularity (atomics need barrier, barrier needs spinlocks, spinlock
needs atomics).

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.2

From
Robert Haas
Date:
On Tue, Nov 19, 2013 at 9:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On point #2, I don't personally know of any systems that I care about
>> where inlining isn't supported.  However, we've gone to quite a bit of
>> trouble relatively recently to keep things working for platforms where
>> that is the case, so I feel the need for an obligatory disclaimer: I
>> will not commit any patch that moves the wood in that area unless
>> there is massive consensus that this is an acceptable way forward.
>> Similarly for point #3: Heikki not long ago went to the trouble of
>> unbreaking --disable-spinlocks, which suggests that there is at least
>> some interest in continuing to be able to run in that mode.  I'm
>> perfectly OK with the decision that we don't care about that any more,
>> but I will not be the one to make that decision, and I think it
>> requires a greater-than-normal level of consensus.
>
> Hm.  Now that I think about it, isn't Peter proposing to break systems
> without working "inline" over here?
> http://www.postgresql.org/message-id/1384257026.8059.5.camel@vanquo.pezone.net
>
> Maybe that patch needs a bit more discussion.  I tend to agree that
> significantly moving our portability goalposts shouldn't be undertaken
> lightly.  On the other hand, it is 2013, and so it seems not unreasonable
> to reconsider choices last made more than a dozen years ago.

Sure.  inline isn't C89, and to date, we've been quite rigorous about
enforcing C89-compliance, so I'm going to try not to be the guy that
casually breaks that.  On the other hand, I haven't personally seen
any systems that don't have inline in a very long time.  The first
systems I did development on were not even C89; they were K&R, and you
had to use old-style function declarations.  It's probably safe to say
that they didn't support inline, either, though I don't recall trying
it.  But the last time I used a system like that was probably in the
early nineties, and it's been a long time since then.  However, I
upgrade my hardware about every 2 years, so I'm not the best guy to
say what the oldest stuff people still care about is.

> Here's an idea that might serve as a useful touchstone for making choices:
> compiler inadequacies are easier to fix than hardware/OS inadequacies.
> Even a dozen years ago, people could get gcc running on any platform of
> interest, and I seriously doubt that's less true today than back then.
> So if anyone complains "my compiler doesn't support 'const'", we could
> reasonably reply "so install gcc".  On the other hand, if the complaint is
> "my hardware doesn't support 64-bit CAS", it's not reasonable to tell them
> to buy a new server.

I generally agree with that principle, but I'd offer the caveat that
some people do still have valid reasons for wanting to use non-GCC
compilers.  I have been told that HP's aCC produces better results on
Itanium than gcc, for example.  Still, I think we could probably make
a list of no more than half-a-dozen compilers that we really care
about - the popular open source ones (is there anything we need to
care about other than gcc and clang?) and the ones supported by large
vendors (IBM's ICC and HP's aCC being the ones that I know about) and
adopt features that are supported by everything on that list.

Another point is that, some releases ago, we began requiring working
64-bit arithmetic as per commit
d15cb38dec01fcc8d882706a3fa493517597c76b.  We might want to go through
our list of nominally-supported platforms and see whether any of them
would fail to compile for that reason.  Based on the commit message
for the aforementioned commit, such platforms haven't been viable for
PostgreSQL since 8.3, so we're not losing anything by canning them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.2

From
Robert Haas
Date:
On Tue, Nov 19, 2013 at 10:16 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On the other hand, if the complaint is
>> "my hardware doesn't support 64-bit CAS", it's not reasonable to tell them
>> to buy a new server.
>
> Agreed. I've am even wondering about 32bit CAS since it's not actually
> that hard to emulate it using spinlocks. Certainly less work than
> arguing about removing stone-age platforms ;)
>
> There's no fundamental issue with continuing to support semaphore based
> spinlocks I am just wondering about the point of supporting that
> configuration since it will always yield horrible performance.
>
> The only fundamental thing that I don't immediately see how we can
> support is the spinlock based memory barrier since that introduces a
> circularity (atomics need barrier, barrier needs spinlocks, spinlock
> needs atomics).

We've been pretty much assuming for a long time that calling a
function in another translation unit acts as a compiler barrier.
There's a lot of code that isn't actually safe against global
optimization; we assume, for example, that memory accesses can't move
over an LWLockAcquire(), but that's just using spinlocks internally,
and those aren't guaranteed to be compiler barriers, per previous
discussion.  So one idea for a compiler barrier is just to define a
function call pg_compiler_barrier() in a file by itself, and make that
the fallback implementation.  That will of course fail if someone uses
a globally optimizing compiler, but I think it'd be OK to say that if
you want to do that, you'd better have a real barrier implementation.
Right now, it's probably unsafe regardless.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.2

From
Andres Freund
Date:
On 2013-11-19 09:12:42 -0500, Robert Haas wrote:
> On point #1, I dunno.  It looks like a lot of rearrangement to me, and
> I'm not really sure what the final form of it is intended to be.

Understandable. I am not that sure what parts we want to rearange
either. We very well could leave barrier.h and s_lock.h untouched and
just maintain the atomics stuff in parallel and switch over at some
later point.
I've mainly switched over s_lock.h to a) get some good coverage of the
atomics code b) make sure the api works fine for it. We could add the
atomics.h based implementation as a fallback for the cases in which no
native implementation is available.

> I think it desperately needs a README explaining what the point of all
> this is and how to add support for a new platform or compiler if yours
> doesn't work.

Ok, I'll work on that. It would be useful to get feedback on the
"frontend" API in atomics.h though. If people are unhappy with the API
it might very well change how we can implement the API on different
architectures.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.2

From
Robert Haas
Date:
On Tue, Nov 19, 2013 at 10:26 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-11-19 09:12:42 -0500, Robert Haas wrote:
>> On point #1, I dunno.  It looks like a lot of rearrangement to me, and
>> I'm not really sure what the final form of it is intended to be.
>
> Understandable. I am not that sure what parts we want to rearange
> either. We very well could leave barrier.h and s_lock.h untouched and
> just maintain the atomics stuff in parallel and switch over at some
> later point.
> I've mainly switched over s_lock.h to a) get some good coverage of the
> atomics code b) make sure the api works fine for it. We could add the
> atomics.h based implementation as a fallback for the cases in which no
> native implementation is available.
>
>> I think it desperately needs a README explaining what the point of all
>> this is and how to add support for a new platform or compiler if yours
>> doesn't work.
>
> Ok, I'll work on that. It would be useful to get feedback on the
> "frontend" API in atomics.h though. If people are unhappy with the API
> it might very well change how we can implement the API on different
> architectures.

Sure, but if people can't understand the API, then it's hard to decide
whether to be unhappy with it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.2

From
Tom Lane
Date:
Peter Eisentraut <peter_e@gmx.net> writes:
> On 11/19/13, 9:57 AM, Tom Lane wrote:
>> Hm.  Now that I think about it, isn't Peter proposing to break systems
>> without working "inline" over here?
>> http://www.postgresql.org/message-id/1384257026.8059.5.camel@vanquo.pezone.net

> No, that's about const, volatile, #, and memcmp.

Ah, sorry, not enough caffeine absorbed yet.  Still, we should stop
to think about whether this represents an undesirable move of the
portability goalposts.  The first three of these are certainly
compiler issues, and I personally don't have a problem with blowing
off compilers that still haven't managed to implement all of C89 :-(.
I'm not clear on which systems had the memcmp issue --- do we have
the full story on that?

> I don't have an informed opinion about requiring inline support
> (although it would surely be nice).

inline is C99, and we've generally resisted requiring C99 features.
Maybe it's time to move that goalpost, and maybe not.
        regards, tom lane



Re: better atomics - v0.2

From
Andres Freund
Date:
On 2013-11-19 10:23:57 -0500, Robert Haas wrote:
> > The only fundamental thing that I don't immediately see how we can
> > support is the spinlock based memory barrier since that introduces a
> > circularity (atomics need barrier, barrier needs spinlocks, spinlock
> > needs atomics).
>
> We've been pretty much assuming for a long time that calling a
> function in another translation unit acts as a compiler barrier.
> There's a lot of code that isn't actually safe against global
> optimization; we assume, for example, that memory accesses can't move
> over an LWLockAcquire(), but that's just using spinlocks internally,
> and those aren't guaranteed to be compiler barriers, per previous
> discussion.  So one idea for a compiler barrier is just to define a
> function call pg_compiler_barrier() in a file by itself, and make that
> the fallback implementation.  That will of course fail if someone uses
> a globally optimizing compiler, but I think it'd be OK to say that if
> you want to do that, you'd better have a real barrier implementation.

That works for compiler, but not for memory barriers :/

> Right now, it's probably unsafe regardless.

Yes, I have pretty little trust in the current state. Both from the
infrastructure perspective (spinlocks, barriers) as from individual
pieces of code. To a good part we're probably primarily protected by
x86's black magic and the fact that everyone with sufficient concurrency
to see problems uses x86.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.2

From
Robert Haas
Date:
On Tue, Nov 19, 2013 at 10:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-11-19 10:23:57 -0500, Robert Haas wrote:
>> > The only fundamental thing that I don't immediately see how we can
>> > support is the spinlock based memory barrier since that introduces a
>> > circularity (atomics need barrier, barrier needs spinlocks, spinlock
>> > needs atomics).
>>
>> We've been pretty much assuming for a long time that calling a
>> function in another translation unit acts as a compiler barrier.
>> There's a lot of code that isn't actually safe against global
>> optimization; we assume, for example, that memory accesses can't move
>> over an LWLockAcquire(), but that's just using spinlocks internally,
>> and those aren't guaranteed to be compiler barriers, per previous
>> discussion.  So one idea for a compiler barrier is just to define a
>> function call pg_compiler_barrier() in a file by itself, and make that
>> the fallback implementation.  That will of course fail if someone uses
>> a globally optimizing compiler, but I think it'd be OK to say that if
>> you want to do that, you'd better have a real barrier implementation.
>
> That works for compiler, but not for memory barriers :/

True, but we already assume that a spinlock is a memory barrier minus
a compiler barrier.  So if you have a working compiler barrier, you
ought to be able to fix spinlocks to be memory barriers.  And then, if
you need a memory barrier for some other purpose, you can always fall
back to acquiring and releasing a spinlock.

Maybe that's too contorted.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.2

From
Andres Freund
Date:
On 2013-11-19 10:30:24 -0500, Tom Lane wrote:
> > I don't have an informed opinion about requiring inline support
> > (although it would surely be nice).
> 
> inline is C99, and we've generally resisted requiring C99 features.
> Maybe it's time to move that goalpost, and maybe not.

But it's a part of C99 that was very widely implemented before, so even
if we don't want to rely on C99 in its entirety, relying on inline
support is realistic.

I think, independent from atomics, the readability & maintainability win
by relying on inline functions instead of long macros, potentially with
multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is
significant.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.2

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> The only animal we have that doesn't support quiet inlines today is
> HP-UX/ac++, and I think - as in patch 1 in the series - we might be able
> to simply suppress the warning there.

Or just not worry about it, if it's only a warning?  Or does the warning
mean code bloat (lots of useless copies of the inline function)?
        regards, tom lane



Re: better atomics - v0.2

From
Andres Freund
Date:
On 2013-11-19 10:37:35 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > The only animal we have that doesn't support quiet inlines today is
> > HP-UX/ac++, and I think - as in patch 1 in the series - we might be able
> > to simply suppress the warning there.
> 
> Or just not worry about it, if it's only a warning?  Or does the warning
> mean code bloat (lots of useless copies of the inline function)?

I honestly have no idea whether it causes code bloat - I'd be surprised
if it did since it detects that they are unused, but I cannot rule it
out entirely.
The suggested patch - untested since I have no access to HP-UX - just
adds +W2177 to the compiler's commandline in template/hpux which
supposedly suppressed that warning.

I think removing the quiet inline test is a good idea, but that doesn't
preclude fixing the warnings at the same time.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.2

From
Andres Freund
Date:
Hi,

On 2013-11-19 10:27:57 -0500, Robert Haas wrote:
> > Ok, I'll work on that. It would be useful to get feedback on the
> > "frontend" API in atomics.h though. If people are unhappy with the API
> > it might very well change how we can implement the API on different
> > architectures.
>
> Sure, but if people can't understand the API, then it's hard to decide
> whether to be unhappy with it.

Fair point, while there's already an explantory blurb ontop of atomics.h
and the semantics of the functions should all be documented, it doesn't
explain the API on a high level. I've expanded it a bit

The attached patches compile and make check successfully on linux x86,
amd64 and msvc x86 and amd64. This time and updated configure is
included. The majority of operations still rely on CAS emulation.

I've copied the introduction here for easier reading:

/*-------------------------------------------------------------------------
 *
 * atomics.h
 *      Generic atomic operations support.
 *
 * Hardware and compiler dependent functions for manipulating memory
 * atomically and dealing with cache coherency. Used to implement locking
 * facilities and other concurrency safe constructs.
 *
 * There's three data types which can be manipulated atomically:
 *
 * * pg_atomic_flag - supports atomic test/clear operations, useful for
 *      implementing spinlocks and similar constructs.
 *
 * * pg_atomic_uint32 - unsigned 32bit integer that can correctly manipulated
 *      by several processes at the same time.
 *
 * * pg_atomic_uint64 - optional 64bit variant of pg_atomic_uint32. Support
 *      can be tested by checking for PG_HAVE_64_BIT_ATOMICS.
 *
 * The values stored in theses cannot be accessed using the API provided in
 * thise file, i.e. it is not possible to write statements like `var +=
 * 10`. Instead, for this example, one would have to use
 * `pg_atomic_fetch_add_u32(&var, 10)`.
 *
 * This restriction has several reasons: Primarily it doesn't allow non-atomic
 * math to be performed on these values which wouldn't be concurrency
 * safe. Secondly it allows to implement these types ontop of facilities -
 * like C11's atomics - that don't provide direct access. Lastly it serves as
 * a useful hint what code should be looked at more carefully because of
 * concurrency concerns.
 *
 * To be useful they usually will need to be placed in memory shared between
 * processes or threads, most frequently by embedding them in structs. Be
 * careful to align atomic variables to their own size!
 *
 * Before variables of one these types can be used they needs to be
 * initialized using the type's initialization function (pg_atomic_init_flag,
 * pg_atomic_init_u32 and pg_atomic_init_u64) or in case of global
 * pg_atomic_flag variables using the PG_ATOMIC_INIT_FLAG macro. These
 * initializations have to be performed before any (possibly concurrent)
 * access is possible, otherwise the results are undefined.
 *
 * For several mathematical operations two variants exist: One that returns
 * the old value before the operation was performed, and one that that returns
 * the new value. *_fetch_<op> variants will return the old value,
 * *_<op>_fetch the new one.
 *
 *
 * To bring up postgres on a platform/compiler at the very least
 * either one of
 * * pg_atomic_test_set_flag(), pg_atomic_init_flag(), pg_atomic_clear_flag()
 * * pg_atomic_compare_exchange_u32()
 * * pg_atomic_exchange_u32()
 * and all of
 * * pg_compiler_barrier(), pg_memory_barrier()a
 * need to be implemented. There exist generic, hardware independent,
 * implementations for several compilers which might be sufficient, although
 * possibly not optimal, for a new platform.
 *
 * To add support for a new architecture review whether the compilers used on
 * that platform already provide adequate support in the files included
 * "Compiler specific" section below. If not either add an include for a new
 * file adding generic support for your compiler there and/or add/update an
 * architecture specific include in the "Architecture specific" section below.
 *
 * Operations that aren't implemented by either architecture or compiler
 * specific code are implemented ontop of compare_exchange() loops or
 * spinlocks.
 *
 * Use higher level functionality (lwlocks, spinlocks, heavyweight locks)
 * whenever possible. Writing correct code using these facilities is hard.
 *
 * For an introduction to using memory barriers within the PostgreSQL backend,
 * see src/backend/storage/lmgr/README.barrier
 *
 * Portions Copyright (c) 1996-2013, PostgreSQL Global Development Group
 * Portions Copyright (c) 1994, Regents of the University of California
 *
 * src/include/storage/atomics.h
 *
 *-------------------------------------------------------------------------
 */

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: better atomics - v0.2

From
Robert Haas
Date:
On Tue, Nov 19, 2013 at 11:38 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> /*-------------------------------------------------------------------------
>  *
>  * atomics.h
>  *        Generic atomic operations support.
>  *
>  * Hardware and compiler dependent functions for manipulating memory
>  * atomically and dealing with cache coherency. Used to implement locking
>  * facilities and other concurrency safe constructs.
>  *
>  * There's three data types which can be manipulated atomically:
>  *
>  * * pg_atomic_flag - supports atomic test/clear operations, useful for
>  *      implementing spinlocks and similar constructs.
>  *
>  * * pg_atomic_uint32 - unsigned 32bit integer that can correctly manipulated
>  *      by several processes at the same time.
>  *
>  * * pg_atomic_uint64 - optional 64bit variant of pg_atomic_uint32. Support
>  *      can be tested by checking for PG_HAVE_64_BIT_ATOMICS.
>  *
>  * The values stored in theses cannot be accessed using the API provided in
>  * thise file, i.e. it is not possible to write statements like `var +=
>  * 10`. Instead, for this example, one would have to use
>  * `pg_atomic_fetch_add_u32(&var, 10)`.
>  *
>  * This restriction has several reasons: Primarily it doesn't allow non-atomic
>  * math to be performed on these values which wouldn't be concurrency
>  * safe. Secondly it allows to implement these types ontop of facilities -
>  * like C11's atomics - that don't provide direct access. Lastly it serves as
>  * a useful hint what code should be looked at more carefully because of
>  * concurrency concerns.
>  *
>  * To be useful they usually will need to be placed in memory shared between
>  * processes or threads, most frequently by embedding them in structs. Be
>  * careful to align atomic variables to their own size!

What does that mean exactly?

>  * Before variables of one these types can be used they needs to be
>  * initialized using the type's initialization function (pg_atomic_init_flag,
>  * pg_atomic_init_u32 and pg_atomic_init_u64) or in case of global
>  * pg_atomic_flag variables using the PG_ATOMIC_INIT_FLAG macro. These
>  * initializations have to be performed before any (possibly concurrent)
>  * access is possible, otherwise the results are undefined.
>  *
>  * For several mathematical operations two variants exist: One that returns
>  * the old value before the operation was performed, and one that that returns
>  * the new value. *_fetch_<op> variants will return the old value,
>  * *_<op>_fetch the new one.

Ugh.  Do we really need this much complexity?

>  * Use higher level functionality (lwlocks, spinlocks, heavyweight locks)
>  * whenever possible. Writing correct code using these facilities is hard.
>  *
>  * For an introduction to using memory barriers within the PostgreSQL backend,
>  * see src/backend/storage/lmgr/README.barrier

The extent to which these primitives themselves provide ordering
guarantees should be documented.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.2

From
Andres Freund
Date:
On 2013-11-19 12:43:44 -0500, Robert Haas wrote:
> >  * To be useful they usually will need to be placed in memory shared between
> >  * processes or threads, most frequently by embedding them in structs. Be
> >  * careful to align atomic variables to their own size!
> 
> What does that mean exactly?

The alignment part? A pg_atomic_flag needs to be aligned to
sizeof(pg_atomic_flag), pg_atomic_uint32,64s to their sizeof()
respectively. When embedding them inside a struct the whole struct needs
to be allocated at least at that alignment.
Not sure how to describe that concisely.

I've added wrapper functions around the implementation that test for
alignment to make sure it's easy to spot such mistakes which could end
up having ugly results on some platforms.

> >  * For several mathematical operations two variants exist: One that returns
> >  * the old value before the operation was performed, and one that that returns
> >  * the new value. *_fetch_<op> variants will return the old value,
> >  * *_<op>_fetch the new one.
> 
> Ugh.  Do we really need this much complexity?

Well, based on my experience it's really what you need when writing code
using it. The _<op>_fetch variants are implemented generically using the
_fetch_<op> operations, so leaving them out wouldn't win us much. What
would you like to remove?

> >  * Use higher level functionality (lwlocks, spinlocks, heavyweight locks)
> >  * whenever possible. Writing correct code using these facilities is hard.
> >  *
> >  * For an introduction to using memory barriers within the PostgreSQL backend,
> >  * see src/backend/storage/lmgr/README.barrier
> 
> The extent to which these primitives themselves provide ordering
> guarantees should be documented.

Every function should have a comment to that regard, although two have
an XXX where I am not sure what we want to mandate, specifically whether
pg_atomic_test_set_flag() should be a full or acquire barrier and
whether pg_atomic_clear_flag should be full or release.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.2

From
Bruce Momjian
Date:
On Tue, Nov 19, 2013 at 04:34:59PM +0100, Andres Freund wrote:
> On 2013-11-19 10:30:24 -0500, Tom Lane wrote:
> > > I don't have an informed opinion about requiring inline support
> > > (although it would surely be nice).
> > 
> > inline is C99, and we've generally resisted requiring C99 features.
> > Maybe it's time to move that goalpost, and maybe not.
> 
> But it's a part of C99 that was very widely implemented before, so even
> if we don't want to rely on C99 in its entirety, relying on inline
> support is realistic.
> 
> I think, independent from atomics, the readability & maintainability win
> by relying on inline functions instead of long macros, potentially with
> multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is
> significant.

Oh, man, my fastgetattr() macro is going to be simplified.  All my good
work gets rewritten.  ;-)

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: better atomics - v0.2

From
Andres Freund
Date:
On 2013-11-19 16:37:32 -0500, Bruce Momjian wrote:
> On Tue, Nov 19, 2013 at 04:34:59PM +0100, Andres Freund wrote:
> > On 2013-11-19 10:30:24 -0500, Tom Lane wrote:
> > > > I don't have an informed opinion about requiring inline support
> > > > (although it would surely be nice).
> > > 
> > > inline is C99, and we've generally resisted requiring C99 features.
> > > Maybe it's time to move that goalpost, and maybe not.
> > 
> > But it's a part of C99 that was very widely implemented before, so even
> > if we don't want to rely on C99 in its entirety, relying on inline
> > support is realistic.
> > 
> > I think, independent from atomics, the readability & maintainability win
> > by relying on inline functions instead of long macros, potentially with
> > multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is
> > significant.
> 
> Oh, man, my fastgetattr() macro is going to be simplified.  All my good
> work gets rewritten.  ;-)

That and HeapKeyTest() alone are sufficient reason for this ;)

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.2

From
Bruce Momjian
Date:
On Tue, Nov 19, 2013 at 10:39:19PM +0100, Andres Freund wrote:
> On 2013-11-19 16:37:32 -0500, Bruce Momjian wrote:
> > On Tue, Nov 19, 2013 at 04:34:59PM +0100, Andres Freund wrote:
> > > On 2013-11-19 10:30:24 -0500, Tom Lane wrote:
> > > > > I don't have an informed opinion about requiring inline support
> > > > > (although it would surely be nice).
> > > > 
> > > > inline is C99, and we've generally resisted requiring C99 features.
> > > > Maybe it's time to move that goalpost, and maybe not.
> > > 
> > > But it's a part of C99 that was very widely implemented before, so even
> > > if we don't want to rely on C99 in its entirety, relying on inline
> > > support is realistic.
> > > 
> > > I think, independent from atomics, the readability & maintainability win
> > > by relying on inline functions instead of long macros, potentially with
> > > multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is
> > > significant.
> > 
> > Oh, man, my fastgetattr() macro is going to be simplified.  All my good
> > work gets rewritten.  ;-)
> 
> That and HeapKeyTest() alone are sufficient reason for this ;)

Has there been any performance testing on this rewrite to use atomics? 
If so, can I missed it.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: better atomics - v0.2

From
Andres Freund
Date:
On 2013-11-19 17:16:56 -0500, Bruce Momjian wrote:
> On Tue, Nov 19, 2013 at 10:39:19PM +0100, Andres Freund wrote:
> > On 2013-11-19 16:37:32 -0500, Bruce Momjian wrote:
> > > On Tue, Nov 19, 2013 at 04:34:59PM +0100, Andres Freund wrote:
> > > > On 2013-11-19 10:30:24 -0500, Tom Lane wrote:
> > > > > > I don't have an informed opinion about requiring inline support
> > > > > > (although it would surely be nice).
> > > > > 
> > > > > inline is C99, and we've generally resisted requiring C99 features.
> > > > > Maybe it's time to move that goalpost, and maybe not.
> > > > 
> > > > But it's a part of C99 that was very widely implemented before, so even
> > > > if we don't want to rely on C99 in its entirety, relying on inline
> > > > support is realistic.
> > > > 
> > > > I think, independent from atomics, the readability & maintainability win
> > > > by relying on inline functions instead of long macros, potentially with
> > > > multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is
> > > > significant.
> > > 
> > > Oh, man, my fastgetattr() macro is going to be simplified.  All my good
> > > work gets rewritten.  ;-)
> > 
> > That and HeapKeyTest() alone are sufficient reason for this ;)
> 
> Has there been any performance testing on this rewrite to use atomics? 
> If so, can I missed it.

Do you mean inline? Or atomics? If the former no, if the latter
yes. I've started on it because of
http://archives.postgresql.org/message-id/20130926225545.GB26663%40awork2.anarazel.de

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.2

From
Bruce Momjian
Date:
On Tue, Nov 19, 2013 at 11:21:06PM +0100, Andres Freund wrote:
> On 2013-11-19 17:16:56 -0500, Bruce Momjian wrote:
> > On Tue, Nov 19, 2013 at 10:39:19PM +0100, Andres Freund wrote:
> > > On 2013-11-19 16:37:32 -0500, Bruce Momjian wrote:
> > > > On Tue, Nov 19, 2013 at 04:34:59PM +0100, Andres Freund wrote:
> > > > > On 2013-11-19 10:30:24 -0500, Tom Lane wrote:
> > > > > > > I don't have an informed opinion about requiring inline support
> > > > > > > (although it would surely be nice).
> > > > > > 
> > > > > > inline is C99, and we've generally resisted requiring C99 features.
> > > > > > Maybe it's time to move that goalpost, and maybe not.
> > > > > 
> > > > > But it's a part of C99 that was very widely implemented before, so even
> > > > > if we don't want to rely on C99 in its entirety, relying on inline
> > > > > support is realistic.
> > > > > 
> > > > > I think, independent from atomics, the readability & maintainability win
> > > > > by relying on inline functions instead of long macros, potentially with
> > > > > multiple eval hazards, or contortions like ILIST_INCLUDE_DEFINITIONS is
> > > > > significant.
> > > > 
> > > > Oh, man, my fastgetattr() macro is going to be simplified.  All my good
> > > > work gets rewritten.  ;-)
> > > 
> > > That and HeapKeyTest() alone are sufficient reason for this ;)
> > 
> > Has there been any performance testing on this rewrite to use atomics? 
> > If so, can I missed it.
> 
> Do you mean inline? Or atomics? If the former no, if the latter
> yes. I've started on it because of
> http://archives.postgresql.org/message-id/20130926225545.GB26663%40awork2.anarazel.de

Yes, I was wondering about atomics.  I think we know the performance
characteristics of inlining.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + Everyone has their own god. +



Re: better atomics - v0.2

From
Andres Freund
Date:
On 2013-11-19 17:25:21 -0500, Bruce Momjian wrote:
> On Tue, Nov 19, 2013 at 11:21:06PM +0100, Andres Freund wrote:
> > On 2013-11-19 17:16:56 -0500, Bruce Momjian wrote:
> > > On Tue, Nov 19, 2013 at 10:39:19PM +0100, Andres Freund wrote:
> > Do you mean inline? Or atomics? If the former no, if the latter
> > yes. I've started on it because of
> > http://archives.postgresql.org/message-id/20130926225545.GB26663%40awork2.anarazel.de
> 
> Yes, I was wondering about atomics.  I think we know the performance
> characteristics of inlining.

In that case it really depends on what we do with the atomics, providing
the abstraction itself shouldn't change performance at all, but the
possible scalability wins of using them in some critical paths are
pretty huge.

From a prototype I have, ontop of the number in the above post, removing
the spinlock from PinBuffer and UnpinBuffer() (only the
!BM_PIN_COUNT_WAITER path) gives another factor of two on the used
machine, although that required limiting MAX_BACKENDS to 0xfffff instead
of the current 0x7fffff.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.2

From
Andres Freund
Date:
On 2013-11-19 10:37:35 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > The only animal we have that doesn't support quiet inlines today is
> > HP-UX/ac++, and I think - as in patch 1 in the series - we might be able
> > to simply suppress the warning there.
> 
> Or just not worry about it, if it's only a warning?

So, my suggestion on that end is that we remove the requirement for
quiet inline now and see if it that has any negative consequences on the
buildfarm for a week or so. Imo that's a good idea regardless of us
relying on inlining support.
Does anyone have anything against that plan? If not, I'll prepare a
patch.

> Or does the warning
> mean code bloat (lots of useless copies of the inline function)?

After thinking on that for a bit, yes that's a possible consequence, but
it's quite possible that it happens in cases where we don't get the
warning too, so I don't think that argument has too much bearing as I
don't recall a complaint about it.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.2

From
Peter Eisentraut
Date:
This patch didn't make it out of the 2013-11 commit fest.  You should
move it to the next commit fest (probably with an updated patch)
before January 15th.




Re: better atomics - v0.2

From
Robert Haas
Date:
On Thu, Dec 5, 2013 at 6:39 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-11-19 10:37:35 -0500, Tom Lane wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>> > The only animal we have that doesn't support quiet inlines today is
>> > HP-UX/ac++, and I think - as in patch 1 in the series - we might be able
>> > to simply suppress the warning there.
>>
>> Or just not worry about it, if it's only a warning?
>
> So, my suggestion on that end is that we remove the requirement for
> quiet inline now and see if it that has any negative consequences on the
> buildfarm for a week or so. Imo that's a good idea regardless of us
> relying on inlining support.
> Does anyone have anything against that plan? If not, I'll prepare a
> patch.
>
>> Or does the warning
>> mean code bloat (lots of useless copies of the inline function)?
>
> After thinking on that for a bit, yes that's a possible consequence, but
> it's quite possible that it happens in cases where we don't get the
> warning too, so I don't think that argument has too much bearing as I
> don't recall a complaint about it.

But I bet the warning we're getting there is complaining about exactly
that thing.  It's possible that it means "warning: i've optimized away
your unused function into nothing" but I think it's more likely that
it means "warning: i just emitted dead code".  Indeed, I suspect that
this is why that test is written that way in the first place.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.2

From
Andres Freund
Date:
On 2013-12-23 15:04:08 -0500, Robert Haas wrote:
> On Thu, Dec 5, 2013 at 6:39 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2013-11-19 10:37:35 -0500, Tom Lane wrote:
> >> Or does the warning
> >> mean code bloat (lots of useless copies of the inline function)?
> >
> > After thinking on that for a bit, yes that's a possible consequence, but
> > it's quite possible that it happens in cases where we don't get the
> > warning too, so I don't think that argument has too much bearing as I
> > don't recall a complaint about it.
> 
> But I bet the warning we're getting there is complaining about exactly
> that thing.  It's possible that it means "warning: i've optimized away
> your unused function into nothing" but I think it's more likely that
> it means "warning: i just emitted dead code".  Indeed, I suspect that
> this is why that test is written that way in the first place.

I don't think that's particularly likely - remember, we're talking about
static inline functions here. So the compiler would have to detect that
there's an inline function which won't be used and then explicitly emit
a function body. That seems like an odd thing to do.
So, if there's compilers like that around, imo they have to deal with their
own problems. It's not like it will prohibit using postgres.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.2

From
Marti Raudsepp
Date:
On Tue, Nov 19, 2013 at 6:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> The attached patches compile and make check successfully on linux x86,
> amd64 and msvc x86 and amd64. This time and updated configure is
> included. The majority of operations still rely on CAS emulation.

Note that the atomics-generic-acc.h file has ® characters in the
Latin-1 encoding ("Intel ® Itanium ®"). If you have to use weird
characters, I think you should make sure they're UTF-8

Regards,
Marti



Re: better atomics - v0.2

From
Robert Haas
Date:
On Sun, Jan 19, 2014 at 2:43 PM, Marti Raudsepp <marti@juffo.org> wrote:
> On Tue, Nov 19, 2013 at 6:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> The attached patches compile and make check successfully on linux x86,
>> amd64 and msvc x86 and amd64. This time and updated configure is
>> included. The majority of operations still rely on CAS emulation.
>
> Note that the atomics-generic-acc.h file has ® characters in the
> Latin-1 encoding ("Intel ® Itanium ®"). If you have to use weird
> characters, I think you should make sure they're UTF-8

I think we're better off avoiding them altogether.  What's wrong with (R)?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.2

From
Andres Freund
Date:
On 2014-01-21 10:20:35 -0500, Robert Haas wrote:
> On Sun, Jan 19, 2014 at 2:43 PM, Marti Raudsepp <marti@juffo.org> wrote:
> > On Tue, Nov 19, 2013 at 6:38 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> The attached patches compile and make check successfully on linux x86,
> >> amd64 and msvc x86 and amd64. This time and updated configure is
> >> included. The majority of operations still rely on CAS emulation.
> >
> > Note that the atomics-generic-acc.h file has ® characters in the
> > Latin-1 encoding ("Intel ® Itanium ®"). If you have to use weird
> > characters, I think you should make sure they're UTF-8
>
> I think we're better off avoiding them altogether.  What's wrong with (R)?

Nothing at all. That was just the copied title from the pdf, it's gone
now.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



better atomics - v0.5

From
Andres Freund
Date:
Hi,

[sorry for the second copy Robert]

Attached is a new version of the atomic operations patch. Lots has
changed since the last post:

* gcc, msvc work. acc, xlc, sunpro have blindly written support which
  should be relatively easy to fix up. All try to implement TAS, 32 bit
  atomic add, 32 bit compare exchange; some do it for 64bit as well.

* x86 gcc inline assembly means that we support every gcc version on x86
  >= i486; even when the __sync APIs aren't available yet.

* 'inline' support isn't required anymore. We fall back to
  ATOMICS_INCLUDE_DEFINITIONS/STATIC_IF_INLINE etc. trickery.

* When the current platform doesn't have support for atomic operations a
  spinlock backed implementation is used. This can be forced using
  --disable-atomics.

  That even works when semaphores are used to implement spinlocks (a
  separate array is used there to avoid nesting problems). It contrast
  to an earlier implementation this even works when atomics are mapped
  to different addresses in individual processes (think dsm).

* s_lock.h falls back to the atomics.h provided APIs iff it doesn't have
  native support for the current platform. This can be forced by
  defining USE_ATOMICS_BASED_SPINLOCKS. Due to generic compiler
  intrinsics based implementations that should make it easier to bring
  up postgres on different platfomrs.

* I tried to improve the documentation of the facilities in
  src/include/storage/atomics.h. Including documentation of the various
  barrier semantics.

* There's tests in regress.c that get call via a SQL function from the
  regression tests.

* Lots of other details changed, but that's the major pieces.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: better atomics - v0.5

From
Josh Berkus
Date:
On 06/25/2014 10:14 AM, Andres Freund wrote:
> Hi,
> 
> [sorry for the second copy Robert]
> 
> Attached is a new version of the atomic operations patch. Lots has
> changed since the last post:

Is this at a state where we can performance-test it yet?

-- 
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com



Re: better atomics - v0.5

From
andres@anarazel.de (Andres Freund)
Date:
On 2014-06-25 10:39:53 -0700, Josh Berkus wrote:
> On 06/25/2014 10:14 AM, Andres Freund wrote:
> > Hi,
> > 
> > [sorry for the second copy Robert]
> > 
> > Attached is a new version of the atomic operations patch. Lots has
> > changed since the last post:
> 
> Is this at a state where we can performance-test it yet?

Well, this patch itself won't have a performance impact at all. It's
about proving the infrastructure to use atomic operations in further
patches in a compiler and platform independent way.
I'll repost a new version of the lwlocks patch (still fighting an issue
I introduced while rebasing ntop of Heikki's xlog scalability/lwlock
merger) soon. Addressing the review comments Amit has made since.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Robert Haas
Date:
On Wed, Jun 25, 2014 at 1:14 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> [sorry for the second copy Robert]
>
> Attached is a new version of the atomic operations patch. Lots has
> changed since the last post:
>
> * gcc, msvc work. acc, xlc, sunpro have blindly written support which
>   should be relatively easy to fix up. All try to implement TAS, 32 bit
>   atomic add, 32 bit compare exchange; some do it for 64bit as well.
>
> * x86 gcc inline assembly means that we support every gcc version on x86
>   >= i486; even when the __sync APIs aren't available yet.
>
> * 'inline' support isn't required anymore. We fall back to
>   ATOMICS_INCLUDE_DEFINITIONS/STATIC_IF_INLINE etc. trickery.
>
> * When the current platform doesn't have support for atomic operations a
>   spinlock backed implementation is used. This can be forced using
>   --disable-atomics.
>
>   That even works when semaphores are used to implement spinlocks (a
>   separate array is used there to avoid nesting problems). It contrast
>   to an earlier implementation this even works when atomics are mapped
>   to different addresses in individual processes (think dsm).
>
> * s_lock.h falls back to the atomics.h provided APIs iff it doesn't have
>   native support for the current platform. This can be forced by
>   defining USE_ATOMICS_BASED_SPINLOCKS. Due to generic compiler
>   intrinsics based implementations that should make it easier to bring
>   up postgres on different platfomrs.
>
> * I tried to improve the documentation of the facilities in
>   src/include/storage/atomics.h. Including documentation of the various
>   barrier semantics.
>
> * There's tests in regress.c that get call via a SQL function from the
>   regression tests.
>
> * Lots of other details changed, but that's the major pieces.

Review:

- The changes to spin.c include unnecessary whitespace adjustments.
- The new argument to s_init_lock_sema() isn't used.
- "on top" is two words, not one ("ontop").
- The changes to pg_config_manual.h add a superfluous blank line.
- Are the regression tests in regress.c likely to catch anything?  Really?
- The patch adds a test for PGAC_HAVE_GCC__SYNC_INT32_ATOMICS, which
seems to be testing for __builtin_constant_p.  I don't see that being
used in the patch anywhere, though; and also, why doesn't the naming
match?
- s_lock.h adds some fallback code to use atomics to define TAS on
platforms where GCC supports atomics but where we do not have a
working TAS implementation.  However, this supposed fallback code
defines HAS_TEST_AND_SET unconditionally, so I think it will get used
even if we don't have (or have disabled via configure) atomics.
- I completely loathe the file layout you've proposed in
src/include/storage.  If we're going to have separate #include files
for each architecture (and I'd rather we didn't go that route, because
it's messy and unlike what we have now), then shouldn't that stuff go
in src/include/port or a new directory src/include/port/atomics?
Polluting src/include/storage with a whole bunch of files called
atomics-whatever.h strikes me as completely ugly, and there's a lot of
duplicate code in there.
- What's the point of defining pg_read_barrier() to be
pg_read_barrier_impl() and pg_write_barrier() to be
pg_write_barrier_impl() and so forth?  That seems like unnecessary
notation.
- Should SpinlockSemaInit() be assigning SpinlockSemas() to a local
variable instead of re-executing it every time?  Granted, the compiler
might inline this somehow, but...

More generally, I'm really wondering if you're making the
compare-and-swap implementation a lot more complicated than it needs
to be.  How much would we lose if we supported compiler intrinsics on
versions of GCC and MSVC that have it and left everything else to
future patches?  I suspect this patch could be about 20% of its
current size and give us everything that we need.  I've previously
opposed discarding s_lock.h on the grounds that it's extremely well
battle-tested, and if we change it, we might introduce subtle bugs
that are dependent on GCC versions and such.  But now that compiler
intrinsics are relatively common, I don't really see any reason for us
to provide our own assembler versions of *new* primitives we want to
use.  As long as we have some kind of wrapper functions or macros, we
retain the *option* to add workarounds for compiler bugs or lack of
compiler support on platforms, but it seems an awful lot simpler to me
to start by assuming that that the compiler will DTRT, and only roll
our own if that proves to be necessary.  It's not like our hand-rolled
implementations couldn't have bugs - which is different from s_lock.h,
where we have evidence that the exact coding in that file does or did
work on those platforms.

Similarly, I would rip out - or separate into a separate patch - the
code that tries to use atomic operations to implement TAS if we don't
already have an implementation in s_lock.h.  That might be worth
doing, if there are any common architectures that don't already have
TAS, or to simplify porting to new platforms, but it seems like a
separate thing from what this patch is mainly about, which is enabling
the use of CAS and related operations on platforms that support them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.5

From
Andres Freund
Date:
Hi,

On 2014-06-25 15:54:37 -0400, Robert Haas wrote:
> - The new argument to s_init_lock_sema() isn't used.

It's used when atomics fallback to spinlocks which fall back to
semaphores. c.f. atomics.c.

Since it better be legal to manipulate a atomic variable while holding a
spinlock we cannot simply use an arbitrary spinlock as backing for
atomics. That'd possibly cause us to wait on ourselves or cause
deadlocks.

> - Are the regression tests in regress.c likely to catch anything?
> Really?

They already cought several bugs. If the operations were immediately
widely used they probably wouldn't be necessary.

> - The patch adds a test for PGAC_HAVE_GCC__SYNC_INT32_ATOMICS, which
> seems to be testing for __builtin_constant_p.  I don't see that being
> used in the patch anywhere, though; and also, why doesn't the naming
> match?

Uh. that's a rebase screweup. Should entirely be gone.

> - s_lock.h adds some fallback code to use atomics to define TAS on
> platforms where GCC supports atomics but where we do not have a
> working TAS implementation.  However, this supposed fallback code
> defines HAS_TEST_AND_SET unconditionally, so I think it will get used
> even if we don't have (or have disabled via configure) atomics.

Hm. Good point. It should protect against that.

> - I completely loathe the file layout you've proposed in
> src/include/storage.  If we're going to have separate #include files
> for each architecture (and I'd rather we didn't go that route, because
> it's messy and unlike what we have now), then shouldn't that stuff go
> in src/include/port or a new directory src/include/port/atomics?
> Polluting src/include/storage with a whole bunch of files called
> atomics-whatever.h strikes me as completely ugly, and there's a lot of
> duplicate code in there.

I don't mind moving it somewhere else and it's easy enough to change.

> - What's the point of defining pg_read_barrier() to be
> pg_read_barrier_impl() and pg_write_barrier() to be
> pg_write_barrier_impl() and so forth?  That seems like unnecessary
> notation.

We could forgo it for pg_read_barrier() et al, but for the actual
atomics it's useful because it allows to put common checks in the !impl
routines.

> - Should SpinlockSemaInit() be assigning SpinlockSemas() to a local
> variable instead of re-executing it every time?  Granted, the compiler
> might inline this somehow, but...

I sure hope it will, but still a good idea.

> More generally, I'm really wondering if you're making the
> compare-and-swap implementation a lot more complicated than it needs
> to be.

Possible.

> How much would we lose if we supported compiler intrinsics on
> versions of GCC and MSVC that have it and left everything else to
> future patches?

The discussion so far seemed pretty clear that we can't regress somewhat
frequently used platforms. And dropping support for all but msvc and gcc
would end up doing that. We're going to have to do the legword for the
most common platforms... Note that I didn't use assembler for those, but
relied on intrinsics...

> I suspect this patch could be about 20% of its current size and give
> us everything that we need.

I think the only way to do that would be to create a s_lock.h like
maze. Which imo is a utterly horrible idea. I'm pretty sure there'll be
more assembler implementations for 9.5 and unless we think ahead of how
to structure that we'll create something bad.
I also think that a year or so down the road we're going to support more
operations. E.g. optional support for a 16byte CAS can allow for some
awesome things when you want to do lockless stuff on a 64bit platform.

I think there's chances for reducing the size by merging i386/amd64,
that looked better earlier than it does now (due to the addition of the
inline assembler).

> I've previously
> opposed discarding s_lock.h on the grounds that it's extremely well
> battle-tested, and if we change it, we might introduce subtle bugs
> that are dependent on GCC versions and such.

I think we should entirely get rid of s_lock.h. It's an unmaintainable
mess. That's what I'd done in an earlier version of the patch, but you'd
argued against it. I think you're right in that it shouldn't be done in
the same patch and possibly not even in the same cycle, but I think we
better do it sometime not too far away.
I e.g. don't see how we can guarantee sane barrier semantics with the
current implementation.

> But now that compiler
> intrinsics are relatively common, I don't really see any reason for us
> to provide our own assembler versions of *new* primitives we want to
> use.  As long as we have some kind of wrapper functions or macros, we
> retain the *option* to add workarounds for compiler bugs or lack of
> compiler support on platforms, but it seems an awful lot simpler to me
> to start by assuming that that the compiler will DTRT, and only roll
> our own if that proves to be necessary.  It's not like our hand-rolled
> implementations couldn't have bugs - which is different from s_lock.h,
> where we have evidence that the exact coding in that file does or did
> work on those platforms.

I added the x86 inline assembler because a fair number of buildfarm
animals use ancient gcc's and because I could easily test it. It's also
more efficient for gcc < 4.6. I'm not wedded to keeping it.

> Similarly, I would rip out - or separate into a separate patch - the
> code that tries to use atomic operations to implement TAS if we don't
> already have an implementation in s_lock.h.  That might be worth
> doing, if there are any common architectures that don't already have
> TAS, or to simplify porting to new platforms, but it seems like a
> separate thing from what this patch is mainly about, which is enabling
> the use of CAS and related operations on platforms that support them.

Happy to separate that out. It's also very useful to test code btw...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Heikki Linnakangas
Date:
On 06/25/2014 11:36 PM, Andres Freund wrote:
>
>> >- I completely loathe the file layout you've proposed in
>> >src/include/storage.  If we're going to have separate #include files
>> >for each architecture (and I'd rather we didn't go that route, because
>> >it's messy and unlike what we have now), then shouldn't that stuff go
>> >in src/include/port or a new directory src/include/port/atomics?
>> >Polluting src/include/storage with a whole bunch of files called
>> >atomics-whatever.h strikes me as completely ugly, and there's a lot of
>> >duplicate code in there.
> I don't mind moving it somewhere else and it's easy enough to change.

I think having a separate file for each architecture is nice. I totally 
agree that they don't belong in src/include/storage, though. s_lock.h 
has always been misplaced there, but we've let it be for historical 
reasons, but now that we're adding a dozen new files, it's time to move 
them out.

- Heikki



Re: better atomics - v0.5

From
Alvaro Herrera
Date:
Heikki Linnakangas wrote:
> On 06/25/2014 11:36 PM, Andres Freund wrote:
> >
> >>>- I completely loathe the file layout you've proposed in
> >>>src/include/storage.  If we're going to have separate #include files
> >>>for each architecture (and I'd rather we didn't go that route, because
> >>>it's messy and unlike what we have now), then shouldn't that stuff go
> >>>in src/include/port or a new directory src/include/port/atomics?
> >>>Polluting src/include/storage with a whole bunch of files called
> >>>atomics-whatever.h strikes me as completely ugly, and there's a lot of
> >>>duplicate code in there.
> >I don't mind moving it somewhere else and it's easy enough to change.
> 
> I think having a separate file for each architecture is nice. I
> totally agree that they don't belong in src/include/storage, though.
> s_lock.h has always been misplaced there, but we've let it be for
> historical reasons, but now that we're adding a dozen new files,
> it's time to move them out.

We also have a bunch of files in src/backend/storage/lmgr, both current
and introduced by this patch, which would be good to relocate.  (Really,
how did lmgr/ end up in storage/ in the first place?)

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: better atomics - v0.5

From
Robert Haas
Date:
On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Since it better be legal to manipulate a atomic variable while holding a
> spinlock we cannot simply use an arbitrary spinlock as backing for
> atomics. That'd possibly cause us to wait on ourselves or cause
> deadlocks.

I think that's going to fall afoul of Tom's previously-articulated "no
loops inside spinlocks" rule.  Most atomics, by nature, are
loop-until-it-works.

>> How much would we lose if we supported compiler intrinsics on
>> versions of GCC and MSVC that have it and left everything else to
>> future patches?
>
> The discussion so far seemed pretty clear that we can't regress somewhat
> frequently used platforms. And dropping support for all but msvc and gcc
> would end up doing that. We're going to have to do the legword for the
> most common platforms... Note that I didn't use assembler for those, but
> relied on intrinsics...

We can't *regress* somewhat-frequently used platforms, but that's
different from saying we've got to support *new* facilities on those
platforms.  Essentially the entire buildfarm is running either gcc,
some Microsoft compiler, or a compiler that's supports the same
atomics intrinsics as gcc i.e. icc or clang.  Some of those compilers
may be too old to support the atomics intrinsics, and there's one Sun
Studio animal, but I don't know that we need to care about those
things in this patch...

...unless of course the atomics fallbacks in this patch are
sufficiently sucky that anyone who ends up using those is going to be
sad.  Then the bar is a lot higher.  But if that's the case then I
wonder if we're really on the right course here.

> I added the x86 inline assembler because a fair number of buildfarm
> animals use ancient gcc's and because I could easily test it. It's also
> more efficient for gcc < 4.6. I'm not wedded to keeping it.

Hmm. gcc 4.6 is only just over a year old, so if pre-4.6
implementations aren't that good, that's a pretty good argument for
keeping our own implementations around.  :-(

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.5

From
Robert Haas
Date:
On Wed, Jun 25, 2014 at 5:42 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> I think having a separate file for each architecture is nice. I totally
> agree that they don't belong in src/include/storage, though. s_lock.h has
> always been misplaced there, but we've let it be for historical reasons, but
> now that we're adding a dozen new files, it's time to move them out.

I find the current organization pretty confusing, but maybe that could
be solved by better documentation of what's supposed to go in each
architecture or compiler-dependent file.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-06-25 20:16:08 -0400, Robert Haas wrote:
> On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > Since it better be legal to manipulate a atomic variable while holding a
> > spinlock we cannot simply use an arbitrary spinlock as backing for
> > atomics. That'd possibly cause us to wait on ourselves or cause
> > deadlocks.
> 
> I think that's going to fall afoul of Tom's previously-articulated "no
> loops inside spinlocks" rule.  Most atomics, by nature, are
> loop-until-it-works.

Well, so is TAS itself :).

More seriously, I think we're not going to have much fun if we're making
up the rule that you can't do an atomic add/sub while a spinlock is
held. That just precludes to many use cases and will make the code much
harder to understand. I don't think we're going to end up having many
problems if we allow atomic read/add/sub/write in there.

> >> How much would we lose if we supported compiler intrinsics on
> >> versions of GCC and MSVC that have it and left everything else to
> >> future patches?
> >
> > The discussion so far seemed pretty clear that we can't regress somewhat
> > frequently used platforms. And dropping support for all but msvc and gcc
> > would end up doing that. We're going to have to do the legword for the
> > most common platforms... Note that I didn't use assembler for those, but
> > relied on intrinsics...
> 
> We can't *regress* somewhat-frequently used platforms, but that's
> different from saying we've got to support *new* facilities on those
> platforms.

Well, we want to use the new facilities somewhere. And it's not entirely
unfair to call it a regression if other os/compiler/arch combinations
speed up but yours don't.

> Essentially the entire buildfarm is running either gcc,
> some Microsoft compiler, or a compiler that's supports the same
> atomics intrinsics as gcc i.e. icc or clang.  Some of those compilers
> may be too old to support the atomics intrinsics, and there's one Sun
> Studio animal, but I don't know that we need to care about those
> things in this patch...

I think we should support those where it's easy and where we'll see the
breakage on the buildfarm. Sun studio's intrinsics are simple enough
that I don't think my usage of them will be too hard to fix.

> ...unless of course the atomics fallbacks in this patch are
> sufficiently sucky that anyone who ends up using those is going to be
> sad.  Then the bar is a lot higher.  But if that's the case then I
> wonder if we're really on the right course here.

I don't you'll notice it much on mostly nonconcurrent uses. But some of
those architectures/platforms do support larger NUMA like setups. And
for those it'd probably hurt for some workloads. And e.g. solaris is
still fairly common for such setups.

> > I added the x86 inline assembler because a fair number of buildfarm
> > animals use ancient gcc's and because I could easily test it. It's also
> > more efficient for gcc < 4.6. I'm not wedded to keeping it.
> 
> Hmm. gcc 4.6 is only just over a year old, so if pre-4.6
> implementations aren't that good, that's a pretty good argument for
> keeping our own implementations around.  :-(

The difference isn't that big. It's
return __atomic_compare_exchange_n(&ptr->value, expected, newval,                                  false,
__ATOMIC_SEQ_CST,__ATOMIC_SEQ_CST);
vs

bool    ret;
uint64  current;
current = __sync_val_compare_and_swap(&ptr->value, *expected, newval);
ret = current == *expected;
*expected = current;
return ret;

On x86 that's an additional cmp, mov and such. Useless pos because
cmpxchg already computes all that... (that's returned by the setz in my
patch). MSVC only supports the second form as well.

There's a fair number of < 4.2 gccs on the buildfarm as well though.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-06-25 20:22:31 -0400, Robert Haas wrote:
> On Wed, Jun 25, 2014 at 5:42 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
> > I think having a separate file for each architecture is nice. I totally
> > agree that they don't belong in src/include/storage, though. s_lock.h has
> > always been misplaced there, but we've let it be for historical reasons, but
> > now that we're adding a dozen new files, it's time to move them out.
> 
> I find the current organization pretty confusing, but maybe that could
> be solved by better documentation of what's supposed to go in each
> architecture or compiler-dependent file.

The idea is that first a architecture specific file (atomics-arch-*.h)
is included. That file can provide a (partial) implementation for the
specific architecture. Or it can do pretty much nothing.

After that a compiler specific file is included
(atomics-generic-*.h). If atomics aren't yet implemented that can
provide an intrinsics based implementation if the compiler version has
support for it. At the very least a compiler barrier should be provided.

After that the spinlock based fallback implementation
(atomics-fallback.h) provides atomics and barriers if not yet
available. By here we're sure that nothing else will provide them.

Then we can provide operations (atomics-generic.h) that build ontop of
the provided functions. E.g. implement _sub, _and et al.

I'll include some more of that explanation in the header.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Merlin Moncure
Date:
On Thu, Jun 26, 2014 at 5:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-06-25 20:16:08 -0400, Robert Haas wrote:
>> On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > Since it better be legal to manipulate a atomic variable while holding a
>> > spinlock we cannot simply use an arbitrary spinlock as backing for
>> > atomics. That'd possibly cause us to wait on ourselves or cause
>> > deadlocks.
>>
>> I think that's going to fall afoul of Tom's previously-articulated "no
>> loops inside spinlocks" rule.  Most atomics, by nature, are
>> loop-until-it-works.
>
> Well, so is TAS itself :).
>
> More seriously, I think we're not going to have much fun if we're making
> up the rule that you can't do an atomic add/sub while a spinlock is
> held. That just precludes to many use cases and will make the code much
> harder to understand. I don't think we're going to end up having many
> problems if we allow atomic read/add/sub/write in there.

That rule seems reasonable -- why would you ever want to do this?
While you couldn't properly deadlock it seems like it could lead to
unpredictable and hard to diagnose performance stalls.

merlin



Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-06-26 07:44:14 -0500, Merlin Moncure wrote:
> On Thu, Jun 26, 2014 at 5:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-06-25 20:16:08 -0400, Robert Haas wrote:
> >> On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > Since it better be legal to manipulate a atomic variable while holding a
> >> > spinlock we cannot simply use an arbitrary spinlock as backing for
> >> > atomics. That'd possibly cause us to wait on ourselves or cause
> >> > deadlocks.
> >>
> >> I think that's going to fall afoul of Tom's previously-articulated "no
> >> loops inside spinlocks" rule.  Most atomics, by nature, are
> >> loop-until-it-works.
> >
> > Well, so is TAS itself :).
> >
> > More seriously, I think we're not going to have much fun if we're making
> > up the rule that you can't do an atomic add/sub while a spinlock is
> > held. That just precludes to many use cases and will make the code much
> > harder to understand. I don't think we're going to end up having many
> > problems if we allow atomic read/add/sub/write in there.
> 
> That rule seems reasonable -- why would you ever want to do this?

Are you wondering why you'd ever manipulate an atomic op inside a
spinlock?
There's enough algorithms where a slowpath is done under a spinlock but
the fastpath is done without. You can't simply use nonatomic operations
for manipulation under the spinlock because the fastpaths might then
observe/cause bogus state.

Obviously I'm not advocating to do random stuff in spinlocks. W're
talking about single lock xadd;s (or the LL/SC) equivalent here.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Noah Misch
Date:
On Thu, Jun 26, 2014 at 12:20:06PM +0200, Andres Freund wrote:
> On 2014-06-25 20:16:08 -0400, Robert Haas wrote:
> > On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > > Since it better be legal to manipulate a atomic variable while holding a
> > > spinlock we cannot simply use an arbitrary spinlock as backing for
> > > atomics. That'd possibly cause us to wait on ourselves or cause
> > > deadlocks.
> > 
> > I think that's going to fall afoul of Tom's previously-articulated "no
> > loops inside spinlocks" rule.  Most atomics, by nature, are
> > loop-until-it-works.
> 
> Well, so is TAS itself :).
> 
> More seriously, I think we're not going to have much fun if we're making
> up the rule that you can't do an atomic add/sub while a spinlock is
> held. That just precludes to many use cases and will make the code much
> harder to understand. I don't think we're going to end up having many
> problems if we allow atomic read/add/sub/write in there.

I agree it's valuable to have a design that permits invoking an atomic
operation while holding a spinlock.

> > > I added the x86 inline assembler because a fair number of buildfarm
> > > animals use ancient gcc's and because I could easily test it. It's also
> > > more efficient for gcc < 4.6. I'm not wedded to keeping it.
> > 
> > Hmm. gcc 4.6 is only just over a year old, so if pre-4.6
> > implementations aren't that good, that's a pretty good argument for
> > keeping our own implementations around.  :-(

GCC 4.6 was released 2011-03-25, though that doesn't change the bottom line.

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Re: better atomics - v0.5

From
Robert Haas
Date:
On Thu, Jun 26, 2014 at 6:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-06-25 20:16:08 -0400, Robert Haas wrote:
>> On Wed, Jun 25, 2014 at 4:36 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > Since it better be legal to manipulate a atomic variable while holding a
>> > spinlock we cannot simply use an arbitrary spinlock as backing for
>> > atomics. That'd possibly cause us to wait on ourselves or cause
>> > deadlocks.
>>
>> I think that's going to fall afoul of Tom's previously-articulated "no
>> loops inside spinlocks" rule.  Most atomics, by nature, are
>> loop-until-it-works.
>
> Well, so is TAS itself :).

Yeah, which is why we have a rule that you're not suppose to acquire
another spinlock while already holding one.  You're trying to make the
spinlocks used to simulate atomic ops an exception to that rule, but
I'm not sure that's OK.  An atomic op is a lot more expensive than a
regular unlocked load or store even if it doesn't loop, and if it does
loop, it's worse, potentially by a large multiple.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.5

From
Robert Haas
Date:
On Thu, Jun 26, 2014 at 7:21 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-06-25 20:22:31 -0400, Robert Haas wrote:
>> On Wed, Jun 25, 2014 at 5:42 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>> > I think having a separate file for each architecture is nice. I totally
>> > agree that they don't belong in src/include/storage, though. s_lock.h has
>> > always been misplaced there, but we've let it be for historical reasons, but
>> > now that we're adding a dozen new files, it's time to move them out.
>>
>> I find the current organization pretty confusing, but maybe that could
>> be solved by better documentation of what's supposed to go in each
>> architecture or compiler-dependent file.
>
> The idea is that first a architecture specific file (atomics-arch-*.h)
> is included. That file can provide a (partial) implementation for the
> specific architecture. Or it can do pretty much nothing.
>
> After that a compiler specific file is included
> (atomics-generic-*.h). If atomics aren't yet implemented that can
> provide an intrinsics based implementation if the compiler version has
> support for it. At the very least a compiler barrier should be provided.
>
> After that the spinlock based fallback implementation
> (atomics-fallback.h) provides atomics and barriers if not yet
> available. By here we're sure that nothing else will provide them.
>
> Then we can provide operations (atomics-generic.h) that build ontop of
> the provided functions. E.g. implement _sub, _and et al.
>
> I'll include some more of that explanation in the header.

I get the general principle, but I think it would be good to have one
place that says something like:

Each architecture must provide A and B, may provide both or neither of
C and D, and may also any or all of E, F, and G.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.5

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jun 26, 2014 at 6:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> On 2014-06-25 20:16:08 -0400, Robert Haas wrote:
>>> I think that's going to fall afoul of Tom's previously-articulated "no
>>> loops inside spinlocks" rule.  Most atomics, by nature, are
>>> loop-until-it-works.

>> Well, so is TAS itself :).

> Yeah, which is why we have a rule that you're not suppose to acquire
> another spinlock while already holding one.  You're trying to make the
> spinlocks used to simulate atomic ops an exception to that rule, but
> I'm not sure that's OK.  An atomic op is a lot more expensive than a
> regular unlocked load or store even if it doesn't loop, and if it does
> loop, it's worse, potentially by a large multiple.

Well, the point here is to be sure that there's a reasonably small bound
on how long we hold the spinlock.  It doesn't have to be zero, just small.

Would it be practical to say that the coding rule is that you can only
use atomic ops on fields that are protected by the spinlock, ie, nobody
else is supposed to be touching those fields while you have the spinlock?
If that's the case, then the atomic op should see no contention and thus
not take very long.

On the other hand, if the atomic op is touching something not protected
by the spinlock, that seems to me to be morally equivalent to taking a
spinlock while holding another one, which as Robert says is forbidden
by our current coding rules, and for very good reasons IMO.

However, that may be safe only as long as we have real atomic ops.
If we're simulating those using spinlocks then you have to ask questions
about how many underlying spinlocks there are and whether artificial
contention might ensue (due to the same spinlock being used for multiple
things).
        regards, tom lane



Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-06-26 11:47:15 -0700, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Thu, Jun 26, 2014 at 6:20 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> On 2014-06-25 20:16:08 -0400, Robert Haas wrote:
> >>> I think that's going to fall afoul of Tom's previously-articulated "no
> >>> loops inside spinlocks" rule.  Most atomics, by nature, are
> >>> loop-until-it-works.
> 
> >> Well, so is TAS itself :).
> 
> > Yeah, which is why we have a rule that you're not suppose to acquire
> > another spinlock while already holding one.  You're trying to make the
> > spinlocks used to simulate atomic ops an exception to that rule, but
> > I'm not sure that's OK.  An atomic op is a lot more expensive than a
> > regular unlocked load or store even if it doesn't loop, and if it does
> > loop, it's worse, potentially by a large multiple.
> 
> Well, the point here is to be sure that there's a reasonably small bound
> on how long we hold the spinlock.  It doesn't have to be zero, just small.
> 
> Would it be practical to say that the coding rule is that you can only
> use atomic ops on fields that are protected by the spinlock, ie, nobody
> else is supposed to be touching those fields while you have the spinlock?
> If that's the case, then the atomic op should see no contention and thus
> not take very long.

I don't really see usecases where it's not related data that's being
touched, but: The point is that the fastpath (not using a spinlock) might
touch the atomic variable, even while the slowpath has acquired the
spinlock. So the slowpath can't just use non-atomic operations on the
atomic variable.
Imagine something like releasing a buffer pin while somebody else is
doing something that requires holding the buffer header spinlock.

> On the other hand, if the atomic op is touching something not protected
> by the spinlock, that seems to me to be morally equivalent to taking a
> spinlock while holding another one, which as Robert says is forbidden
> by our current coding rules, and for very good reasons IMO.
>
> However, that may be safe only as long as we have real atomic ops.
> If we're simulating those using spinlocks then you have to ask questions
> about how many underlying spinlocks there are and whether artificial
> contention might ensue (due to the same spinlock being used for multiple
> things).

With the code as submitted it's one spinlock per atomic variable. Which
is fine until spinlocks are emulated using semaphores - in that case a
separate array of semaphores (quite small in the patch as submitted) is
used so there's no chance of deadlocks against a 'real' spinlock or
anything like that.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Merlin Moncure
Date:
On Thu, Jun 26, 2014 at 1:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> Would it be practical to say that the coding rule is that you can only
> use atomic ops on fields that are protected by the spinlock, ie, nobody
> else is supposed to be touching those fields while you have the spinlock?
> If that's the case, then the atomic op should see no contention and thus
> not take very long.

I wonder if this is true in all cases.  The address you are locking
might be logically protected but at the same time nearby to other
memory that is under contention.  In other words, I don't think you
can assume an atomic op locks exactly 4 bytes of memory for the
operation.

> On the other hand, if the atomic op is touching something not protected
> by the spinlock, that seems to me to be morally equivalent to taking a
> spinlock while holding another one, which as Robert says is forbidden
> by our current coding rules, and for very good reasons IMO.

Well not quite: you don't have the possibility of deadlock.

merlin



Re: better atomics - v0.5

From
Robert Haas
Date:
On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> I don't really see usecases where it's not related data that's being
> touched, but: The point is that the fastpath (not using a spinlock) might
> touch the atomic variable, even while the slowpath has acquired the
> spinlock. So the slowpath can't just use non-atomic operations on the
> atomic variable.
> Imagine something like releasing a buffer pin while somebody else is
> doing something that requires holding the buffer header spinlock.

If the atomic variable can be manipulated without the spinlock under
*any* circumstances, then how is it a good idea to ever manipulate it
with the spinlock?  That seems hard to reason about, and unnecessary.
Critical sections for spinlocks should be short and contain only the
instructions that need protection, and clearly the atomic op is not
one of those.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-06-27 13:15:25 -0400, Robert Haas wrote:
> On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > I don't really see usecases where it's not related data that's being
> > touched, but: The point is that the fastpath (not using a spinlock) might
> > touch the atomic variable, even while the slowpath has acquired the
> > spinlock. So the slowpath can't just use non-atomic operations on the
> > atomic variable.
> > Imagine something like releasing a buffer pin while somebody else is
> > doing something that requires holding the buffer header spinlock.
> 
> If the atomic variable can be manipulated without the spinlock under
> *any* circumstances, then how is it a good idea to ever manipulate it
> with the spinlock?  That seems hard to reason about, and unnecessary.
> Critical sections for spinlocks should be short and contain only the
> instructions that need protection, and clearly the atomic op is not
> one of those.

Imagine the situation for the buffer header spinlock which is one of the
bigger performance issues atm. We could aim to replace all usages of
that with clever and complicated logic, but it's hard.

The IMO reasonable (and prototyped) way to do it is to make the common
paths lockless, but fall back to the spinlock for the more complicated
situations. For the buffer header that means that pin/unpin and buffer
lookup are lockless, but IO and changing the identity of a buffer still
require the spinlock. My attempts to avoid the latter basically required
a buffer header specific reimplementation of spinlocks.

To make pin/unpin et al safe without acquiring the spinlock the pincount
and the flags had to be moved into one variable so that when
incrementing the pincount you automatically get the flagbits back. If
it's currently valid all is good, otherwise the spinlock needs to be
acquired. But for that to work you need to set flags while the spinlock
is held.

Possibly you can come up with ways to avoid that, but I am pretty damn
sure that the code will be less robust by forbidding atomics inside a
critical section, not the contrary. It's a good idea to avoid it, but
not at all costs.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Robert Haas
Date:
On Fri, Jun 27, 2014 at 2:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-06-27 13:15:25 -0400, Robert Haas wrote:
>> On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > I don't really see usecases where it's not related data that's being
>> > touched, but: The point is that the fastpath (not using a spinlock) might
>> > touch the atomic variable, even while the slowpath has acquired the
>> > spinlock. So the slowpath can't just use non-atomic operations on the
>> > atomic variable.
>> > Imagine something like releasing a buffer pin while somebody else is
>> > doing something that requires holding the buffer header spinlock.
>>
>> If the atomic variable can be manipulated without the spinlock under
>> *any* circumstances, then how is it a good idea to ever manipulate it
>> with the spinlock?  That seems hard to reason about, and unnecessary.
>> Critical sections for spinlocks should be short and contain only the
>> instructions that need protection, and clearly the atomic op is not
>> one of those.
>
> Imagine the situation for the buffer header spinlock which is one of the
> bigger performance issues atm. We could aim to replace all usages of
> that with clever and complicated logic, but it's hard.
>
> The IMO reasonable (and prototyped) way to do it is to make the common
> paths lockless, but fall back to the spinlock for the more complicated
> situations. For the buffer header that means that pin/unpin and buffer
> lookup are lockless, but IO and changing the identity of a buffer still
> require the spinlock. My attempts to avoid the latter basically required
> a buffer header specific reimplementation of spinlocks.
>
> To make pin/unpin et al safe without acquiring the spinlock the pincount
> and the flags had to be moved into one variable so that when
> incrementing the pincount you automatically get the flagbits back. If
> it's currently valid all is good, otherwise the spinlock needs to be
> acquired. But for that to work you need to set flags while the spinlock
> is held.
>
> Possibly you can come up with ways to avoid that, but I am pretty damn
> sure that the code will be less robust by forbidding atomics inside a
> critical section, not the contrary. It's a good idea to avoid it, but
> not at all costs.

You might be right, but I'm not convinced.  Does the lwlock
scalability patch work this way, too?  I was hoping to look at that
RSN, so if there are roughly the same issues there it might shed some
light on this point also.

What I'm basically afraid of is that this will work fine in many cases
but have really ugly failure modes.  That's pretty much what happens
with spinlocks already - the overhead is insignificant at low levels
of contention but as the spinlock starts to become contended the CPUs
all fight over the cache line and performance goes to pot.  ISTM that
making the spinlock critical section significantly longer by putting
atomic ops in there could appear to win in some cases while being very
bad in others.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.5

From
Amit Kapila
Date:
On Wed, Jun 25, 2014 at 10:44 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>
> Attached is a new version of the atomic operations patch. Lots has
> changed since the last post:
>
> * gcc, msvc work. acc, xlc, sunpro have blindly written support which
>   should be relatively easy to fix up. All try to implement TAS, 32 bit
>   atomic add, 32 bit compare exchange; some do it for 64bit as well.

I have reviewed msvc part of patch and below are my findings:

1.
+pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
+ uint32 *expected, uint32 newval)
+{
+ bool ret;
+ uint32 current;
+ current = InterlockedCompareExchange(&ptr->value, newval, *expected);

Is there a reason why using InterlockedCompareExchange() is better
than using InterlockedCompareExchangeAcquire() variant?
*Acquire or *Release variants can provide acquire memory ordering
semantics for processors which support the same else will be defined
as InterlockedCompareExchange.


2.
+pg_atomic_compare_exchange_u32_impl()
{
..
+ *expected = current;
}

Can we implement this API without changing expected value?
I mean if the InterlockedCompareExchange() is success, then it will
store the newval in ptr->value, else *ret* will be false.
I am not sure if there is anything implementation specific in changing
*expected*.

I think this value is required for lwlock patch, but I am wondering why
can't the same be achieved if we just return the *current* value and
then let lwlock patch do the handling based on it.  This will have another
advantage that our pg_* API will also have similar signature as native API's.


3. Is there a overhead of using Add, instead of increment/decrement
version of Interlocked?

I could not find any evidence which can clearly indicate, if one is better
than other except some recommendation in below link which suggests to
*maintain reference count*, use Increment/Decrement

Refer *The Locking Cookbook* in below link:

However, when I tried to check the instructions each function generates,
then I don't find anything which suggests that *Add variant can be slow.

Refer Instructions for both functions:

cPublicProc _InterlockedIncrement,1 
cPublicFpo 1,0 
        mov     ecx,Addend              ; get pointer to addend variable 
        mov     eax,1                   ; set increment value 
Lock1: 
   lock xadd    [ecx],eax               ; interlocked increment 
        inc     eax                     ; adjust return value 
        stdRET _InterlockedIncrement    ; 
 
stdENDP _InterlockedIncrement 




cPublicProc _InterlockedExchangeAdd, 2 
cPublicFpo 2,0 
 
        mov     ecx, [esp + 4]          ; get addend address 
        mov     eax, [esp + 8]          ; get increment value 
Lock4: 
   lock xadd    [ecx], eax              ; exchange add 
 
        stdRET  _InterlockedExchangeAdd 
 
stdENDP _InterlockedExchangeAdd 

For details, refer link:

I don't think there is any need of change in current implementation,
however I just wanted to share my analysis, so that if any one else
can see any point in preferring one or the other way of implementation.

4. 
#pragma intrinsic(_ReadWriteBarrier)
#define pg_compiler_barrier_impl() _ReadWriteBarrier()

#ifndef pg_memory_barrier_impl
#define pg_memory_barrier_impl() MemoryBarrier()
#endif

There is a Caution notice in microsoft site indicating
_ReadWriteBarrier/MemoryBarrier are deprected.

Please refer below link:

When I checked previous versions of MSVC, I noticed that these are
supported on x86, IPF, x64 till VS2008 and supported only on x86, x64
for VS2012 onwards.

Link for VS2010 support:

5.
+ * atomics-generic-msvc.h
..
+ * Portions Copyright (c) 2013, PostgreSQL Global Development Group

Shouldn't copyright notice be 2014?

6.
pg_atomic_compare_exchange_u32()

It is better to have comments above this and all other related functions.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-06-28 11:58:41 +0530, Amit Kapila wrote:
> On Wed, Jun 25, 2014 at 10:44 PM, Andres Freund <andres@2ndquadrant.com>
> wrote:
> >
> > Attached is a new version of the atomic operations patch. Lots has
> > changed since the last post:
> >
> > * gcc, msvc work. acc, xlc, sunpro have blindly written support which
> >   should be relatively easy to fix up. All try to implement TAS, 32 bit
> >   atomic add, 32 bit compare exchange; some do it for 64bit as well.
> 
> I have reviewed msvc part of patch and below are my findings:

Thanks for looking at it - I pretty much wrote that part blindly.

> 1.
> +pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
> + uint32 *expected, uint32 newval)
> +{
> + bool ret;
> + uint32 current;
> + current = InterlockedCompareExchange(&ptr->value, newval, *expected);
> 
> Is there a reason why using InterlockedCompareExchange() is better
> than using InterlockedCompareExchangeAcquire() variant?

Two things:
a) compare_exchange is a read/write operator and so far I've defined it  to have full barrier semantics (documented in
atomics.h).I'd be  happy to discuss which semantics we want for which operation.
 
b) It's only supported from vista onwards. Afaik we still support XP.
c) It doesn't make any difference on x86 ;)

> *Acquire or *Release variants can provide acquire memory ordering
> semantics for processors which support the same else will be defined
> as InterlockedCompareExchange.

> 2.
> +pg_atomic_compare_exchange_u32_impl()
> {
> ..
> + *expected = current;
> }
> 
> Can we implement this API without changing expected value?
> I mean if the InterlockedCompareExchange() is success, then it will
> store the newval in ptr->value, else *ret* will be false.
> I am not sure if there is anything implementation specific in changing
> *expected*.
> I think this value is required for lwlock patch, but I am wondering why
> can't the same be achieved if we just return the *current* value and
> then let lwlock patch do the handling based on it.  This will have another
> advantage that our pg_* API will also have similar signature as native
> API's.

Many usages of compare/exchange require that you'll get the old value
back in an atomic fashion. Unfortunately the Interlocked* API doesn't
provide a nice way to do that. Note that this definition of
compare/exchange both maps nicely to C11's atomics and the actual x86
cmpxchg instruction.

I've generally tried to mimick C11's API as far as it's
possible. There's some limitations because we can't do generic types and
such, but other than that.

> 3. Is there a overhead of using Add, instead of increment/decrement
> version of Interlocked?

There's a theoretical difference for IA64 which has a special
increment/decrement operation which can only change the value by a
rather limited number of values. I don't think it's worth to care.

> 4.
> #pragma intrinsic(_ReadWriteBarrier)
> #define pg_compiler_barrier_impl() _ReadWriteBarrier()
> 
> #ifndef pg_memory_barrier_impl
> #define pg_memory_barrier_impl() MemoryBarrier()
> #endif
> 
> There is a Caution notice in microsoft site indicating
> _ReadWriteBarrier/MemoryBarrier are deprected.

It seemed to be the most widely available API, and it's what barrier.h
already used.
Do you have a different suggestion?

> 6.
> pg_atomic_compare_exchange_u32()
> 
> It is better to have comments above this and all other related functions.

Check atomics.h, there's comments above it:
/** pg_atomic_compare_exchange_u32 - CAS operation** Atomically compare the current value of ptr with *expected and
storenewval* iff ptr and *expected have the same value. The current value of *ptr will* always be stored in
*expected.**Return whether the values have been exchanged.** Full barrier semantics.*/
 

Thanks,

Andres Freund



Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-06-27 22:47:02 -0400, Robert Haas wrote:
> On Fri, Jun 27, 2014 at 2:00 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-06-27 13:15:25 -0400, Robert Haas wrote:
> >> On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >> > I don't really see usecases where it's not related data that's being
> >> > touched, but: The point is that the fastpath (not using a spinlock) might
> >> > touch the atomic variable, even while the slowpath has acquired the
> >> > spinlock. So the slowpath can't just use non-atomic operations on the
> >> > atomic variable.
> >> > Imagine something like releasing a buffer pin while somebody else is
> >> > doing something that requires holding the buffer header spinlock.
> >>
> >> If the atomic variable can be manipulated without the spinlock under
> >> *any* circumstances, then how is it a good idea to ever manipulate it
> >> with the spinlock?  That seems hard to reason about, and unnecessary.
> >> Critical sections for spinlocks should be short and contain only the
> >> instructions that need protection, and clearly the atomic op is not
> >> one of those.
> >
> > Imagine the situation for the buffer header spinlock which is one of the
> > bigger performance issues atm. We could aim to replace all usages of
> > that with clever and complicated logic, but it's hard.
> >
> > The IMO reasonable (and prototyped) way to do it is to make the common
> > paths lockless, but fall back to the spinlock for the more complicated
> > situations. For the buffer header that means that pin/unpin and buffer
> > lookup are lockless, but IO and changing the identity of a buffer still
> > require the spinlock. My attempts to avoid the latter basically required
> > a buffer header specific reimplementation of spinlocks.
> >
> > To make pin/unpin et al safe without acquiring the spinlock the pincount
> > and the flags had to be moved into one variable so that when
> > incrementing the pincount you automatically get the flagbits back. If
> > it's currently valid all is good, otherwise the spinlock needs to be
> > acquired. But for that to work you need to set flags while the spinlock
> > is held.
> >
> > Possibly you can come up with ways to avoid that, but I am pretty damn
> > sure that the code will be less robust by forbidding atomics inside a
> > critical section, not the contrary. It's a good idea to avoid it, but
> > not at all costs.
> 
> You might be right, but I'm not convinced.

Why? Anything I can do to convince you of this? Note that other users of
atomics (e.g. the linux kernel) widely use atomics inside spinlock
protected regions.

> Does the lwlock scalability patch work this way, too?

No. I've moved one add/sub inside a critical section in the current
version while debugging, but it's not required at all. I generally think
it makes sense to document the suggestion of moving atomics outside, but
not make it a requirement.

> What I'm basically afraid of is that this will work fine in many cases
> but have really ugly failure modes.  That's pretty much what happens
> with spinlocks already - the overhead is insignificant at low levels
> of contention but as the spinlock starts to become contended the CPUs
> all fight over the cache line and performance goes to pot.  ISTM that
> making the spinlock critical section significantly longer by putting
> atomic ops in there could appear to win in some cases while being very
> bad in others.

Well, I'm not saying it's something I suggest doing all the time. But if
using an atomic op in the slow path allows you to remove the spinlock
from 99% of the cases I don't see it having a significant impact.
In most scenarios (where atomics aren't emulated, i.e. platforms we
expect to used in heavily concurrent cases) the spinlock and the atomic
will be on the same cacheline making stalls much less likely.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Heikki Linnakangas
Date:
On 06/27/2014 08:15 PM, Robert Haas wrote:
> On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> I don't really see usecases where it's not related data that's being
>> touched, but: The point is that the fastpath (not using a spinlock) might
>> touch the atomic variable, even while the slowpath has acquired the
>> spinlock. So the slowpath can't just use non-atomic operations on the
>> atomic variable.
>> Imagine something like releasing a buffer pin while somebody else is
>> doing something that requires holding the buffer header spinlock.
>
> If the atomic variable can be manipulated without the spinlock under
> *any* circumstances, then how is it a good idea to ever manipulate it
> with the spinlock?

With the WALInsertLock scaling stuff in 9.4, there are now two variables 
protected by a spinlock: the current WAL insert location, and the prev 
pointer (CurrBytePos and PrevBytePos). To insert a new WAL record, you 
need to grab the spinlock to update both of them atomically. But to just 
read the WAL insert pointer, you could do an atomic read of CurrBytePos 
if the architecture supports it - now you have to grab the spinlock.

Admittedly that's just an atomic read, not an atomic compare and 
exchange or fetch-and-add. Or if the architecture has an atomic 128-bit 
compare & exchange op you could replace the spinlock with that. But it's 
not hard to imagine similar situations where you sometimes need to lock 
a larger data structure to modify it atomically, but sometimes you just 
need to modify part of it and an atomic op would suffice.

I thought Andres' LWLock patch also did something like that. If the lock 
is not contended, you can acquire it with an atomic compare & exchange 
to increment the exclusive/shared counter. But to manipulate the wait 
queue, you need the spinlock.

- Heikki




Re: better atomics - v0.5

From
Amit Kapila
Date:
On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-06-28 11:58:41 +0530, Amit Kapila wrote:
> > 1.
> > +pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
> > + uint32 *expected, uint32 newval)
> > +{
> > + bool ret;
> > + uint32 current;
> > + current = InterlockedCompareExchange(&ptr->value, newval, *expected);
> >
> > Is there a reason why using InterlockedCompareExchange() is better
> > than using InterlockedCompareExchangeAcquire() variant?
>
> Two things:
> a) compare_exchange is a read/write operator and so far I've defined it
>    to have full barrier semantics (documented in atomics.h). I'd be
>    happy to discuss which semantics we want for which operation.

I think it is better to have some discussion about it. Apart from this,
today I noticed atomic read/write doesn't use any barrier semantics
and just rely on volatile.  I think it might not be sane in all cases,
example with Visual Studio 2003, volatile to volatile references are
ordered; the compiler will not re-order volatile variable access. However,
these operations could be re-ordered by the processor.
For detailed example, refer below link:

Also if we see C11 specs both load and store uses memory order as
memory_order_seq_cst by default which is same as for compare and
exchange
I have referred below link:

> b) It's only supported from vista onwards. Afaik we still support XP.
#ifndef pg_memory_barrier_impl
#define pg_memory_barrier_impl() MemoryBarrier()
#endif

The MemoryBarrier() macro used also supports only on vista onwards,
so I think for reasons similar to using MemoryBarrier() can apply for
this as well.

> c) It doesn't make any difference on x86 ;)

What about processors like Itanium which care about acquire/release
memory barrier semantics?

> > 2.
> > +pg_atomic_compare_exchange_u32_impl()
> > {
> > ..
> > + *expected = current;
> > }
> >
> > Can we implement this API without changing expected value?
..
> > I think this value is required for lwlock patch, but I am wondering why
> > can't the same be achieved if we just return the *current* value and
> > then let lwlock patch do the handling based on it.  This will have another
> > advantage that our pg_* API will also have similar signature as native
> > API's.
>
> Many usages of compare/exchange require that you'll get the old value
> back in an atomic fashion. Unfortunately the Interlocked* API doesn't
> provide a nice way to do that.

Yes, but I think the same cannot be accomplished even by using
expected.

>Note that this definition of
> compare/exchange both maps nicely to C11's atomics and the actual x86
> cmpxchg instruction.
>
> I've generally tried to mimick C11's API as far as it's
> possible. There's some limitations because we can't do generic types and
> such, but other than that.

If I am reading C11's spec for this API correctly, then it says as below:
"Atomically compares the value pointed to by obj with the value pointed
to by expected, and if those are equal, replaces the former with desired
(performs read-modify-write operation). Otherwise, loads the actual value
pointed to by obj into *expected (performs load operation)."

So it essentialy means that it loads actual value in expected only if
operation is not successful.


> > 3. Is there a overhead of using Add, instead of increment/decrement
> > version of Interlocked?
>
> There's a theoretical difference for IA64 which has a special
> increment/decrement operation which can only change the value by a
> rather limited number of values. I don't think it's worth to care.

Okay.

> > 4.
..
> > There is a Caution notice in microsoft site indicating
> > _ReadWriteBarrier/MemoryBarrier are deprected.
>
> It seemed to be the most widely available API, and it's what barrier.h
> already used.
> Do you have a different suggestion?

I am trying to think based on suggestion given by Microsoft, but
not completely clear how to accomplish the same at this moment.

> > 6.
> > pg_atomic_compare_exchange_u32()
> >
> > It is better to have comments above this and all other related functions.
>
> Check atomics.h, there's comments above it:
> /*
>  * pg_atomic_compare_exchange_u32 - CAS operation
>  *
>  * Atomically compare the current value of ptr with *expected and store newval
>  * iff ptr and *expected have the same value. The current value of *ptr will
>  * always be stored in *expected.
>  *
>  * Return whether the values have been exchanged.
>  *
>  * Full barrier semantics.
>  */

Okay, generally code has such comments on top of function
definition rather than with declaration.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-06-29 11:53:28 +0530, Amit Kapila wrote:
> On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund <andres@2ndquadrant.com>
> > Two things:
> > a) compare_exchange is a read/write operator and so far I've defined it
> >    to have full barrier semantics (documented in atomics.h). I'd be
> >    happy to discuss which semantics we want for which operation.
> 
> I think it is better to have some discussion about it. Apart from this,
> today I noticed atomic read/write doesn't use any barrier semantics
> and just rely on volatile.

Yes, intentionally so. It's often important to get/set the current value
without the cost of emitting a memory barrier. It just has to be a
recent value  and it actually has to come from memory. And that's actually
enforced by the current definition - I think?

> > b) It's only supported from vista onwards. Afaik we still support XP.
> #ifndef pg_memory_barrier_impl
> #define pg_memory_barrier_impl() MemoryBarrier()
> #endif
> 
> The MemoryBarrier() macro used also supports only on vista onwards,
> so I think for reasons similar to using MemoryBarrier() can apply for
> this as well.

I think that's odd because barrier.h has been using MemoryBarrier()
without a version test for a long time now. I guess everyone uses a new
enough visual studio. Those intrinsics aren't actually OS but just
compiler dependent.

Otherwise we could just define it as:

FORCEINLINE
VOID
MemoryBarrier (   VOID   )
{   LONG Barrier;   __asm {       xchg Barrier, eax   }
}


> > c) It doesn't make any difference on x86 ;)
> 
> What about processors like Itanium which care about acquire/release
> memory barrier semantics?

Well, it still will be correct? I don't think it makes much sense to
focus overly much on itanium here with the price of making anything more
complicated for others.

> > > I think this value is required for lwlock patch, but I am wondering why
> > > can't the same be achieved if we just return the *current* value and
> > > then let lwlock patch do the handling based on it.  This will have
> another
> > > advantage that our pg_* API will also have similar signature as native
> > > API's.
> >
> > Many usages of compare/exchange require that you'll get the old value
> > back in an atomic fashion. Unfortunately the Interlocked* API doesn't
> > provide a nice way to do that.
> 
> Yes, but I think the same cannot be accomplished even by using
> expected.

More complicatedly so, yes? I don't think we want those comparisons on
practically every callsite.

> >Note that this definition of
> > compare/exchange both maps nicely to C11's atomics and the actual x86
> > cmpxchg instruction.
> >
> > I've generally tried to mimick C11's API as far as it's
> > possible. There's some limitations because we can't do generic types and
> > such, but other than that.
> 
> If I am reading C11's spec for this API correctly, then it says as below:
> "Atomically compares the value pointed to by obj with the value pointed
> to by expected, and if those are equal, replaces the former with desired
> (performs read-modify-write operation). Otherwise, loads the actual value
> pointed to by obj into *expected (performs load operation)."
> 
> So it essentialy means that it loads actual value in expected only if
> operation is not successful.

Yes. But in the case it's successfull it will already contain the right
value.

> > > 4.
> ..
> > > There is a Caution notice in microsoft site indicating
> > > _ReadWriteBarrier/MemoryBarrier are deprected.
> >
> > It seemed to be the most widely available API, and it's what barrier.h
> > already used.
> > Do you have a different suggestion?
> 
> I am trying to think based on suggestion given by Microsoft, but
> not completely clear how to accomplish the same at this moment.

Well, they refer to C11 stuff. But I think it'll be a while before it
makes sense to use a fallback based on that.

> > > 6.
> > > pg_atomic_compare_exchange_u32()
> > >
> > > It is better to have comments above this and all other related
> functions.
> >
> > Check atomics.h, there's comments above it:
> > /*
> >  * pg_atomic_compare_exchange_u32 - CAS operation
> >  *
> >  * Atomically compare the current value of ptr with *expected and store
> newval
> >  * iff ptr and *expected have the same value. The current value of *ptr
> will
> >  * always be stored in *expected.
> >  *
> >  * Return whether the values have been exchanged.
> >  *
> >  * Full barrier semantics.
> >  */
> 
> Okay, generally code has such comments on top of function
> definition rather than with declaration.

I don't want to have it on the _impl functions because they're
duplicated for the individual platforms and will just become out of
date...

Greetings,

Andres Freund



Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-06-28 19:36:39 +0300, Heikki Linnakangas wrote:
> On 06/27/2014 08:15 PM, Robert Haas wrote:
> >On Thu, Jun 26, 2014 at 3:04 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> >>I don't really see usecases where it's not related data that's being
> >>touched, but: The point is that the fastpath (not using a spinlock) might
> >>touch the atomic variable, even while the slowpath has acquired the
> >>spinlock. So the slowpath can't just use non-atomic operations on the
> >>atomic variable.
> >>Imagine something like releasing a buffer pin while somebody else is
> >>doing something that requires holding the buffer header spinlock.
> >
> >If the atomic variable can be manipulated without the spinlock under
> >*any* circumstances, then how is it a good idea to ever manipulate it
> >with the spinlock?
> 
> With the WALInsertLock scaling stuff in 9.4, there are now two variables
> protected by a spinlock: the current WAL insert location, and the prev
> pointer (CurrBytePos and PrevBytePos). To insert a new WAL record, you need
> to grab the spinlock to update both of them atomically. But to just read the
> WAL insert pointer, you could do an atomic read of CurrBytePos if the
> architecture supports it - now you have to grab the spinlock.
> Admittedly that's just an atomic read, not an atomic compare and exchange or
> fetch-and-add.

There's several architectures where you can do atomic 8byte reads, but
only busing cmpxchg or similar, *not* using a plain read... So I think
it's actually a fair example.

> Or if the architecture has an atomic 128-bit compare &
> exchange op you could replace the spinlock with that. But it's not hard to
> imagine similar situations where you sometimes need to lock a larger data
> structure to modify it atomically, but sometimes you just need to modify
> part of it and an atomic op would suffice.

Yes.

> I thought Andres' LWLock patch also did something like that. If the lock is
> not contended, you can acquire it with an atomic compare & exchange to
> increment the exclusive/shared counter. But to manipulate the wait queue,
> you need the spinlock.

Right. It just happens that the algorithm requires none of the atomic
ops have to be under the spinlock... But I generally think that we'll
see more slow/fastpath like situations cropping up where we can't always
avoid the atomic op while holding the lock.

How about adding a paragraph that explains that it's better to avoid
atomics usage of spinlocks because of the prolonged runtime, especially
due to the emulation and cache misses? But also saying it's ok if the
algorithm needs it and is a sufficient benefit?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Amit Kapila
Date:
On Sun, Jun 29, 2014 at 2:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-06-29 11:53:28 +0530, Amit Kapila wrote:
> > On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund <andres@2ndquadrant.com>
> > I think it is better to have some discussion about it. Apart from this,
> > today I noticed atomic read/write doesn't use any barrier semantics
> > and just rely on volatile.
>
> Yes, intentionally so. It's often important to get/set the current value
> without the cost of emitting a memory barrier. It just has to be a
> recent value  and it actually has to come from memory.

I agree with you that we don't want to incur the cost of memory barrier
for get/set, however it should not be at the cost of correctness.

> And that's actually
> enforced by the current definition - I think?

Yeah, this is the only point which I am trying to ensure, thats why I sent
you few links in previous mail where I had got some suspicion that
just doing get/set with volatile might lead to correctness issue in some
cases.

Some another Note, I found today while reading more on it which suggests
that my previous observation is right:

"Within a thread of execution, accesses (reads and writes) to all volatile
objects are guaranteed to not be reordered relative to each other, but this
order is not guaranteed to be observed by another thread, since volatile
access does not establish inter-thread synchronization."

> > > b) It's only supported from vista onwards. Afaik we still support XP.
> > #ifndef pg_memory_barrier_impl
> > #define pg_memory_barrier_impl() MemoryBarrier()
> > #endif
> >
> > The MemoryBarrier() macro used also supports only on vista onwards,
> > so I think for reasons similar to using MemoryBarrier() can apply for
> > this as well.
>
> I think that's odd because barrier.h has been using MemoryBarrier()
> without a version test for a long time now. I guess everyone uses a new
> enough visual studio.

Yeap or might be the people who even are not using new enough version
doesn't have a use case that can hit a problem due to MemoryBarrier() 

> Those intrinsics aren't actually OS but just
> compiler dependent.
>
> Otherwise we could just define it as:
>
> FORCEINLINE
> VOID
> MemoryBarrier (
>     VOID
>     )
> {
>     LONG Barrier;
>     __asm {
>         xchg Barrier, eax
>     }
> }

Yes, I think it will be better, how about defining this way just for
the cases where MemoryBarrier macro is not supported.

> > > c) It doesn't make any difference on x86 ;)
> >
> > What about processors like Itanium which care about acquire/release
> > memory barrier semantics?
>
> Well, it still will be correct? I don't think it makes much sense to
> focus overly much on itanium here with the price of making anything more
> complicated for others.

Completely agreed that we should not add or change anything which
makes patch complicated or can effect the performance, however
the reason for focusing is just to ensure that we should not break
any existing use case for Itanium w.r.t performance or otherwise.

Another way is that we can give ignore or just give lower priority for
verfiying if the patch has anything wrong for Itanium processors.
  
> > If I am reading C11's spec for this API correctly, then it says as below:
> > "Atomically compares the value pointed to by obj with the value pointed
> > to by expected, and if those are equal, replaces the former with desired
> > (performs read-modify-write operation). Otherwise, loads the actual value
> > pointed to by obj into *expected (performs load operation)."
> >
> > So it essentialy means that it loads actual value in expected only if
> > operation is not successful.
>
> Yes. But in the case it's successfull it will already contain the right
> value.

The point was just that we are not completely following C11 atomic
specification, so it might be okay to tweak a bit and make things
simpler and save LOAD instruction.  However as there is no correctness
issue here and you think that current implementation is good and
can serve more cases, we can keep as it is and move forward.

> > > > 4.
> > ..
> > > > There is a Caution notice in microsoft site indicating
> > > > _ReadWriteBarrier/MemoryBarrier are deprected.
> > >
> > > It seemed to be the most widely available API, and it's what barrier.h
> > > already used.
> > > Do you have a different suggestion?
> >
> > I am trying to think based on suggestion given by Microsoft, but
> > not completely clear how to accomplish the same at this moment.
>
> Well, they refer to C11 stuff. But I think it'll be a while before it
> makes sense to use a fallback based on that.

In this case, I have a question for you.

Un-patched usage  in barrier.h is as follows:
..
#elif defined(__ia64__) || defined(__ia64)
#define pg_compiler_barrier() _Asm_sched_fence()
#define pg_memory_barrier() _Asm_mf()

#elif defined(WIN32_ONLY_COMPILER)
/* Should work on both MSVC and Borland. */
#include <intrin.h>
#pragma intrinsic(_ReadWriteBarrier)
#define pg_compiler_barrier() _ReadWriteBarrier()
#define pg_memory_barrier() MemoryBarrier()
#endif

If I understand correctly the current define mechanism in barrier.h,
it will have different definition for Itanium processors even for windows.

However the patch defines as below:

#if defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS)
# include "storage/atomics-generic-gcc.h"
#elif defined(WIN32_ONLY_COMPILER)
# include "storage/atomics-generic-msvc.h"

What I can understand from above is that defines in 
storage/atomics-generic-msvc.h, will override any previous defines
for compiler/memory barriers and _ReadWriteBarrier()/MemoryBarrier()
will be considered for Windows always.

> > > > 6.
> > > > pg_atomic_compare_exchange_u32()
> > > >
> > > > It is better to have comments above this and all other related
> > functions.
> > Okay, generally code has such comments on top of function
> > definition rather than with declaration.
>
> I don't want to have it on the _impl functions because they're
> duplicated for the individual platforms and will just become out of
> date...

Sure, I was also not asking for _impl functions.  What I was asking
in this point was to have comments on top of definition of
pg_atomic_compare_exchange_u32() in atomics.h
In particular, on top of below and similar functions, rather than
at the place where they are declared.

STATIC_IF_INLINE bool

pg_atomic_compare_exchange_u32(volatile pg_atomic_uint32 *ptr,
  uint32 *expected, uint32 newval)
{
CHECK_POINTER_ALIGNMENT(ptr, 4);
CHECK_POINTER_ALIGNMENT(expected, 4);

return pg_atomic_compare_exchange_u32_impl(ptr, expected, newval);
}


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-06-30 11:04:53 +0530, Amit Kapila wrote:
> On Sun, Jun 29, 2014 at 2:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> > On 2014-06-29 11:53:28 +0530, Amit Kapila wrote:
> > > On Sat, Jun 28, 2014 at 1:48 PM, Andres Freund <andres@2ndquadrant.com>
> > > I think it is better to have some discussion about it. Apart from this,
> > > today I noticed atomic read/write doesn't use any barrier semantics
> > > and just rely on volatile.
> >
> > Yes, intentionally so. It's often important to get/set the current value
> > without the cost of emitting a memory barrier. It just has to be a
> > recent value  and it actually has to come from memory.
> 
> I agree with you that we don't want to incur the cost of memory barrier
> for get/set, however it should not be at the cost of correctness.

I really can't follow here. A volatile read/store forces it to go to
memory without barriers. The ABIs of all platforms we work with
guarantee that 4bytes stores/reads are atomic - we've been relying on
that for a long while.

> > And that's actually
> > enforced by the current definition - I think?
> 
> Yeah, this is the only point which I am trying to ensure, thats why I sent
> you few links in previous mail where I had got some suspicion that
> just doing get/set with volatile might lead to correctness issue in some
> cases.

But this isn't something this patch started doing - we've been doing
that for a long while?

> Some another Note, I found today while reading more on it which suggests
> that my previous observation is right:
> 
> "Within a thread of execution, accesses (reads and writes) to all volatile
> objects are guaranteed to not be reordered relative to each other, but this
> order is not guaranteed to be observed by another thread, since volatile
> access does not establish inter-thread synchronization."
> http://en.cppreference.com/w/c/atomic/memory_order

That's just saying there's no ordering guarantees with volatile
stores/reads. I don't see a contradiction?

> > > > b) It's only supported from vista onwards. Afaik we still support XP.
> > > #ifndef pg_memory_barrier_impl
> > > #define pg_memory_barrier_impl() MemoryBarrier()
> > > #endif
> > >
> > > The MemoryBarrier() macro used also supports only on vista onwards,
> > > so I think for reasons similar to using MemoryBarrier() can apply for
> > > this as well.
> >
> > I think that's odd because barrier.h has been using MemoryBarrier()
> > without a version test for a long time now. I guess everyone uses a new
> > enough visual studio.
> 
> Yeap or might be the people who even are not using new enough version
> doesn't have a use case that can hit a problem due to MemoryBarrier() 

Well, with barrier.h as it stands they'd get a compiler error if it
indeed is unsupported? But I think there's something else going on -
msft might just be removing XP from it's documentation. Postgres supports
building with with visual studio 2005 and MemoryBarrier() seems to be
supported by it.
I think we'd otherwise seen problems since 9.1 where barrier.h was introduced.

> In this case, I have a question for you.
> 
> Un-patched usage  in barrier.h is as follows:
> ..
> #elif defined(__ia64__) || defined(__ia64)
> #define pg_compiler_barrier() _Asm_sched_fence()
> #define pg_memory_barrier() _Asm_mf()
> 
> #elif defined(WIN32_ONLY_COMPILER)
> /* Should work on both MSVC and Borland. */
> #include <intrin.h>
> #pragma intrinsic(_ReadWriteBarrier)
> #define pg_compiler_barrier() _ReadWriteBarrier()
> #define pg_memory_barrier() MemoryBarrier()
> #endif
> 
> If I understand correctly the current define mechanism in barrier.h,
> it will have different definition for Itanium processors even for windows.

Either noone has ever tested postgres on itanium windows (quite
possible), because afaik _Asm_sched_fence() doesn't work on
windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros
arefor HP's acc on HPUX.

> However the patch defines as below:
> 
> #if defined(HAVE_GCC__ATOMIC_INT32_CAS) || defined(HAVE_GCC__SYNC_INT32_CAS)
> # include "storage/atomics-generic-gcc.h"
> #elif defined(WIN32_ONLY_COMPILER)
> # include "storage/atomics-generic-msvc.h"
> 
> What I can understand from above is that defines in 
> storage/atomics-generic-msvc.h, will override any previous defines
> for compiler/memory barriers and _ReadWriteBarrier()/MemoryBarrier()
> will be considered for Windows always.

Well, the memory barrier is surrounded by #ifndef
pg_memory_barrier_impl. The compiler barrier can't reasonably be defined
earlier since it's a compiler not an architecture thing.

> > > > > 6.
> > > > > pg_atomic_compare_exchange_u32()
> > > > >
> > > > > It is better to have comments above this and all other related
> > > functions.
> > > Okay, generally code has such comments on top of function
> > > definition rather than with declaration.
> >
> > I don't want to have it on the _impl functions because they're
> > duplicated for the individual platforms and will just become out of
> > date...
> 
> Sure, I was also not asking for _impl functions.  What I was asking
> in this point was to have comments on top of definition of
> pg_atomic_compare_exchange_u32() in atomics.h
> In particular, on top of below and similar functions, rather than
> at the place where they are declared.

Hm, we can do that. Don't think it'll be clearer (because you need to go
to the header anyway), but I don't feel strongly.

I'd much rather get rid of the separated definition/declaration, but
we'd need to rely on PG_USE_INLINE for it...

Greetings,

Andres Freund



Re: better atomics - v0.5

From
Robert Haas
Date:
On Sat, Jun 28, 2014 at 4:25 AM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > To make pin/unpin et al safe without acquiring the spinlock the pincount
>> > and the flags had to be moved into one variable so that when
>> > incrementing the pincount you automatically get the flagbits back. If
>> > it's currently valid all is good, otherwise the spinlock needs to be
>> > acquired. But for that to work you need to set flags while the spinlock
>> > is held.
>> >
>> > Possibly you can come up with ways to avoid that, but I am pretty damn
>> > sure that the code will be less robust by forbidding atomics inside a
>> > critical section, not the contrary. It's a good idea to avoid it, but
>> > not at all costs.
>>
>> You might be right, but I'm not convinced.
>
> Why? Anything I can do to convince you of this? Note that other users of
> atomics (e.g. the linux kernel) widely use atomics inside spinlock
> protected regions.

The Linux kernel isn't a great example, because they can ensure that
they aren't preempted while holding the spinlock.  We can't, and
that's a principle part of my concern here.

More broadly, it doesn't seem consistent.  I think that other projects
also sometimes write code that acquires a spinlock while holding
another spinlock, and we don't do that; in fact, we've elevated that
to a principle, regardless of whether it performs well in practice.
In some cases, the CPU instruction that we issue to acquire that
spinlock might be the exact same instruction we'd issue to manipulate
an atomic variable.  I don't see how it can be right to say that a
spinlock-protected critical section is allowed to manipulate an atomic
variable with cmpxchg, but not allowed to acquire another spinlock
with cmpxchg.  Either acquiring the second spinlock is OK, in which
case our current coding rule is wrong, or acquiring the atomic
variable is not OK, either.  I don't see how we can have that both
ways.

>> What I'm basically afraid of is that this will work fine in many cases
>> but have really ugly failure modes.  That's pretty much what happens
>> with spinlocks already - the overhead is insignificant at low levels
>> of contention but as the spinlock starts to become contended the CPUs
>> all fight over the cache line and performance goes to pot.  ISTM that
>> making the spinlock critical section significantly longer by putting
>> atomic ops in there could appear to win in some cases while being very
>> bad in others.
>
> Well, I'm not saying it's something I suggest doing all the time. But if
> using an atomic op in the slow path allows you to remove the spinlock
> from 99% of the cases I don't see it having a significant impact.
> In most scenarios (where atomics aren't emulated, i.e. platforms we
> expect to used in heavily concurrent cases) the spinlock and the atomic
> will be on the same cacheline making stalls much less likely.

And this again is my point: why can't we make the same argument about
two spinlocks situated on the same cache line?  I don't have a bit of
trouble believing that doing the same thing with a couple of spinlocks
could sometimes work out well, too, but Tom is adamantly opposed to
that.  I don't want to just make an end run around that issue without
understanding whether he has a valid point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-06-30 12:15:06 -0400, Robert Haas wrote:
> More broadly, it doesn't seem consistent.  I think that other projects
> also sometimes write code that acquires a spinlock while holding
> another spinlock, and we don't do that; in fact, we've elevated that
> to a principle, regardless of whether it performs well in practice.

It's actually more than a principle: It's now required for correctness,
because otherwise the semaphore based spinlock implementation will
deadlock otherwise.

> In some cases, the CPU instruction that we issue to acquire that
> spinlock might be the exact same instruction we'd issue to manipulate
> an atomic variable.  I don't see how it can be right to say that a
> spinlock-protected critical section is allowed to manipulate an atomic
> variable with cmpxchg, but not allowed to acquire another spinlock
> with cmpxchg.  Either acquiring the second spinlock is OK, in which
> case our current coding rule is wrong, or acquiring the atomic
> variable is not OK, either.  I don't see how we can have that both
> ways.

The no nested spinlock thing used to be a guideline. One nobody had big
problems with because it didn't prohibit solutions to problems. Since
guidelines are more useful when simple it's "no nested
spinlocks!".

Also those two cmpxchg's aren't the same:

The CAS for spinlock acquiration will only succeed if the lock isn't
acquired. If the lock holder sleeps you'll have to wait for it to wake
up.

In contrast to that CASs for lockfree (or lock-reduced) algorithms won't
be blocked if the last manipulation was done by a backend that's now
sleeping. It'll possibly loop once, get the new value, and retry the
CAS. Yes, it can still take couple of iterations under heavy
concurrency, but you don't have the problem that the lockholder goes to
sleep.

Now, you can argue that the spinlock based fallbacks make that
difference moot, but I think that's just the price for an emulation
fringe platforms will have to pay. They aren't platforms used under
heavy concurrency.

> >> What I'm basically afraid of is that this will work fine in many cases
> >> but have really ugly failure modes.  That's pretty much what happens
> >> with spinlocks already - the overhead is insignificant at low levels
> >> of contention but as the spinlock starts to become contended the CPUs
> >> all fight over the cache line and performance goes to pot.  ISTM that
> >> making the spinlock critical section significantly longer by putting
> >> atomic ops in there could appear to win in some cases while being very
> >> bad in others.
> >
> > Well, I'm not saying it's something I suggest doing all the time. But if
> > using an atomic op in the slow path allows you to remove the spinlock
> > from 99% of the cases I don't see it having a significant impact.
> > In most scenarios (where atomics aren't emulated, i.e. platforms we
> > expect to used in heavily concurrent cases) the spinlock and the atomic
> > will be on the same cacheline making stalls much less likely.
> 
> And this again is my point: why can't we make the same argument about
> two spinlocks situated on the same cache line?

Because it's not relevant? Where and why do we want to acquire two
spinlocks that are on the same cacheline?
As far as I know there's never been an actual need for nested
spinlocks. So the guideline can be as restrictive as is it is because
the price is low and there's no benefit in making it more complex.


I think this line of discussion is pretty backwards. The reason I (and
seemingly Heikki) want to use atomics is that it can reduce contention
significantly. That contention is today too a good part based on
spinlocks.  Your argument is basically that increasing spinlock
contention by doing things that can take more than a fixed cycles can
increase contention. But the changes we've talked about only make sense
if they significantly reduce spinlock contention in the first place - a
potential for a couple iterations while holding better not be relevant
for proposed patches.

I don't think a blanket rule makes sense here.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> ... this again is my point: why can't we make the same argument about
> two spinlocks situated on the same cache line?  I don't have a bit of
> trouble believing that doing the same thing with a couple of spinlocks
> could sometimes work out well, too, but Tom is adamantly opposed to
> that.

I think you might be overstating my position here.  What I'm concerned
about is that we be sure that spinlocks are held for a sufficiently short
time that it's very unlikely that we get pre-empted while holding one.
I don't have any particular bright line about how short a time that is,
but more than "a few instructions" worries me.  As you say, the Linux
kernel is a bad example to follow because it hasn't got a risk of losing
its timeslice while holding a spinlock.

The existing coding rules discourage looping (though I might be okay
with a small constant loop count), and subroutine calls (mainly because
somebody might add $random_amount_of_work to the subroutine if they
don't realize it can be called while holding a spinlock).  Both of these
rules are meant to reduce the risk that a short interval balloons
into a long one due to unrelated coding changes.

The existing coding rules also discourage spinlocking within a spinlock,
and the reason for that is that there's no very clear upper bound to the
time required to obtain a spinlock, so that there would also be no clear
upper bound to the time you're holding the original one (thus possibly
leading to cascading performance problems).

So ISTM the question we ought to be asking is whether atomic operations
have bounded execution time, or more generally what the distribution of
execution times is likely to be.  I'd be OK with an answer that includes
"sometimes it can be long" so long as "sometimes" is "pretty damn seldom".
Spinlocks have a nonzero risk of taking a long time already, since we
can't entirely prevent the possibility of losing our timeslice while
holding one.  The issue here is just to be sure that that happens seldom
enough that it doesn't cause performance problems.  If we fail to do that
we might negate all the desired performance improvements from adopting
atomic ops in the first place.
        regards, tom lane



Re: better atomics - v0.5

From
Robert Haas
Date:
On Mon, Jun 30, 2014 at 12:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The existing coding rules also discourage spinlocking within a spinlock,
> and the reason for that is that there's no very clear upper bound to the
> time required to obtain a spinlock, so that there would also be no clear
> upper bound to the time you're holding the original one (thus possibly
> leading to cascading performance problems).

Well, as Andres points out, commit
daa7527afc2274432094ebe7ceb03aa41f916607 made that more of an ironclad
requirement than a suggestion.  If it's sometimes OK to acquire a
spinlock from within a spinlock, then that commit was badly misguided;
but the design of that patch was pretty much driven by your insistence
that that was never, ever OK.  If that was wrong, then we should
reconsider that whole commit more broadly; but if it was right, then
the atomics patch shouldn't drill a hole through it to install an
exception just for emulating atomics.

I'm personally not convinced that we're approaching this topic in the
right way.  I'm not convinced that it's at all reasonable to try to
emulate atomics on platforms that don't have them.  I would punt the
problem into the next layer and force things like lwlock.c to have
fallback implementations at that level that don't require atomics, and
remove those fallback implementations if and when we move the
goalposts so that all supported platforms must have working atomics
implementations.  People who write code that uses atomics are not
likely to think about how those algorithms will actually perform when
those atomics are merely emulated, and I suspect that means that in
practice platforms that have only emulated atomics are going to
regress significantly vs. the status quo today.  If this patch goes in
the way it's written right now, then I think it basically amounts to
throwing platforms that don't have working atomics emulation under the
bus, and my vote is that if we're going to do that, we should do it
explicitly: sorry, we now only support platforms with working atomics.
You don't have any, so you can't run PostgreSQL.  Have a nice day.

If we *don't* want to make that decision, then I'm not convinced that
the approach this patch takes is in any way viable, completely aside
from this particular question.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-06-30 13:15:23 -0400, Robert Haas wrote:
> On Mon, Jun 30, 2014 at 12:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > The existing coding rules also discourage spinlocking within a spinlock,
> > and the reason for that is that there's no very clear upper bound to the
> > time required to obtain a spinlock, so that there would also be no clear
> > upper bound to the time you're holding the original one (thus possibly
> > leading to cascading performance problems).
> 
> Well, as Andres points out, commit
> daa7527afc2274432094ebe7ceb03aa41f916607 made that more of an ironclad
> requirement than a suggestion.  If it's sometimes OK to acquire a
> spinlock from within a spinlock, then that commit was badly misguided;
> but the design of that patch was pretty much driven by your insistence
> that that was never, ever OK.  If that was wrong, then we should
> reconsider that whole commit more broadly; but if it was right, then
> the atomics patch shouldn't drill a hole through it to install an
> exception just for emulating atomics.

Well, since nobody has a problem with the rule of not having nested
spinlocks and since the problems aren't the same I'm failing to see why
we should reconsider this.

My recollection was that we primarily discussed whether it'd be a good
idea to add code to Assert() when spinlocks are nested to which Tom
responded that it's not necessary because critical sections should be
short enough to immmediately see that that's not a problem. Then we
argued a bit back and forth around that.

> I'm personally not convinced that we're approaching this topic in the
> right way.  I'm not convinced that it's at all reasonable to try to
> emulate atomics on platforms that don't have them.  I would punt the
> problem into the next layer and force things like lwlock.c to have
> fallback implementations at that level that don't require atomics, and
> remove those fallback implementations if and when we move the
> goalposts so that all supported platforms must have working atomics
> implementations.

If we're requiring a atomics-less implementation for lwlock.c,
bufmgr. et al. I'll stop working on those features (and by extension on
atomics). It's an utter waste of resources and maintainability trying to
maintain parallel high level locking algorithms in several pieces of the
codebase. The nonatomic versions will stagnate and won't actually work
under load. I'd rather spend my time on something useful. The spinlock
based atomics emulation is *small*. It's easy to reason about
correctness there.

If we're going that way, we've made a couple of absolute fringe
platforms hold postgres, as actually used in reality, hostage. I think
that's insane.

> People who write code that uses atomics are not
> likely to think about how those algorithms will actually perform when
> those atomics are merely emulated, and I suspect that means that in
> practice platforms that have only emulated atomics are going to
> regress significantly vs. the status quo today.

I think you're overstating the likely performance penalty for
nonparallel platforms/workloads here quite a bit. The platforms without
changes of gettings atomics implemented aren't ones where that's a
likely thing. Yes, you'll see a regression if you run a readonly pgbench
over a 4 node NUMA system - but it's not large. You won't see much of
improved performance in comparison to 9.4, but I think that's primarily
it.

Also it's not like this is something new - the semaphore based fallback
*sucks* but we still have it.

*Many* features in new major versions regress performance for fringe
platforms. That's normal.

> If this patch goes in
> the way it's written right now, then I think it basically amounts to
> throwing platforms that don't have working atomics emulation under the
> bus

Why? Nobody is going to run 9.5 under high performance workloads on such
platforms. They'll be so IO and RAM starved that the spinlocks won't be
the bottleneck.

>, and my vote is that if we're going to do that, we should do it
> explicitly: sorry, we now only support platforms with working atomics.
> You don't have any, so you can't run PostgreSQL.  Have a nice day.

I'd be fine with that. I don't think we're gaining much by those
platforms.
*But* I don't really see why it's required at this point. The fallback
works and unless you're using semaphore based spinlocks the performance
isn't bad.
The lwlock code doesn't really get slower/faster in comparison to before
when the atomics implementation is backed by semaphores. It's pretty
much the same number of syscalls using the atomics/current implementation.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I'm personally not convinced that we're approaching this topic in the
> right way.  I'm not convinced that it's at all reasonable to try to
> emulate atomics on platforms that don't have them.  I would punt the
> problem into the next layer and force things like lwlock.c to have
> fallback implementations at that level that don't require atomics, and
> remove those fallback implementations if and when we move the
> goalposts so that all supported platforms must have working atomics
> implementations.  People who write code that uses atomics are not
> likely to think about how those algorithms will actually perform when
> those atomics are merely emulated, and I suspect that means that in
> practice platforms that have only emulated atomics are going to
> regress significantly vs. the status quo today.

I think this is a valid objection, and I for one am not prepared to
say that we no longer care about platforms that don't have atomic ops
(especially not if it's not a *very small* set of atomic ops).

Also, just because a platform claims to have atomic ops doesn't mean that
those ops perform well.  If there's a kernel trap involved, they don't,
at least not for our purposes.  We're only going to be bothering with
installing atomic-op code in places that are contention bottlenecks
for us already, so we are not going to be happy with the results for any
atomic-op implementation that's not industrial strength.  This is one
reason why I'm extremely suspicious of depending on gcc's intrinsics for
this; that will not make the issue go away, only make it beyond our
power to control.
        regards, tom lane



Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-06-30 12:54:10 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > ... this again is my point: why can't we make the same argument about
> > two spinlocks situated on the same cache line?  I don't have a bit of
> > trouble believing that doing the same thing with a couple of spinlocks
> > could sometimes work out well, too, but Tom is adamantly opposed to
> > that.
> 
> I think you might be overstating my position here.  What I'm concerned
> about is that we be sure that spinlocks are held for a sufficiently short
> time that it's very unlikely that we get pre-empted while holding one.
> I don't have any particular bright line about how short a time that is,
> but more than "a few instructions" worries me.  As you say, the Linux
> kernel is a bad example to follow because it hasn't got a risk of losing
> its timeslice while holding a spinlock.

Well, they actually have preemptable and irq-interruptile versions for
some of these locks, but whatever :).

> So ISTM the question we ought to be asking is whether atomic operations
> have bounded execution time, or more generally what the distribution of
> execution times is likely to be.  I'd be OK with an answer that includes
> "sometimes it can be long" so long as "sometimes" is "pretty damn
> seldom".

I think it ranges from 'never' to 'sometimes, but pretty darn
seldom'. Depending on the platform and emulation.

And, leaving the emulation and badly written code aside, there's no
issue with another backend sleeping while the atomic variable needs to
be modified.

> Spinlocks have a nonzero risk of taking a long time already, since we
> can't entirely prevent the possibility of losing our timeslice while
> holding one.  The issue here is just to be sure that that happens seldom
> enough that it doesn't cause performance problems.  If we fail to do that
> we might negate all the desired performance improvements from adopting
> atomic ops in the first place.

I think individual patches will have to stand up to a test like
this. Modifying a atomic variable inside a spinlock is only going to be
worth it if it allows to avoid spinlocks alltogether in 95%+ of the
time.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-06-30 19:44:47 +0200, Andres Freund wrote:
> On 2014-06-30 13:15:23 -0400, Robert Haas wrote:
> > People who write code that uses atomics are not
> > likely to think about how those algorithms will actually perform when
> > those atomics are merely emulated, and I suspect that means that in
> > practice platforms that have only emulated atomics are going to
> > regress significantly vs. the status quo today.
> 
> I think you're overstating the likely performance penalty for
> nonparallel platforms/workloads here quite a bit. The platforms without
> changes of gettings atomics implemented aren't ones where that's a
> likely thing. Yes, you'll see a regression if you run a readonly pgbench
> over a 4 node NUMA system - but it's not large. You won't see much of
> improved performance in comparison to 9.4, but I think that's primarily
> it.

To quantify this, on my 2 socket xeon E5520 workstation - which is too
small to heavily show the contention problems in pgbench -S - the
numbers are:

pgbench -M prepared -c 16 -j 16 -T 10 -S (best of 5, noticeably variability)
master: 152354.294117
lwlocks-atomics: 170528.784958
lwlocks-atomics-spin: 159416.202578

pgbench -M prepared -c 1 -j 1 -T 10 -S (best of 5, noticeably variability)
master: 16273.702191
lwlocks-atomics: 16768.712433
lwlocks-atomics-spin: 16744.913083

So, there really isn't a problem with the emulation. It's not actually
that surprising - the absolute number of atomic ops is prety much the
same. Where we earlier took a spinlock to increment shared we now still
take one.

I expect that repeating this on the 4 socket machine will show a large
gap between lwlocks-atomics and the other two. The other two will be
about the same, with lwlocks-atomics-spin remaining a bit better than
master.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-06-30 13:45:52 -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I'm personally not convinced that we're approaching this topic in the
> > right way.  I'm not convinced that it's at all reasonable to try to
> > emulate atomics on platforms that don't have them.  I would punt the
> > problem into the next layer and force things like lwlock.c to have
> > fallback implementations at that level that don't require atomics, and
> > remove those fallback implementations if and when we move the
> > goalposts so that all supported platforms must have working atomics
> > implementations.  People who write code that uses atomics are not
> > likely to think about how those algorithms will actually perform when
> > those atomics are merely emulated, and I suspect that means that in
> > practice platforms that have only emulated atomics are going to
> > regress significantly vs. the status quo today.
> 
> I think this is a valid objection, and I for one am not prepared to
> say that we no longer care about platforms that don't have atomic ops
> (especially not if it's not a *very small* set of atomic ops).

It's still TAS(), cmpxchg(), xadd. With TAS/xadd possibly implemented
via cmpxchg (which quite possibly *is* the native implementation for
$arch).

> Also, just because a platform claims to have atomic ops doesn't mean that
> those ops perform well.  If there's a kernel trap involved, they don't,
> at least not for our purposes.

Yea. I think if we start to use 8byte atomics at some later point we're
going to have to prohibit e.g. older arms from using them, even if the
compiler claims they're supported. But that's just one #define.

> This is one
> reason why I'm extremely suspicious of depending on gcc's intrinsics for
> this; that will not make the issue go away, only make it beyond our
> power to control.

The patch as submitted makes it easy to provide a platform specific
implementation of the important routines if it's likely that it's better
than the intrinsics provided one.

I'm not too worried about the intrinsics though - the number of projects
relying on them these days is quite large.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Noah Misch
Date:
On Mon, Jun 30, 2014 at 07:44:47PM +0200, Andres Freund wrote:
> On 2014-06-30 13:15:23 -0400, Robert Haas wrote:
> > I'm personally not convinced that we're approaching this topic in the
> > right way.  I'm not convinced that it's at all reasonable to try to
> > emulate atomics on platforms that don't have them.  I would punt the
> > problem into the next layer and force things like lwlock.c to have
> > fallback implementations at that level that don't require atomics, and
> > remove those fallback implementations if and when we move the
> > goalposts so that all supported platforms must have working atomics
> > implementations.
> 
> If we're requiring a atomics-less implementation for lwlock.c,
> bufmgr. et al. I'll stop working on those features (and by extension on
> atomics). It's an utter waste of resources and maintainability trying to
> maintain parallel high level locking algorithms in several pieces of the
> codebase. The nonatomic versions will stagnate and won't actually work
> under load. I'd rather spend my time on something useful. The spinlock
> based atomics emulation is *small*. It's easy to reason about
> correctness there.
> 
> If we're going that way, we've made a couple of absolute fringe
> platforms hold postgres, as actually used in reality, hostage. I think
> that's insane.

I agree completely.

> > If this patch goes in
> > the way it's written right now, then I think it basically amounts to
> > throwing platforms that don't have working atomics emulation under the
> > bus

> >, and my vote is that if we're going to do that, we should do it
> > explicitly: sorry, we now only support platforms with working atomics.
> > You don't have any, so you can't run PostgreSQL.  Have a nice day.

I might share that view if a platform faced a huge and easily-seen performance
regression, say a 40% regression to 5-client performance.  I don't expect the
shortfall to reach that ballpark for any atomics-exploiting algorithm we'll
adopt, and your (Andres's) latest measurements corroborate that.

-- 
Noah Misch
EnterpriseDB                                 http://www.enterprisedb.com



Re: better atomics - v0.5

From
Amit Kapila
Date:
On Mon, Jun 30, 2014 at 2:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-06-30 11:04:53 +0530, Amit Kapila wrote:
> > On Sun, Jun 29, 2014 at 2:54 PM, Andres Freund <andres@2ndquadrant.com> wrote:

> > > Yes, intentionally so. It's often important to get/set the current value
> > > without the cost of emitting a memory barrier. It just has to be a
> > > recent value  and it actually has to come from memory.
> >
> > I agree with you that we don't want to incur the cost of memory barrier
> > for get/set, however it should not be at the cost of correctness.
>
> I really can't follow here. A volatile read/store forces it to go to
> memory without barriers. The ABIs of all platforms we work with
> guarantee that 4bytes stores/reads are atomic - we've been relying on
> that for a long while.

I think for such usage, we need to rely on barriers wherever there is a
need for synchronisation, recent example I have noticed is in your patch
where we have to use pg_write_barrier() during wakeup.  However if we
go by atomic ops definition, then no such dependency is required, like
if  somewhere we need to use pg_atomic_compare_exchange_u32(),
after that we don't need to ensure about barriers, because this atomic
API adheres to Full barrier semantics.

I think defining barrier support for some atomic ops and not for others
will lead to inconsistency with specs and we need to additionally
define rules for such atomic ops usage.

I think we can still rely on volatile read/store for ordering guarantee
within same thread of execution and where ever we need synchronisation
of usage among different processes, we can directly use atomic ops
(get/set) without the need to use barriers.
 
> > > > > b) It's only supported from vista onwards. Afaik we still support XP.
>
> Well, with barrier.h as it stands they'd get a compiler error if it
> indeed is unsupported? But I think there's something else going on -
> msft might just be removing XP from it's documentation.

Okay, but why would they remove for MemoryBarrier() and not
for InterlockedCompareExchange().  I think it is bit difficult to predict
the reason and if we want to use anything which is not as msft docs,
we shall do that carefully and have some way to ensure that it will
work fine.

> > In this case, I have a question for you.
> >
> > Un-patched usage  in barrier.h is as follows:
> > ..
> >
> > If I understand correctly the current define mechanism in barrier.h,
> > it will have different definition for Itanium processors even for windows.
>
> Either noone has ever tested postgres on itanium windows (quite
> possible), because afaik _Asm_sched_fence() doesn't work on
> windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros
> arefor HP's acc on HPUX.

Hmm.. Do you think in such a case, it would have gone in below
define:
#elif defined(__INTEL_COMPILER)
/*
 * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
 */
#if defined(__ia64__) || defined(__ia64)
#define pg_memory_barrier() __mf()
#elif defined(__i386__) || defined(__x86_64__)
#define pg_memory_barrier() _mm_mfence()
#endif



-#define pg_compiler_barrier() __memory_barrier()

Currently this will be considered as compiler barrier for
__INTEL_COMPILER, but after patch, I don't see this define. I think
after patch, it will be compiler_impl specific, but why such a change?

> > However the patch defines as below:
..
> > What I can understand from above is that defines in
> > storage/atomics-generic-msvc.h, will override any previous defines
> > for compiler/memory barriers and _ReadWriteBarrier()/MemoryBarrier()
> > will be considered for Windows always.
>
> Well, the memory barrier is surrounded by #ifndef
> pg_memory_barrier_impl. The compiler barrier can't reasonably be defined
> earlier since it's a compiler not an architecture thing.

I think as per your new compiler specific implementation stuff, this holds
true.

> > > > > > 6.
> > Sure, I was also not asking for _impl functions.  What I was asking
> > in this point was to have comments on top of definition of
> > pg_atomic_compare_exchange_u32() in atomics.h
> > In particular, on top of below and similar functions, rather than
> > at the place where they are declared.
>
> Hm, we can do that. Don't think it'll be clearer (because you need to go
> to the header anyway), but I don't feel strongly.

I think this depends on individual's perspective, so do the way
you feel better, the intention of this point was lets not deviate from
existing coding ways.

> I'd much rather get rid of the separated definition/declaration, but
> we'd need to rely on PG_USE_INLINE for it...

Okay.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-07-01 10:44:19 +0530, Amit Kapila wrote:
> I think for such usage, we need to rely on barriers wherever there is a
> need for synchronisation, recent example I have noticed is in your patch
> where we have to use pg_write_barrier() during wakeup.  However if we
> go by atomic ops definition, then no such dependency is required, like
> if  somewhere we need to use pg_atomic_compare_exchange_u32(),
> after that we don't need to ensure about barriers, because this atomic
> API adheres to Full barrier semantics.

> I think defining barrier support for some atomic ops and not for others
> will lead to inconsistency with specs and we need to additionally
> define rules for such atomic ops usage.

Meh. It's quite common that read/load do not have barrier semantics. The
majority of read/write users won't need barriers and it'll be expensive
to provide them with them.

> > > > > > b) It's only supported from vista onwards. Afaik we still support
> XP.
> >
> > Well, with barrier.h as it stands they'd get a compiler error if it
> > indeed is unsupported? But I think there's something else going on -
> > msft might just be removing XP from it's documentation.
> 
> Okay, but why would they remove for MemoryBarrier() and not
> for InterlockedCompareExchange().  I think it is bit difficult to predict
> the reason and if we want to use anything which is not as msft docs,
> we shall do that carefully and have some way to ensure that it will
> work fine.

Well, we have quite good evidence for it working by knowing that
postgres has in the past worked on xp? If you have a better idea, send
in a patch for today's barrier.h and I'll adopt that.

> > > In this case, I have a question for you.
> > >
> > > Un-patched usage  in barrier.h is as follows:
> > > ..
> > >
> > > If I understand correctly the current define mechanism in barrier.h,
> > > it will have different definition for Itanium processors even for windows.
> >
> > Either noone has ever tested postgres on itanium windows (quite
> > possible), because afaik _Asm_sched_fence() doesn't work on
> > windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros
> > arefor HP's acc on HPUX.
> 
> Hmm.. Do you think in such a case, it would have gone in below
> define:

> #elif defined(__INTEL_COMPILER)
> /*
>  * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
>  */
> #if defined(__ia64__) || defined(__ia64)
> #define pg_memory_barrier() __mf()
> #elif defined(__i386__) || defined(__x86_64__)
> #define pg_memory_barrier() _mm_mfence()
> #endif

Hm? Since _Asm_sched_fence isn't for the intel compiler but for HP's
acc, I don't see why?

But they should go into atomics-generic-acc.h.

> -#define pg_compiler_barrier() __memory_barrier()
> 
> Currently this will be considered as compiler barrier for
> __INTEL_COMPILER, but after patch, I don't see this define. I think
> after patch, it will be compiler_impl specific, but why such a change?

#if defined(__INTEL_COMPILER)
#define pg_compiler_barrier_impl()    __memory_barrier()
#else
#define pg_compiler_barrier_impl()    __asm__ __volatile__("" ::: "memory")
#endif

There earlier was a typo defining pg_compiler_barrier instead of
pg_compiler_barrier_impl for intel, maybe that's what you were referring to?


Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Andres Freund
Date:
Hi,

On 2014-06-30 20:28:52 +0200, Andres Freund wrote:
> To quantify this, on my 2 socket xeon E5520 workstation - which is too
> small to heavily show the contention problems in pgbench -S - the
> numbers are:
>
> pgbench -M prepared -c 16 -j 16 -T 10 -S (best of 5, noticeably variability)
> master: 152354.294117
> lwlocks-atomics: 170528.784958
> lwlocks-atomics-spin: 159416.202578
>
> pgbench -M prepared -c 1 -j 1 -T 10 -S (best of 5, noticeably variability)
> master: 16273.702191
> lwlocks-atomics: 16768.712433
> lwlocks-atomics-spin: 16744.913083
>
> So, there really isn't a problem with the emulation. It's not actually
> that surprising - the absolute number of atomic ops is prety much the
> same. Where we earlier took a spinlock to increment shared we now still
> take one.

Tom, you've voiced concern that using atomics will regress performance
for platforms that don't use atomics in a nearby thread. Can these
numbers at least ease those a bit?
I've compared the atomics vs emulated atomics on 32bit x86, 64bit x86
and POWER7. That's all I've access to at the moment.  In all cases the
performance with lwlocks ontop emulated atomics was the same as the old
spinlock based algorithm or even a bit better. I'm sure that it's
possible to design atomics based algorithms where that's not the case,
but I don't think we need to solve those before we meet them.

I think for most contended places which possibly can be improved two
things matter: The number of instructions that need to work atomically
(i.e. TAS/CAS/xchg for spinlocks, xadd/cmpxchg for atomics) and the
duration locks are held. When converting to atomics, even if emulated,
the hot paths shouldn't need more locked instructions than before and
often enough the duration during which spinlocks are held will be
smaller.

Greetings,

Andres Freund

--Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-06-26 00:42:37 +0300, Heikki Linnakangas wrote:
> On 06/25/2014 11:36 PM, Andres Freund wrote:
> >
> >>>- I completely loathe the file layout you've proposed in
> >>>src/include/storage.  If we're going to have separate #include files
> >>>for each architecture (and I'd rather we didn't go that route, because
> >>>it's messy and unlike what we have now), then shouldn't that stuff go
> >>>in src/include/port or a new directory src/include/port/atomics?
> >>>Polluting src/include/storage with a whole bunch of files called
> >>>atomics-whatever.h strikes me as completely ugly, and there's a lot of
> >>>duplicate code in there.
> >I don't mind moving it somewhere else and it's easy enough to change.
> 
> I think having a separate file for each architecture is nice. I totally
> agree that they don't belong in src/include/storage, though. s_lock.h has
> always been misplaced there, but we've let it be for historical reasons, but
> now that we're adding a dozen new files, it's time to move them out.

So,
src/include/port/atomics.h,
src/include/port/atomics/generic-$compiler.h,
src/include/port/atomics/arch-$arch.h,
src/include/port/atomics/fallback.h,
src/include/port/atomics/generic.h,
src/backend/port/atomics.c
?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Amit Kapila
Date:
On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-07-01 10:44:19 +0530, Amit Kapila wrote:
>
> > I think defining barrier support for some atomic ops and not for others
> > will lead to inconsistency with specs and we need to additionally
> > define rules for such atomic ops usage.
>
> Meh. It's quite common that read/load do not have barrier semantics. The
> majority of read/write users won't need barriers and it'll be expensive
> to provide them with them.

So in such a case they shouldn't use atomic API and rather access
directly, I think part of the purpose to use atomic ops is also that they
should provide synchronisation for read/write among processes, earlier
we don't have any such facility, so wherever required we have no option
but to use barriers to achieve the same.

One more point here, in patch 64-bit atomic read/write are
implemented using *_exchange_* API, won't this automatically ensure
that 64-bit versions have barrier support?


+pg_atomic_read_u64_impl(volatile pg_atomic_uint64 *ptr)
+{
..
+ pg_atomic_compare_exchange_u64_impl(ptr, &old, 0);
..
+}


> > > > > > > b) It's only supported from vista onwards. Afaik we still support
> > XP.
>
> Well, we have quite good evidence for it working by knowing that
> postgres has in the past worked on xp? If you have a better idea, send
> in a patch for today's barrier.h and I'll adopt that.

Here the only point was to check if it is better to use version of 
InterLocked... API which is a recommended for the case of Itanium-
based processor and will behave same for others.  If you have any
inclination/suggestion for the same, then I can try for it.   

> > > > In this case, I have a question for you.
>
> > > Either noone has ever tested postgres on itanium windows (quite
> > > possible), because afaik _Asm_sched_fence() doesn't work on
> > > windows/msvc, or windows doesn't define __ia64__/__ia64. Those macros
> > > arefor HP's acc on HPUX.
> >
> > Hmm.. Do you think in such a case, it would have gone in below
> > define:
>
> > #elif defined(__INTEL_COMPILER)
> > /*
> >  * icc defines __GNUC__, but doesn't support gcc's inline asm syntax
> >  */
> > #if defined(__ia64__) || defined(__ia64)
> > #define pg_memory_barrier() __mf()
> > #elif defined(__i386__) || defined(__x86_64__)
> > #define pg_memory_barrier() _mm_mfence()
> > #endif
>
> Hm? Since _Asm_sched_fence isn't for the intel compiler but for HP's
> acc, I don't see why?

Sorry, I don't understand what you mean by above?
The point I wanted to clarify here was that is it possible that above
defines lead to different definitions for Itanium on windows?

> But they should go into atomics-generic-acc.h.

Okay, but in your current patch (0001-Basic-atomic-ops-implementation),
they seem to be in different files.

> > -#define pg_compiler_barrier() __memory_barrier()
> >
> > Currently this will be considered as compiler barrier for
> > __INTEL_COMPILER, but after patch, I don't see this define. I think
> > after patch, it will be compiler_impl specific, but why such a change?
>
> #if defined(__INTEL_COMPILER)
> #define pg_compiler_barrier_impl()      __memory_barrier()
> #else
> #define pg_compiler_barrier_impl()      __asm__ __volatile__("" ::: "memory")
> #endif
>
> There earlier was a typo defining pg_compiler_barrier instead of
> pg_compiler_barrier_impl for intel, maybe that's what you were referring to?

I have tried to search for this in patch file 0001-Basic-atomic-ops-implementation
and found that it has been removed from barrier.h but not added anywhere else.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: better atomics - v0.5

From
Amit Kapila
Date:
On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund <andres@2ndquadrant.com> wrote:

Further review of patch:
1.
/*
 * pg_atomic_test_and_set_flag - TAS()
 *
 * Acquire/read barrier semantics.
 */
STATIC_IF_INLINE_DECLARE bool
pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);

a. I think Acquire and read barrier semantics aren't equal.
With acquire semantics, "the results of the operation are available before the
results of any operation that appears after it in code" which means it applies
for both load and stores. Definition of read barrier just ensures loads.

So will be right to declare like above in comments?

b. I think it adheres to Acquire semantics for platforms where that semantics
are supported else it defaults to full memory ordering.
Example :
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

Is it right to declare that generic version of function adheres to
Acquire semantics. 


2.
bool
pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
{
return TAS((slock_t *) &ptr->sema);
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

a. In above usage (ptr->sema), sema itself is not declared as volatile,
so would it be right to use it in this way for API (InterlockedCompareExchange)
expecting volatile.

b. Also shouldn't this implementation be moved to atomics-generic-msvc.h

3. 
/*
 * pg_atomic_unlocked_test_flag - TAS()
 *
 * No barrier semantics.
 */
STATIC_IF_INLINE_DECLARE bool
pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);

a. There is no setting in this, so using TAS in function comments
seems to be wrong.
b. BTW, I am not able see this function in C11 specs.

4.
#if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_EXCHANGE_U32)
..
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return pg_atomic_read_u32_impl(ptr) == 1;
}


#elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) && defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
..
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return (bool) pg_atomic_read_u32_impl(ptr);
}

Is there any reason to keep these two implementations different?

5.
atomics-generic-gcc.h
#ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return ptr->value == 0;
}
#endif

Don't we need to compare it with 1 instead of 0?

6.
atomics-fallback.h
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
/*
* Can't do this in the semaphore based implementation - always return
* true. Do this in the header so compilers can optimize the test away.
*/
return true;
}

Why we can't implement this in semaphore based implementation?

7.
/*
 * pg_atomic_clear_flag - release lock set by TAS()
 *
 * Release/write barrier semantics.
 */
STATIC_IF_INLINE_DECLARE void
pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);

a. Are Release and write barriers equivalent?

With release semantics, the results of the operation are available after the
results of any operation that appears before it in code.
A write barrier acts as a compiler barrier, and orders stores.

I think both definitions are not same.

b. IIUC then current code doesn't have release semantics for unlock/clear.
We are planing to introduce it in this patch, also this doesn't follow the
specs which says that clear should have memory_order_seq_cst ordering
semantics.

As per its current usage in patch (S_UNLOCK), it seems reasonable
to have *release* semantics for this API, however I think if we use it for
generic purpose then it might not be right to define it with Release semantics.

8.
#define PG_HAS_ATOMIC_CLEAR_FLAG
static inline void
pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
{
/* XXX: release semantics suffice? */
pg_memory_barrier_impl();
pg_atomic_write_u32_impl(ptr, 0);
}

Are we sure that having memory barrier before clearing flag will
achieve *Release* semantics; as per my understanding the definition
of Release sematics is as below and above doesn't seem to follow the same.
"With release semantics, the results of the operation are available after
the results of any operation that appears before it in code."

9.
static inline uint32
pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
{
uint32 old;
while (true)
{
old = pg_atomic_read_u32_impl(ptr);
if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
break;
}
return old;
}

What is the reason to implement exchange as compare-and-exchange, at least for
Windows I know corresponding function (InterlockedExchange) exists.
I could not see any other implementation of same.
I think this at least deserves comment explaining why we have done
the implementation this way.

10.
STATIC_IF_INLINE_DECLARE uint32
pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, int32 add_);
STATIC_IF_INLINE_DECLARE uint32
pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, int32 sub_);

The function name and input value seems to be inconsistent.
The function name contains *_u32*, however the input value is *int32*.

11.
Both pg_atomic_fetch_and_u32_impl() and pg_atomic_fetch_or_u32_impl() are
implemented using compare_exchange routines, don't we have any native API's
which we can use for implementation.

12.
I am not able to see API's pg_atomic_add_fetch_u32(), pg_atomic_sub_fetch_u32()
in C11 atomic ops specs, do you need it for something specific?

13.
+ * The non-intrinsics version is only available in vista upwards, so use the
+ * intrisc version.

spelling of intrinsic is wrong.

14. 
* pg_memory_barrier - prevent the CPU from reordering memory access
 *
 * A memory barrier must act as a compiler barrier,

Is it ensured in all cases that pg_memory_barrier implies compiler barrier
as well?
Example for msvc case, the specs doesn't seem to guarntee it.

I understand that this rule (or assumption) is carried forward from existing
implementation, however I am not aware if it is true for all cases, so thought
of mentioning here even though we don't want to do anything considering it an
existing behaviour.

Links:


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: better atomics - v0.5

From
Robert Haas
Date:
On Tue, Jul 8, 2014 at 2:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>> On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund <andres@2ndquadrant.com>
>> wrote:
>
> Further review of patch:
> 1.
> /*
>  * pg_atomic_test_and_set_flag - TAS()
>  *
>  * Acquire/read barrier semantics.
>  */
> STATIC_IF_INLINE_DECLARE bool
> pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
>
> a. I think Acquire and read barrier semantics aren't equal.
> With acquire semantics, "the results of the operation are available before
> the
> results of any operation that appears after it in code" which means it
> applies
> for both load and stores. Definition of read barrier just ensures loads.
>
> So will be right to declare like above in comments?

Yeah.  Barriers can be by the operations they keep from being
reordered (read, write, or both) and by the direction in which they
prevent reordering (acquire, release, or both).  So in theory there
are 9 combinations:

read barrier (both directions)
read barrier (acquire)
read barrier (release)
write barrier (both directions)
write barrier (acquire)
write barrier (both directions)
full barrier (both directions)
full barrier (acquire)
full barrier (release)

To make things more confusing, a read barrier plus a write barrier
need not equal a full barrier.  Read barriers prevent reads from being
reordered with other reads, and write barriers keep writes from being
reordered with other writes, but neither prevents a read from being
reordered relative to a write.  A full barrier, however, must prevent
all reordering: read/read, read/write, write/read, and write/write.

Clarity here is essential.

barrier.h only proposed to implement the following:

read barrier (both directions), write barrier (both directions), full
barrier (both directions)

The reason I did that is because it seemed to be more or less what
Linux was doing, and it's generally suitable for lock-less algorithms.
The acquire and release semantics come into play specifically when
you're using barriers to implement locking primitives, which is isn't
what I was trying to enable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.5

From
Amit Kapila
Date:
On Wed, Jul 9, 2014 at 8:58 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jul 8, 2014 at 2:21 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> > On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com>
> > wrote:
> >
> > Further review of patch:
> > 1.
> > /*
> >  * pg_atomic_test_and_set_flag - TAS()
> >  *
> >  * Acquire/read barrier semantics.
> >  */
> > STATIC_IF_INLINE_DECLARE bool
> > pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
> >
> > a. I think Acquire and read barrier semantics aren't equal.
> > With acquire semantics, "the results of the operation are available before
> > the
> > results of any operation that appears after it in code" which means it
> > applies
> > for both load and stores. Definition of read barrier just ensures loads.
> >
> > So will be right to declare like above in comments?
>
> Yeah.  Barriers can be by the operations they keep from being
> reordered (read, write, or both) and by the direction in which they
> prevent reordering (acquire, release, or both).  So in theory there
> are 9 combinations:
>
> read barrier (both directions)
> read barrier (acquire)
> read barrier (release)
> write barrier (both directions)
> write barrier (acquire)
> write barrier (both directions)
> full barrier (both directions)
> full barrier (acquire)
> full barrier (release)

With these definitions, I think it will fall into *full barrier (acquire)*
category as it needs to ensure that no reordering should happen
for both load and store.  As an example, we can refer its
implementation for msvc which ensures that it follows full memory
barrier semantics.
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

> To make things more confusing, a read barrier plus a write barrier
> need not equal a full barrier.  Read barriers prevent reads from being
> reordered with other reads, and write barriers keep writes from being
> reordered with other writes, but neither prevents a read from being
> reordered relative to a write.  A full barrier, however, must prevent
> all reordering: read/read, read/write, write/read, and write/write.
>
> Clarity here is essential.
>
> barrier.h only proposed to implement the following:
>
> read barrier (both directions), write barrier (both directions), full
> barrier (both directions)

As per my understanding of the general theory around barriers,
read and write are defined to avoid reordering due to compiler and 
full memory barriers are defined to avoid reordering due to processors.
There are some processors that support instructions for memory
barriers with acquire, release and fence semantics.

I think the current definitions of barriers is as per needs of usage in
PostgreSQL and not by referring standard (am I missing something
here), however as you explained the definitions are quite clear.

> The reason I did that is because it seemed to be more or less what
> Linux was doing, and it's generally suitable for lock-less algorithms.
> The acquire and release semantics come into play specifically when
> you're using barriers to implement locking primitives, which is isn't
> what I was trying to enable.

Now by implementing atomic-ops, I think we are moving more towards
lock-less algorithms, so I am not sure if we just want to adhere to
existing definitions as you listed above (what current barrier.h
implements) and implement accordingly or we can extend the existing
definitions.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: better atomics - v0.5

From
Martijn van Oosterhout
Date:
On Thu, Jul 10, 2014 at 08:46:55AM +0530, Amit Kapila wrote:
> As per my understanding of the general theory around barriers,
> read and write are defined to avoid reordering due to compiler and
> full memory barriers are defined to avoid reordering due to processors.
> There are some processors that support instructions for memory
> barriers with acquire, release and fence semantics.

Not exactly, both compilers and processors can rearrange loads and
stores.  Because the architecture most developers work on (x86) has
such a strong memory model (it's goes to lot of effort to hide
reordering) people are under the impression that it's only the compiler
you need to worry about, but that's not true for any other
architechture.

Memory barriers/fences are on the whole discouraged if you can use
acquire/release semantics because the latter are faster and much easier
to optimise for.

I strongly recommend the "Atomic Weapons" talk on the C11 memory model
to help understand how they work. As a bonus it includes correct
implementations for the major architectures.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> He who writes carelessly confesses thereby at the very outset that he does
> not attach much importance to his own thoughts.  -- Arthur Schopenhauer

Re: better atomics - v0.5

From
Amit Kapila
Date:
On Sat, Jul 12, 2014 at 1:26 AM, Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Thu, Jul 10, 2014 at 08:46:55AM +0530, Amit Kapila wrote:
> > As per my understanding of the general theory around barriers,
> > read and write are defined to avoid reordering due to compiler and
> > full memory barriers are defined to avoid reordering due to processors.
> > There are some processors that support instructions for memory
> > barriers with acquire, release and fence semantics.
>
> Not exactly, both compilers and processors can rearrange loads and
> stores.  Because the architecture most developers work on (x86) has
> such a strong memory model (it's goes to lot of effort to hide
> reordering) people are under the impression that it's only the compiler
> you need to worry about, but that's not true for any other
> architechture.

I am not saying that we need only barriers to avoid reordering due to 
compiler, rather my point was that we need to avoid reordering due to
both compiler and processor, however ways to achieve it require different
API's

> Memory barriers/fences are on the whole discouraged if you can use
> acquire/release semantics because the latter are faster and much easier
> to optimise for.

Do all processors support instructions for memory barriers with acquire,
release semantics?

> I strongly recommend the "Atomic Weapons" talk on the C11 memory model
> to help understand how they work. As a bonus it includes correct
> implementations for the major architectures.

Yes that will be a good if we can can implement as per C11 memory model and
thats what I am referring while reviewing this patch, however even if that is not
possible or difficult to achieve in all cases, it is worth to have a discussion for
memory model used in current API's implemented in patch. 

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: better atomics - v0.5

From
Andres Freund
Date:
Hi Amit,

On 2014-07-08 11:51:14 +0530, Amit Kapila wrote:
> On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
> > On Tue, Jul 1, 2014 at 4:10 PM, Andres Freund <andres@2ndquadrant.com>
> wrote:
> 
> Further review of patch:
> 1.
> /*
>  * pg_atomic_test_and_set_flag - TAS()
>  *
>  * Acquire/read barrier semantics.
>  */
> STATIC_IF_INLINE_DECLARE bool
> pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
> 
> a. I think Acquire and read barrier semantics aren't equal.

Right. It was mean as Acquire (thus including read barrier) semantics. I
guess I'll better update README.barrier to have definitions of all barriers.

> b. I think it adheres to Acquire semantics for platforms where
> that semantics
> are supported else it defaults to full memory ordering.
> Example :
> #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
> 
> Is it right to declare that generic version of function adheres to
> Acquire semantics.

Implementing stronger semantics than required should always be
fine. There's simply no sane way to work with the variety of platforms
we support in any other way.

> 
> 2.
> bool
> pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
> {
> return TAS((slock_t *) &ptr->sema);
> #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

Where's that from? I can't recall adding a line of code like that?

> a. In above usage (ptr->sema), sema itself is not declared as volatile,
> so would it be right to use it in this way for API
> (InterlockedCompareExchange)
> expecting volatile.

Upgrading a pointer to volatile is defined, so I don't see the problem?

> 3.
> /*
>  * pg_atomic_unlocked_test_flag - TAS()
>  *
>  * No barrier semantics.
>  */
> STATIC_IF_INLINE_DECLARE bool
> pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);
> 
> a. There is no setting in this, so using TAS in function comments
> seems to be wrong.

Good point.

> b. BTW, I am not able see this function in C11 specs.

So? It's needed for a sensible implementation of spinlocks ontop of
atomics.

> 4.
> #if !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) &&
> defined(PG_HAVE_ATOMIC_EXCHANGE_U32)
> ..
> #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
> static inline bool
> pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
> {
> return pg_atomic_read_u32_impl(ptr) == 1;
> }
> 
> 
> #elif !defined(PG_HAS_ATOMIC_TEST_SET_FLAG) &&
> defined(PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32)
> ..
> #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
> static inline bool
> pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
> {
> return (bool) pg_atomic_read_u32_impl(ptr);
> }
> 
> Is there any reason to keep these two implementations different?

No, will change.

> 5.
> atomics-generic-gcc.h
> #ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
> #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
> static inline bool
> pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
> {
> return ptr->value == 0;
> }
> #endif
> 
> Don't we need to compare it with 1 instead of 0?

Why? It returns true if the lock is free, makes sense imo? Will add
comments to atomics.h

> 6.
> atomics-fallback.h
> pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
> {
> /*
>  * Can't do this in the semaphore based implementation - always return
>  * true. Do this in the header so compilers can optimize the test away.
>  */
> return true;
> }
> 
> Why we can't implement this in semaphore based implementation?

Because semaphores don't provide the API for it?

> 7.
> /*
>  * pg_atomic_clear_flag - release lock set by TAS()
>  *
>  * Release/write barrier semantics.
>  */
> STATIC_IF_INLINE_DECLARE void
> pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);
> 
> a. Are Release and write barriers equivalent?

Same answer as above. Meant as "Release (thus including write barrier) semantics"

> b. IIUC then current code doesn't have release semantics for unlock/clear.

If you're referring to s_lock.h with 'current code', yes it should have:*    On platforms with weak memory ordering,
theTAS(), TAS_SPIN(), and*    S_UNLOCK() macros must further include hardware-level memory fence*    instructions to
preventsimilar re-ordering at the hardware level.*    TAS() and TAS_SPIN() must guarantee that loads and stores issued
after*   the macro are not executed until the lock has been obtained.  Conversely,*    S_UNLOCK() must guarantee that
loadsand stores issued before the macro*    have been executed before the lock is released.
 

> We are planing to introduce it in this patch, also this doesn't follow the
> specs which says that clear should have memory_order_seq_cst ordering
> semantics.

Making it guarantee full memory barrier semantics is a major performance
loss on x86-64, so I don't want to do that.

Also there is atomic_flag_test_and_set_explicit()...

> As per its current usage in patch (S_UNLOCK), it seems reasonable
> to have *release* semantics for this API, however I think if we use it for
> generic purpose then it might not be right to define it with Release
> semantics.

I can't really see a user where it's not what you want. And if there is
- well, it can make the semantics stronger if it needs that.

> 8.
> #define PG_HAS_ATOMIC_CLEAR_FLAG
> static inline void
> pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
> {
> /* XXX: release semantics suffice? */
> pg_memory_barrier_impl();
> pg_atomic_write_u32_impl(ptr, 0);
> }
> 
> Are we sure that having memory barrier before clearing flag will
> achieve *Release* semantics; as per my understanding the definition
> of Release sematics is as below and above doesn't seem to follow the
> same.

Yes, a memory barrier guarantees a release semantics. It guarantees
more, but that's not a problem.

> 9.
> static inline uint32
> pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
> {
> uint32 old;
> while (true)
> {
> old = pg_atomic_read_u32_impl(ptr);
> if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
> break;
> }
> return old;
> }
> 
> What is the reason to implement exchange as compare-and-exchange, at least
> for
> Windows I know corresponding function (InterlockedExchange) exists.
> I could not see any other implementation of same.
> I think this at least deserves comment explaining why we have done
> the implementation this way.

Because Robert and Tom insist that we shouldn't add more operations
employing hardware features. I think that's ok for now, we can always
add more capabilities later.

> 10.
> STATIC_IF_INLINE_DECLARE uint32
> pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, int32 add_);
> STATIC_IF_INLINE_DECLARE uint32
> pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, int32 sub_);
> 
> The function name and input value seems to be inconsistent.
> The function name contains *_u32*, however the input value is *int32*.

A u32 is implemented by adding/subtracting a signed int. There's some
platforms why that's the only API provided by intrinsics and it's good
enough for pretty much everything.

> 11.
> Both pg_atomic_fetch_and_u32_impl() and pg_atomic_fetch_or_u32_impl() are
> implemented using compare_exchange routines, don't we have any native API's
> which we can use for implementation.

Same reason as for 9).

> 12.
> I am not able to see API's pg_atomic_add_fetch_u32(),
> pg_atomic_sub_fetch_u32()
> in C11 atomic ops specs, do you need it for something specific?

Yes, it's already used in the lwlocks code and it's provided by several
other atomics libraries. I don't really see a reason not to provide it,
especially as some platforms could implement it more efficiently.
> 14.
> * pg_memory_barrier - prevent the CPU from reordering memory access
>  *
>  * A memory barrier must act as a compiler barrier,
> 
> Is it ensured in all cases that pg_memory_barrier implies compiler barrier
> as well?

Yes.

> Example for msvc case, the specs doesn't seem to guarntee it.

I think it actually indirectly guarantees it, but I'm on a plane and
can't check :)

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Andres Freund
Date:
On 2014-07-10 08:46:55 +0530, Amit Kapila wrote:
> As per my understanding of the general theory around barriers,
> read and write are defined to avoid reordering due to compiler and
> full memory barriers are defined to avoid reordering due to
> processors.

No, that's not the case. There's several processors where write/read
barriers are an actual thing on the hardware level.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.5

From
Amit Kapila
Date:
On Mon, Jul 14, 2014 at 12:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2014-07-08 11:51:14 +0530, Amit Kapila wrote:
> > On Thu, Jul 3, 2014 at 10:56 AM, Amit Kapila <amit.kapila16@gmail.com>
> > wrote:

> > Further review of patch:
> > 1.
> > /*
> >  * pg_atomic_test_and_set_flag - TAS()
> >  *
> >  * Acquire/read barrier semantics.
> >  */
> > STATIC_IF_INLINE_DECLARE bool
> > pg_atomic_test_set_flag(volatile pg_atomic_flag *ptr);
> >
> > a. I think Acquire and read barrier semantics aren't equal.
>
> Right. It was mean as Acquire (thus including read barrier) semantics.

Okay, I got confused by reading comments, may be saying the same
explicitly in comments is better.

> I guess I'll better update README.barrier to have definitions of all
> barriers.

I think updating about the reasons behind using specific
barriers for atomic operations defined by PG can be useful
for people who want to understand or use the atomic op API's.

> > b. I think it adheres to Acquire semantics for platforms where
> > that semantics
> > are supported else it defaults to full memory ordering.
> > Example :
> > #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
> >
> > Is it right to declare that generic version of function adheres to
> > Acquire semantics.
>
> Implementing stronger semantics than required should always be
> fine. There's simply no sane way to work with the variety of platforms
> we support in any other way.

Agreed, may be we can have a note somewhere either in code or
readme to mention about it, if you think that can be helpful.

> >
> > 2.
> > bool
> > pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
> > {
> > return TAS((slock_t *) &ptr->sema);
> > #define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))
>
> Where's that from?

Its there in s_lock.h. Refer below code:
#ifdef WIN32_ONLY_COMPILER
typedef LONG slock_t;

#define HAS_TEST_AND_SET
#define TAS(lock) (InterlockedCompareExchange(lock, 1, 0))

> I can't recall adding a line of code like that?

It is not added by your patch but used in your patch.

I am not sure if that is what excatly you want atomic API to define
for WIN32, but I think it is picking this definition.  I have one question
here, if the value of sema is other than 1 or 0, then it won't set the flag
and not sure what it will return (ex. if the value is negative), so is this
implementation completely sane?


> > a. In above usage (ptr->sema), sema itself is not declared as volatile,
> > so would it be right to use it in this way for API
> > (InterlockedCompareExchange)
> > expecting volatile.
>
> Upgrading a pointer to volatile is defined, so I don't see the problem?

Thats okay.  However all other structure definitions use it as
volatile, example for atomics-arch-amd64.h it is defined as follows:
#define PG_HAVE_ATOMIC_FLAG_SUPPORT
typedef struct pg_atomic_flag
{
volatile char value;
} pg_atomic_flag;

So is there any harm if we keep this definition in sync with others?

> > 3.
> > /*
> >  * pg_atomic_unlocked_test_flag - TAS()
> >  *
> >  * No barrier semantics.
> >  */
> > STATIC_IF_INLINE_DECLARE bool
> > pg_atomic_unlocked_test_flag(volatile pg_atomic_flag *ptr);
> >
> > a. There is no setting in this, so using TAS in function comments
> > seems to be wrong.
>
> Good point.
>
> > b. BTW, I am not able see this function in C11 specs.
>
> So? It's needed for a sensible implementation of spinlocks ontop of
> atomics.

Okay got the point, do you think it makes sense to have a common
agreement/consensus w.r.t which API's are necessary as a first
cut.  For me as a reviewer, if the API is implemented as per specs or
what it is desired to then we can have it, however I am not sure if there
is general consensus or at least some people agreed to set of API's
which are required for first cut, have I missed some conclusion/discussion
about it.
>
> > 5.
> > atomics-generic-gcc.h
> > #ifndef PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
> > #define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
> > static inline bool
> > pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
> > {
> > return ptr->value == 0;
> > }
> > #endif
> >
> > Don't we need to compare it with 1 instead of 0?
>
> Why? It returns true if the lock is free, makes sense imo?

Other implementation in atomics-generic.h seems to implement it other
way 
#define PG_HAS_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
return pg_atomic_read_u32_impl(ptr) == 1;
}

> Will add comments to atomics.h

I think that will be helpful.

> > 6.
> > atomics-fallback.h
> > pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
> > {
> > /*
> >  * Can't do this in the semaphore based implementation - always return
> >  * true. Do this in the header so compilers can optimize the test away.
> >  */
> > return true;
> > }
> >
> > Why we can't implement this in semaphore based implementation?
>
> Because semaphores don't provide the API for it?

Okay, but it is not clear from comments why this test should always
return true, basically I am not able to think why incase of missing API
returning true is correct behaviour for this API?

> > 7.
> > /*
> >  * pg_atomic_clear_flag - release lock set by TAS()
> >  *
> >  * Release/write barrier semantics.
> >  */
> > STATIC_IF_INLINE_DECLARE void
> > pg_atomic_clear_flag(volatile pg_atomic_flag *ptr);
> >
> > a. Are Release and write barriers equivalent?
>
> Same answer as above. Meant as "Release (thus including write barrier) semantics"
>
> > b. IIUC then current code doesn't have release semantics for unlock/clear.
>
> If you're referring to s_lock.h with 'current code', yes it should have:

I was referring to next point (point number 8) which you have clarified below.

>
> > 8.
> > #define PG_HAS_ATOMIC_CLEAR_FLAG
> > static inline void
> > pg_atomic_clear_flag_impl(volatile pg_atomic_flag *ptr)
> > {
> > /* XXX: release semantics suffice? */
> > pg_memory_barrier_impl();
> > pg_atomic_write_u32_impl(ptr, 0);
> > }
> >
> > Are we sure that having memory barrier before clearing flag will
> > achieve *Release* semantics; as per my understanding the definition
> > of Release sematics is as below and above doesn't seem to follow the
> > same.
>
> Yes, a memory barrier guarantees a release semantics. It guarantees
> more, but that's not a problem.

You are right.

> > 9.
> > static inline uint32
> > pg_atomic_exchange_u32_impl(volatile pg_atomic_uint32 *ptr, uint32 xchg_)
> > {
> > uint32 old;
> > while (true)
> > {
> > old = pg_atomic_read_u32_impl(ptr);
> > if (pg_atomic_compare_exchange_u32_impl(ptr, &old, xchg_))
> > break;
> > }
> > return old;
> > }
> >
> > What is the reason to implement exchange as compare-and-exchange, at least
> > for
> > Windows I know corresponding function (InterlockedExchange) exists.
> > I could not see any other implementation of same.
> > I think this at least deserves comment explaining why we have done
> > the implementation this way.
>
> Because Robert and Tom insist that we shouldn't add more operations
> employing hardware features. I think that's ok for now, we can always
> add more capabilities later.

No issues, may be a comment indicating some form of reason would be good.

> > 10.
> > STATIC_IF_INLINE_DECLARE uint32
> > pg_atomic_fetch_add_u32(volatile pg_atomic_uint32 *ptr, int32 add_);
> > STATIC_IF_INLINE_DECLARE uint32
> > pg_atomic_fetch_sub_u32(volatile pg_atomic_uint32 *ptr, int32 sub_);
> >
> > The function name and input value seems to be inconsistent.
> > The function name contains *_u32*, however the input value is *int32*.
>
> A u32 is implemented by adding/subtracting a signed int. There's some
> platforms why that's the only API provided by intrinsics and it's good
> enough for pretty much everything.

As such no issues, but just wondering is there any reason to prefer
*_u32 instead of simple _32 or *32.

>
> > 12.
> > I am not able to see API's pg_atomic_add_fetch_u32(),
> > pg_atomic_sub_fetch_u32()
> > in C11 atomic ops specs, do you need it for something specific?
>
> Yes, it's already used in the lwlocks code and it's provided by several
> other atomics libraries. I don't really see a reason not to provide it,
> especially as some platforms could implement it more efficiently.

Okay, I find below implementation in your patch. Won't it add the value
twice?

+#if !defined(PG_HAS_ATOMIC_ADD_FETCH_U32) && defined(PG_HAS_ATOMIC_FETCH_ADD_U32)
+#define PG_HAS_ATOMIC_ADD_FETCH_U32
+static inline uint32
+pg_atomic_add_fetch_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
+{
+ return pg_atomic_fetch_add_u32_impl(ptr, add_) + add_;
+}
+#endif


> > 14.
> > * pg_memory_barrier - prevent the CPU from reordering memory access
> >  *
> >  * A memory barrier must act as a compiler barrier,
> >
> > Is it ensured in all cases that pg_memory_barrier implies compiler barrier
> > as well?
>
> Yes.
>
> > Example for msvc case, the specs doesn't seem to guarntee it.
>
> I think it actually indirectly guarantees it, but I'm on a plane and
> can't check :)

No problem let me know when you are comfortable, this is just for
clarification.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Re: better atomics - v0.5

From
Heikki Linnakangas
Date:
Hi Andres,

Are you planning to continue working on this? Summarizing the discussion 
so far:

* Robert listed a bunch of little cleanup tasks 
(http://www.postgresql.org/message-id/CA+TgmoZShVVqjaKUL6P3kDvZVPibtgkzoti3M+Fvvjg5V+XJ0A@mail.gmail.com). 
Amit posted yet more detailed commends.

* We talked about changing the file layout. I think everyone is happy 
with your proposal here: 
http://www.postgresql.org/message-id/20140702131729.GP21169@awork2.anarazel.de, 
with an overview description of what goes where 
(http://www.postgresql.org/message-id/CA+TgmoYuxfsm2dSy48=JgUTRh1mozrvmjLHgqrFkU7_WPV-xLA@mail.gmail.com)

* Talked about nested spinlocks. The consensus seems to be to (continue 
to) forbid nested spinlocks, but allow atomic operations while holding a 
spinlock. When atomics are emulated with spinlocks, it's OK to acquire 
the emulation spinlock while holding another spinlock.

* Talked about whether emulating atomics with spinlocks is a good idea. 
You posted performance results showing that at least with the patch to
use atomics to implement LWLocks, the emulation performs more or less 
the same as the current spinlock-based implementation. I believe 
everyone was more or less satisfied with that.


So ISTM we have consensus that the approach to spinlock emulation and 
nesting stuff is OK. To finish the patch, you'll just need to address 
the file layout and the laundry list of other little details that have 
been raised.

- Heikki




Re: better atomics - v0.5

From
Andres Freund
Date:
Hi,

On 2014-08-20 12:43:05 +0300, Heikki Linnakangas wrote:
> Are you planning to continue working on this?

Yes, I am! I've recently been on holiday and now I'm busy with catching
up with everything that has happened since. I hope to have a next
version ready early next week.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.6

From
Andres Freund
Date:
Hi,

I've finally managed to incorporate (all the?) feedback I got for
0.5. Imo the current version looks pretty good.

Most notable changes:
* Lots of comment improvements
* code moved out of storage/ into port/
* separated the s_lock.h changes into its own commit
* merged i386/amd64 into one file
* fixed lots of the little details Amit noticed
* fixed lots of XXX/FIXMEs
* rebased to today's master
* tested various gcc/msvc versions
* extended the regression tests
* ...

The patches:
0001: The actual atomics API
0002: Implement s_lock.h support ontop the atomics API. Existing
      implementations continue to be used unless
      FORCE_ATOMICS_BASED_SPINLOCKS is defined
0003-0005: Not proposed for review here. Just included because code
     actually using the atomics make testing them easier.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: better atomics - v0.6

From
Oskari Saarenmaa
Date:
23.09.2014, 00:01, Andres Freund kirjoitti:
> I've finally managed to incorporate (all the?) feedback I got for
> 0.5. Imo the current version looks pretty good.
>
> Most notable changes:
> * Lots of comment improvements
> * code moved out of storage/ into port/
> * separated the s_lock.h changes into its own commit
> * merged i386/amd64 into one file
> * fixed lots of the little details Amit noticed
> * fixed lots of XXX/FIXMEs
> * rebased to today's master
> * tested various gcc/msvc versions
> * extended the regression tests
> * ...
>
> The patches:
> 0001: The actual atomics API

I tried building PG on Solaris 10/Sparc using GCC 4.9.0 (buildfarm
animal dingo) with this patch but regression tests failed due to:

/export/home/os/postgresql/src/test/regress/regress.so: symbol
pg_write_barrier_impl: referenced symbol not found

which turns out to be caused by a leftover PG_ prefix in ifdefs for
HAVE_GCC__ATOMIC_INT64_CAS.  Removing the PG_ prefix fixed the build and
regression tests.  Attached a patch to strip the invalid prefix.

> 0002: Implement s_lock.h support ontop the atomics API. Existing
>        implementations continue to be used unless
>        FORCE_ATOMICS_BASED_SPINLOCKS is defined

Applied this and built PG with and without FORCE_ATOMICS_BASED_SPINLOCKS
- both builds passed regression tests.

> 0003-0005: Not proposed for review here. Just included because code
>       actually using the atomics make testing them easier.

I'll look at these patches later.

/ Oskari

Attachment

Re: better atomics - v0.6

From
Andres Freund
Date:
On 2014-09-23 13:50:28 +0300, Oskari Saarenmaa wrote:
> 23.09.2014, 00:01, Andres Freund kirjoitti:
> >I've finally managed to incorporate (all the?) feedback I got for
> >0.5. Imo the current version looks pretty good.
> >
> >Most notable changes:
> >* Lots of comment improvements
> >* code moved out of storage/ into port/
> >* separated the s_lock.h changes into its own commit
> >* merged i386/amd64 into one file
> >* fixed lots of the little details Amit noticed
> >* fixed lots of XXX/FIXMEs
> >* rebased to today's master
> >* tested various gcc/msvc versions
> >* extended the regression tests
> >* ...
> >
> >The patches:
> >0001: The actual atomics API
> 
> I tried building PG on Solaris 10/Sparc using GCC 4.9.0 (buildfarm animal
> dingo) with this patch but regression tests failed due to:
> 
> /export/home/os/postgresql/src/test/regress/regress.so: symbol
> pg_write_barrier_impl: referenced symbol not found
> 
> which turns out to be caused by a leftover PG_ prefix in ifdefs for
> HAVE_GCC__ATOMIC_INT64_CAS.  Removing the PG_ prefix fixed the build and
> regression tests.  Attached a patch to strip the invalid prefix.

Gah. Stupid last minute changes... Thanks for diagnosing. Will
integrate.

> >0002: Implement s_lock.h support ontop the atomics API. Existing
> >       implementations continue to be used unless
> >       FORCE_ATOMICS_BASED_SPINLOCKS is defined
> 
> Applied this and built PG with and without FORCE_ATOMICS_BASED_SPINLOCKS -
> both builds passed regression tests.

Cool.

> >0003-0005: Not proposed for review here. Just included because code
> >      actually using the atomics make testing them easier.
> 
> I'll look at these patches later.

Cool. Although I'm not proposing them to be integrated as-is...

Greetings,

Andres Freund



Re: better atomics - v0.6

From
Andres Freund
Date:
Hi,

On 2014-09-23 13:50:28 +0300, Oskari Saarenmaa wrote:
> 23.09.2014, 00:01, Andres Freund kirjoitti:
> >I've finally managed to incorporate (all the?) feedback I got for
> >0.5. Imo the current version looks pretty good.
> >
> >Most notable changes:
> >* Lots of comment improvements
> >* code moved out of storage/ into port/
> >* separated the s_lock.h changes into its own commit
> >* merged i386/amd64 into one file
> >* fixed lots of the little details Amit noticed
> >* fixed lots of XXX/FIXMEs
> >* rebased to today's master
> >* tested various gcc/msvc versions
> >* extended the regression tests
> >* ...
> >
> >The patches:
> >0001: The actual atomics API
> 
> I tried building PG on Solaris 10/Sparc using GCC 4.9.0 (buildfarm animal
> dingo) with this patch but regression tests failed due to:

Btw, if you could try sun studio it'd be great. I wrote the support for
it blindly, and I'd be surprised if I got it right on the first try.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.6

From
Oskari Saarenmaa
Date:
23.09.2014, 15:18, Andres Freund kirjoitti:
> On 2014-09-23 13:50:28 +0300, Oskari Saarenmaa wrote:
>> 23.09.2014, 00:01, Andres Freund kirjoitti:
>>> The patches:
>>> 0001: The actual atomics API
>>
>> I tried building PG on Solaris 10/Sparc using GCC 4.9.0 (buildfarm animal
>> dingo) with this patch but regression tests failed due to:
>
> Btw, if you could try sun studio it'd be great. I wrote the support for
> it blindly, and I'd be surprised if I got it right on the first try.

I just installed Solaris Studio 12.3 and tried compiling this:

"../../../../src/include/port/atomics/generic-sunpro.h", line 54: return
value type mismatch
"../../../../src/include/port/atomics/generic-sunpro.h", line 77: return
value type mismatch
"../../../../src/include/port/atomics/generic-sunpro.h", line 79:
#if-less #endif
"../../../../src/include/port/atomics/generic-sunpro.h", line 81:
#if-less #endif

atomic_add_64 and atomic_add_32 don't return anything (the
atomic_add_*_nv variants return the new value) and there were a few
extra #endifs.  Regression tests pass after applying the attached patch
which defines PG_HAS_ATOMIC_ADD_FETCH_U32.

There doesn't seem to be a fallback for defining pg_atomic_fetch_add
based on pg_atomic_add_fetch so pg_atomic_add_fetch now gets implemented
using pg_atomic_compare_exchange.

Also, it's not possible to compile PG with FORCE_ATOMICS_BASED_SPINLOCKS
with these patches:

"../../../../src/include/storage/s_lock.h", line 868: #error: PostgreSQL
does not have native spinlock support on this platform....

atomics/generic.h would implement atomic flags using operations exposed
by atomics/generic-sunpro.h, but atomics/fallback.h is included before
it and it defines functions for flag operations which s_lock.h doesn't
want to use.

/ Oskari

Attachment

Re: better atomics - v0.6

From
Andres Freund
Date:
On 2014-09-24 00:27:25 +0300, Oskari Saarenmaa wrote:
> 23.09.2014, 15:18, Andres Freund kirjoitti:
> >On 2014-09-23 13:50:28 +0300, Oskari Saarenmaa wrote:
> >>23.09.2014, 00:01, Andres Freund kirjoitti:
> >>>The patches:
> >>>0001: The actual atomics API
> >>
> >>I tried building PG on Solaris 10/Sparc using GCC 4.9.0 (buildfarm animal
> >>dingo) with this patch but regression tests failed due to:
> >
> >Btw, if you could try sun studio it'd be great. I wrote the support for
> >it blindly, and I'd be surprised if I got it right on the first try.
> 
> I just installed Solaris Studio 12.3 and tried compiling this:

Cool.

> "../../../../src/include/port/atomics/generic-sunpro.h", line 54: return
> value type mismatch
> "../../../../src/include/port/atomics/generic-sunpro.h", line 77: return
> value type mismatch
> "../../../../src/include/port/atomics/generic-sunpro.h", line 79: #if-less
> #endif
> "../../../../src/include/port/atomics/generic-sunpro.h", line 81: #if-less
> #endif
> 
> atomic_add_64 and atomic_add_32 don't return anything (the atomic_add_*_nv
> variants return the new value) and there were a few extra #endifs.
> Regression tests pass after applying the attached patch which defines
> PG_HAS_ATOMIC_ADD_FETCH_U32.

Thanks for the fixes!

Hm. I think then it's better to simply not implement addition and rely
on cmpxchg.

> Also, it's not possible to compile PG with FORCE_ATOMICS_BASED_SPINLOCKS
> with these patches:
> 
> "../../../../src/include/storage/s_lock.h", line 868: #error: PostgreSQL
> does not have native spinlock support on this platform....
> 
> atomics/generic.h would implement atomic flags using operations exposed by
> atomics/generic-sunpro.h, but atomics/fallback.h is included before it and
> it defines functions for flag operations which s_lock.h doesn't want to use.

Right. It should check not just for flag support, but also for 32bit
atomics. Easily fixable.

I'll send a new version with your fixes incorporated tomorrow.

Thanks for looking into this! Very helpful.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.6

From
Heikki Linnakangas
Date:
On 09/23/2014 12:01 AM, Andres Freund wrote:
> Hi,
>
> I've finally managed to incorporate (all the?) feedback I got for
> 0.5. Imo the current version looks pretty good.

Thanks! I agree it looks good. Some random comments after a quick 
read-through:

There are some spurious whitespace changes in spin.c.

> + * These files can provide the full set of atomics or can do pretty much
> + * nothing if all the compilers commonly used on these platforms provide
> + * useable generics. It will often make sense to define memory barrier
> + * semantics in those, since e.g. the compiler intrinsics for x86 memory
> + * barriers can't know we don't need read/write barriers anything more than a
> + * compiler barrier.

I didn't understand the latter sentence.

> + * Don't add an inline assembly of the actual atomic operations if all the
> + * common implementations of your platform provide intrinsics. Intrinsics are
> + * much easier to understand and potentially support more architectures.

What about uncommon implementations, then? I think we need to provide 
inline assembly if any supported implementation on the platform does not 
provide intrinsics, otherwise we fall back to emulation. Or is that 
exactly what you're envisioning we do?

I believe such a situation hasn't come up on the currently supported 
platforms, so we don't necessary have to have a solution for that, but 
the comment doesn't feel quite right either.

> +#define CHECK_POINTER_ALIGNMENT(ptr, bndr) \
> +    Assert(TYPEALIGN((uintptr_t)(ptr), bndr))

Would be better to call this AssertAlignment, to make it clear that this 
is an assertion, not something like CHECK_FOR_INTERRUPTS(). And perhaps 
move this to c.h where other assertions are - this seems generally useful.

> + * Spinloop delay -

Spurious dash.

> +/*
> + * pg_fetch_add_until_u32 - saturated addition to variable
> + *
> + * Returns the the value of ptr after the arithmetic operation.
> + *
> + * Full barrier semantics.
> + */
> +STATIC_IF_INLINE uint32
> +pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, int32 add_,
> +                              uint32 until)
> +{
> +    CHECK_POINTER_ALIGNMENT(ptr, 4);
> +    return pg_atomic_fetch_add_until_u32_impl(ptr, add_, until);
> +}
> +

This was a surprise to me, I don't recall discussion of an 
"fetch-add-until" operation, and hadn't actually ever heard of it 
before. None of the subsequent patches seem to use it either. Can we 
just leave this out?

s/the the/the/

> +#ifndef PG_HAS_ATOMIC_WRITE_U64
> +#define PG_HAS_ATOMIC_WRITE_U64
> +static inline void
> +pg_atomic_write_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
> +{
> +    /*
> +     * 64 bit writes aren't safe on all platforms. In the generic
> +     * implementation implement them as an atomic exchange. That might store a
> +     * 0, but only if the prev. value also was a 0.
> +     */
> +    pg_atomic_exchange_u64_impl(ptr, val);
> +}
> +#endif

Why is 0 special?

> +     * fail or suceed, but always return the old value

s/suceed/succeed/. Add a full stop to end.

> +    /*
> +     * Can't run the test under the semaphore emulation, it doesn't handle
> +     * checking edge cases well.
> +     */
> +#ifndef PG_HAVE_ATOMIC_FLAG_SIMULATION
> +    test_atomic_flag();
> +#endif

Huh? Is the semaphore emulation broken, then?

- Heikki



Re: better atomics - v0.6

From
Andres Freund
Date:
On 2014-09-24 14:59:19 +0300, Heikki Linnakangas wrote:
> On 09/23/2014 12:01 AM, Andres Freund wrote:
> >Hi,
> >
> >I've finally managed to incorporate (all the?) feedback I got for
> >0.5. Imo the current version looks pretty good.
> 
> Thanks! I agree it looks good. Some random comments after a quick
> read-through:
> 
> There are some spurious whitespace changes in spin.c.

Huh. Unsure how they crept in. Will fix.

> >+ * These files can provide the full set of atomics or can do pretty much
> >+ * nothing if all the compilers commonly used on these platforms provide
> >+ * useable generics. It will often make sense to define memory barrier
> >+ * semantics in those, since e.g. the compiler intrinsics for x86 memory
> >+ * barriers can't know we don't need read/write barriers anything more than a
> >+ * compiler barrier.
> 
> I didn't understand the latter sentence.

Hm. The background here is that postgres doesn't need memory barriers
affect sse et al registers - which is why read/write barriers currently
can be defined to be empty on TSO architectures like x86. But e.g. gcc's
barrier intrinsics don't assume that, so they do a mfence or lock xaddl
for read/write barriers. Which isn't cheap.

Will try to find a way to rephrase.

> >+ * Don't add an inline assembly of the actual atomic operations if all the
> >+ * common implementations of your platform provide intrinsics. Intrinsics are
> >+ * much easier to understand and potentially support more architectures.
> 
> What about uncommon implementations, then? I think we need to provide inline
> assembly if any supported implementation on the platform does not provide
> intrinsics, otherwise we fall back to emulation. Or is that exactly what
> you're envisioning we do?

Yes, that's what I'm envisioning. E.g. old versions of solaris studio
don't provide compiler intrinsics and also don't provide reliable ways
to barrier semantics. It doesn't seem worthwile to add a inline assembly
version for that special case.

> I believe such a situation hasn't come up on the currently supported
> platforms, so we don't necessary have to have a solution for that, but the
> comment doesn't feel quite right either.

It's the reason I added a inline assembly version of atomics for x86
gcc...

> >+#define CHECK_POINTER_ALIGNMENT(ptr, bndr) \
> >+    Assert(TYPEALIGN((uintptr_t)(ptr), bndr))
> 
> Would be better to call this AssertAlignment, to make it clear that this is
> an assertion, not something like CHECK_FOR_INTERRUPTS(). And perhaps move
> this to c.h where other assertions are - this seems generally useful.

Sounds good.

> >+ * Spinloop delay -
> 
> Spurious dash.

Probably should rather add a bit more explanation...

> >+/*
> >+ * pg_fetch_add_until_u32 - saturated addition to variable
> >+ *
> >+ * Returns the the value of ptr after the arithmetic operation.
> >+ *
> >+ * Full barrier semantics.
> >+ */
> >+STATIC_IF_INLINE uint32
> >+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, int32 add_,
> >+                              uint32 until)
> >+{
> >+    CHECK_POINTER_ALIGNMENT(ptr, 4);
> >+    return pg_atomic_fetch_add_until_u32_impl(ptr, add_, until);
> >+}
> >+
> 
> This was a surprise to me, I don't recall discussion of an "fetch-add-until"
> operation, and hadn't actually ever heard of it before.

It was included from the first version on, and I'd mentioned it a couple
times.

> None of the subsequent patches seem to use it either. Can we just
> leave this out?

I have a further patch (lockless buffer header manipulation) that uses
it to manipulate the usagecount. I can add it back later if you feel
strongly about it.

> s/the the/the/
> 
> >+#ifndef PG_HAS_ATOMIC_WRITE_U64
> >+#define PG_HAS_ATOMIC_WRITE_U64
> >+static inline void
> >+pg_atomic_write_u64_impl(volatile pg_atomic_uint64 *ptr, uint64 val)
> >+{
> >+    /*
> >+     * 64 bit writes aren't safe on all platforms. In the generic
> >+     * implementation implement them as an atomic exchange. That might store a
> >+     * 0, but only if the prev. value also was a 0.
> >+     */
> >+    pg_atomic_exchange_u64_impl(ptr, val);
> >+}
> >+#endif
> 
> Why is 0 special?

That bit should actually be in the read_u64 implementation, not write...

If you do a compare and exchange you have to provide an 'expected'
value. So, if you do a compare and exchange and the previous value is 0
the value will be written superflously - but the actual value won't change.

> >+    /*
> >+     * Can't run the test under the semaphore emulation, it doesn't handle
> >+     * checking edge cases well.
> >+     */
> >+#ifndef PG_HAVE_ATOMIC_FLAG_SIMULATION
> >+    test_atomic_flag();
> >+#endif
> 
> Huh? Is the semaphore emulation broken, then?

No, it's not. The semaphore emulation doesn't implement
pg_atomic_unlocked_test_flag() (as there's no sane way to do that with
semaphores that I know of). Instead it always returns true. Which
disturbs the test. I guess I'll expand that comment...

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.6

From
Heikki Linnakangas
Date:
On 09/24/2014 03:37 PM, Andres Freund wrote:
>>> > >+/*
>>> > >+ * pg_fetch_add_until_u32 - saturated addition to variable
>>> > >+ *
>>> > >+ * Returns the the value of ptr after the arithmetic operation.
>>> > >+ *
>>> > >+ * Full barrier semantics.
>>> > >+ */
>>> > >+STATIC_IF_INLINE uint32
>>> > >+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, int32 add_,
>>> > >+                              uint32 until)
>>> > >+{
>>> > >+    CHECK_POINTER_ALIGNMENT(ptr, 4);
>>> > >+    return pg_atomic_fetch_add_until_u32_impl(ptr, add_, until);
>>> > >+}
>>> > >+
>> >
>> >This was a surprise to me, I don't recall discussion of an "fetch-add-until"
>> >operation, and hadn't actually ever heard of it before.
> It was included from the first version on, and I'd mentioned it a couple
> times.

There doesn't seem to be any hardware implementations of that in the 
patch. Is there any architecture that has an instruction or compiler 
intrinsic for that?

- Heikki



Re: better atomics - v0.6

From
Andres Freund
Date:
On 2014-09-24 18:55:51 +0300, Heikki Linnakangas wrote:
> On 09/24/2014 03:37 PM, Andres Freund wrote:
> >>>> >+/*
> >>>> >+ * pg_fetch_add_until_u32 - saturated addition to variable
> >>>> >+ *
> >>>> >+ * Returns the the value of ptr after the arithmetic operation.
> >>>> >+ *
> >>>> >+ * Full barrier semantics.
> >>>> >+ */
> >>>> >+STATIC_IF_INLINE uint32
> >>>> >+pg_atomic_fetch_add_until_u32(volatile pg_atomic_uint32 *ptr, int32 add_,
> >>>> >+                              uint32 until)
> >>>> >+{
> >>>> >+    CHECK_POINTER_ALIGNMENT(ptr, 4);
> >>>> >+    return pg_atomic_fetch_add_until_u32_impl(ptr, add_, until);
> >>>> >+}
> >>>> >+
> >>>
> >>>This was a surprise to me, I don't recall discussion of an "fetch-add-until"
> >>>operation, and hadn't actually ever heard of it before.
> >It was included from the first version on, and I'd mentioned it a couple
> >times.
> 
> There doesn't seem to be any hardware implementations of that in the patch.
> Is there any architecture that has an instruction or compiler intrinsic for
> that?

You can implement it rather efficiently on ll/sc architectures. But I
don't really think it matters. I prefer add_until (I've seen it named
saturated add before as well) to live in the atomics code, rather than
reimplement it in atomics employing code. I guess you see that
differently?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.6

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-09-24 18:55:51 +0300, Heikki Linnakangas wrote:
>> There doesn't seem to be any hardware implementations of that in the patch.
>> Is there any architecture that has an instruction or compiler intrinsic for
>> that?

> You can implement it rather efficiently on ll/sc architectures. But I
> don't really think it matters. I prefer add_until (I've seen it named
> saturated add before as well) to live in the atomics code, rather than
> reimplement it in atomics employing code. I guess you see that
> differently?

I think the question is more like "what in the world happened to confining
ourselves to a small set of atomics".  I doubt either that this exists
natively anywhere, or that it's so useful that we should expect platforms
to have efficient implementations.
        regards, tom lane



Re: better atomics - v0.6

From
andres@anarazel.de (Andres Freund)
Date:
On 2014-09-24 12:44:09 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-09-24 18:55:51 +0300, Heikki Linnakangas wrote:
> >> There doesn't seem to be any hardware implementations of that in the patch.
> >> Is there any architecture that has an instruction or compiler intrinsic for
> >> that?
> 
> > You can implement it rather efficiently on ll/sc architectures. But I
> > don't really think it matters. I prefer add_until (I've seen it named
> > saturated add before as well) to live in the atomics code, rather than
> > reimplement it in atomics employing code. I guess you see that
> > differently?
> 
> I think the question is more like "what in the world happened to confining
> ourselves to a small set of atomics". 

I fail to see why the existance of a wrapper around compare-exchange
(which is one of the primitives we'd agreed upon) runs counter to
the agreement that we'll only rely on a limited number of atomics on the
hardware level?

> I doubt either that this exists
> natively anywhere, or ethat it's so useful that we should expect platforms
> to have efficient implementations.

It's useful for my work to get rid of most LockBufHdr() calls (to
manipulate usagecount locklessly). That's why I added it. We can delay
it till that patch is ready, but I don't really see the benefit.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.6

From
Heikki Linnakangas
Date:
On 09/24/2014 07:57 PM, Andres Freund wrote:
> On 2014-09-24 12:44:09 -0400, Tom Lane wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>>> On 2014-09-24 18:55:51 +0300, Heikki Linnakangas wrote:
>>>> There doesn't seem to be any hardware implementations of that in the patch.
>>>> Is there any architecture that has an instruction or compiler intrinsic for
>>>> that?
>>
>>> You can implement it rather efficiently on ll/sc architectures. But I
>>> don't really think it matters. I prefer add_until (I've seen it named
>>> saturated add before as well) to live in the atomics code, rather than
>>> reimplement it in atomics employing code. I guess you see that
>>> differently?
>>
>> I think the question is more like "what in the world happened to confining
>> ourselves to a small set of atomics".
>
> I fail to see why the existance of a wrapper around compare-exchange
> (which is one of the primitives we'd agreed upon) runs counter to
> the agreement that we'll only rely on a limited number of atomics on the
> hardware level?

It might be a useful function, but if there's no hardware implementation 
for it, it doesn't belong in atomics.h. We don't want to turn it into a 
general library of useful little functions.

>> I doubt either that this exists
>> natively anywhere, or ethat it's so useful that we should expect platforms
>> to have efficient implementations.

Googling around, ARM seems to have a QADD instruction that does that. 
But AFAICS it's not an atomic instruction that you could use for 
synchronization purposes, just a regular instruction.

> It's useful for my work to get rid of most LockBufHdr() calls (to
> manipulate usagecount locklessly). That's why I added it. We can delay
> it till that patch is ready, but I don't really see the benefit.

Yeah, please leave it out for now, we can argue about it later. Even if 
we want it in the future, it would be just dead, untested code now.

- Heikki



Re: better atomics - v0.6

From
Andres Freund
Date:
On 2014-09-24 21:19:06 +0300, Heikki Linnakangas wrote:
> On 09/24/2014 07:57 PM, Andres Freund wrote:
> >On 2014-09-24 12:44:09 -0400, Tom Lane wrote:
> >>Andres Freund <andres@2ndquadrant.com> writes:
> >>>On 2014-09-24 18:55:51 +0300, Heikki Linnakangas wrote:
> >>>>There doesn't seem to be any hardware implementations of that in the patch.
> >>>>Is there any architecture that has an instruction or compiler intrinsic for
> >>>>that?
> >>
> >>>You can implement it rather efficiently on ll/sc architectures. But I
> >>>don't really think it matters. I prefer add_until (I've seen it named
> >>>saturated add before as well) to live in the atomics code, rather than
> >>>reimplement it in atomics employing code. I guess you see that
> >>>differently?
> >>
> >>I think the question is more like "what in the world happened to confining
> >>ourselves to a small set of atomics".
> >
> >I fail to see why the existance of a wrapper around compare-exchange
> >(which is one of the primitives we'd agreed upon) runs counter to
> >the agreement that we'll only rely on a limited number of atomics on the
> >hardware level?
> 
> It might be a useful function, but if there's no hardware implementation for
> it, it doesn't belong in atomics.h. We don't want to turn it into a general
> library of useful little functions.

Uh. It belongs there because it *atomically* does a saturated add?
That's precisely what atomics.h is about, so I don't see why it'd belong
anywhere else.

> >>I doubt either that this exists
> >>natively anywhere, or ethat it's so useful that we should expect platforms
> >>to have efficient implementations.
> 
> Googling around, ARM seems to have a QADD instruction that does that. But
> AFAICS it's not an atomic instruction that you could use for synchronization
> purposes, just a regular instruction.

On ARM - and many other LL/SC architectures - you can implement it
atomically using LL/SC. In pseudocode:

while (true)
{   load_linked(mem, reg);   if (reg + add < limit)       reg = reg + add;   else       reg = limit;   if
(store_conditional(mem,reg))       break
 
}

That's how pretty much *all* atomic instructions work on such architectures.

> >It's useful for my work to get rid of most LockBufHdr() calls (to
> >manipulate usagecount locklessly). That's why I added it. We can delay
> >it till that patch is ready, but I don't really see the benefit.
> 
> Yeah, please leave it out for now, we can argue about it later. Even if we
> want it in the future, it would be just dead, untested code now.

Ok, will remove it for now.

I won't repost a version with it removed, as removing a function as the
only doesn't seem to warrant reposting it.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.6

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 09/24/2014 07:57 PM, Andres Freund wrote:
>> On 2014-09-24 12:44:09 -0400, Tom Lane wrote:
>>> I think the question is more like "what in the world happened to confining
>>> ourselves to a small set of atomics".

>> I fail to see why the existance of a wrapper around compare-exchange
>> (which is one of the primitives we'd agreed upon) runs counter to
>> the agreement that we'll only rely on a limited number of atomics on the
>> hardware level?

> It might be a useful function, but if there's no hardware implementation 
> for it, it doesn't belong in atomics.h. We don't want to turn it into a 
> general library of useful little functions.

Note that the spinlock code separates s_lock.h (hardware implementations)
from spin.h (a hardware-independent abstraction layer).  Perhaps there's
room for a similar separation here.  I tend to agree with Heikki that
wrappers around compare-exchange ought not be conflated with
compare-exchange itself, even if there might theoretically be
architectures where the wrapper function could be implemented directly.
        regards, tom lane



Re: better atomics - v0.6

From
Andres Freund
Date:
On 2014-09-24 14:28:18 -0400, Tom Lane wrote:
> Note that the spinlock code separates s_lock.h (hardware implementations)
> from spin.h (a hardware-independent abstraction layer).  Perhaps there's
> room for a similar separation here.

Luckily that separation exists ;). The hardware dependant bits are in
different files than the platform independent functionality on top of
them (atomics/generic.h).

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: better atomics - v0.6

From
Andres Freund
Date:
On 2014-09-24 20:27:39 +0200, Andres Freund wrote:
> On 2014-09-24 21:19:06 +0300, Heikki Linnakangas wrote:
> I won't repost a version with it removed, as removing a function as the
> only doesn't seem to warrant reposting it.

I've fixed (thanks Alvaro!) some minor additional issues besides the
removal and addressing earlier comments from Heikki:
* removal of two remaining non-ascii copyright signs. The only non ascii
  name character that remains is Alvaro's name in the commit message.
* additional comments for STATIC_IF_INLINE/STATIC_IF_INLINE_DECLARE
* there was a leftover HAVE_GCC_INT_ATOMICS reference - it's now split
  into different configure tests and thus has a different name. A later
  patch entirely removes that reference, which is why I'd missed that...

Heikki has marked the patch as 'ready for commiter' in the commitfest
(conditional on his remarks being addressed) and I agree. There seems to
be little benefit in waiting further. There *definitely* will be some
platform dependant issues, but that won't change by waiting longer.

I plan to commit this quite soon unless somebody protests really
quickly.

Sorting out the issues on platforms I don't have access to based on
buildfarm feedback will take a while given how infrequent some of the
respective animals run... But that's just life.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: better atomics - v0.6

From
Stephen Frost
Date:
Andres,

* Andres Freund (andres@anarazel.de) wrote:
> Heikki has marked the patch as 'ready for commiter' in the commitfest
> (conditional on his remarks being addressed) and I agree. There seems to
> be little benefit in waiting further. There *definitely* will be some
> platform dependant issues, but that won't change by waiting longer.

Agreed.  I won't claim to have done as good a review as others, but I've
been following along as best I've been able to.

> I plan to commit this quite soon unless somebody protests really
> quickly.

+1 from me for at least moving forward to see what happens.

> Sorting out the issues on platforms I don't have access to based on
> buildfarm feedback will take a while given how infrequent some of the
> respective animals run... But that's just life.

The buildfarm is another tool (as well as Coverity) that provides a lot
of insight into issues we may not realize ahead of time, but using them
(currently) means we have to eventually go beyond looking at the code
individually and actually commit it.

I'm very excited to see this progress and trust you, Heikki, Robert, and
the others involved in this work to have done an excellent job and to be
ready and prepared to address any issues which come from it, or to pull
it out if it turns out to be necessary (which I seriously doubt, but
still).
Thanks!
    Stephen

Re: better atomics - v0.6

From
Peter Geoghegan
Date:
On Thu, Sep 25, 2014 at 3:23 PM, Stephen Frost <sfrost@snowman.net> wrote:
> * Andres Freund (andres@anarazel.de) wrote:
>> Heikki has marked the patch as 'ready for commiter' in the commitfest
>> (conditional on his remarks being addressed) and I agree. There seems to
>> be little benefit in waiting further. There *definitely* will be some
>> platform dependant issues, but that won't change by waiting longer.
>
> Agreed.  I won't claim to have done as good a review as others, but I've
> been following along as best I've been able to.

FWIW, I agree that you've already done plenty to ensure that this
works well on common platforms. We cannot reasonably expect you to be
just as sure that there are not problems on obscure platforms ahead of
commit. It isn't unheard of for certain committers (that don't like
Windows, say), to commit things despite having a suspicion that their
patch will turn some fraction of the buildfarm red. To a certain
extent, that's what we have it for.

-- 
Peter Geoghegan



Re: better atomics - v0.6

From
Robert Haas
Date:
On Thu, Sep 25, 2014 at 6:03 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2014-09-24 20:27:39 +0200, Andres Freund wrote:
>> On 2014-09-24 21:19:06 +0300, Heikki Linnakangas wrote:
>> I won't repost a version with it removed, as removing a function as the
>> only doesn't seem to warrant reposting it.
>
> I've fixed (thanks Alvaro!) some minor additional issues besides the
> removal and addressing earlier comments from Heikki:
> * removal of two remaining non-ascii copyright signs. The only non ascii
>   name character that remains is Alvaro's name in the commit message.
> * additional comments for STATIC_IF_INLINE/STATIC_IF_INLINE_DECLARE
> * there was a leftover HAVE_GCC_INT_ATOMICS reference - it's now split
>   into different configure tests and thus has a different name. A later
>   patch entirely removes that reference, which is why I'd missed that...
>
> Heikki has marked the patch as 'ready for commiter' in the commitfest
> (conditional on his remarks being addressed) and I agree. There seems to
> be little benefit in waiting further. There *definitely* will be some
> platform dependant issues, but that won't change by waiting longer.
>
> I plan to commit this quite soon unless somebody protests really
> quickly.
>
> Sorting out the issues on platforms I don't have access to based on
> buildfarm feedback will take a while given how infrequent some of the
> respective animals run... But that's just life.

I feel like this could really use a developer README: what primitives
do we have, which ones are likely to be efficient or inefficient on
which platforms, how do atomics interact with barriers, etc.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: better atomics - v0.6

From
Andres Freund
Date:
On 2014-09-25 21:02:28 -0400, Robert Haas wrote:
> I feel like this could really use a developer README: what primitives
> do we have, which ones are likely to be efficient or inefficient on
> which platforms, how do atomics interact with barriers, etc.

atomics.h has most of that. Including documenting which barrier
semantics the individual operations have. One change I can see as being
a good idea is to move/transform the definition of acquire/release
barriers from s_lock.h to README.barrier.

It doesn't document which operations are efficient on which platform,
but I wouldn't know what to say about that topic? The current set of
operations should be fast on pretty much all platforms.

What information would you like to see? An example of how to use
atomics?

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



xlc atomics

From
Noah Misch
Date:
On Wed, Jun 25, 2014 at 07:14:34PM +0200, Andres Freund wrote:
> * gcc, msvc work. acc, xlc, sunpro have blindly written support which
>   should be relatively easy to fix up.

I tried this on three xlc configurations.

(1) "IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)".  Getting it working
required the attached patch.  None of my xlc configurations have an <atomic.h>
header, and a web search turned up no evidence of one in connection with xlc
platforms.  Did you learn of a configuration needing atomic.h under xlc?  The
rest of the changes are hopefully self-explanatory in light of the
documentation cited in generic-xlc.h.  (Building on AIX has regressed in other
ways unrelated to atomics; I will write more about that in due course.)

(2) "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)" for ppc64le,
http://www-01.ibm.com/support/docview.wss?uid=swg27044056&aid=1.  This
compiler has a Clang-derived C frontend.  It defines __GNUC__ and offers
GCC-style __sync_* atomics.  Therefore, PostgreSQL selects generic-gcc.h.
test_atomic_ops() fails because __sync_lock_test_and_set() of one-byte types
segfaults at runtime.  I have reported this to the vendor.  Adding
"pgac_cv_gcc_sync_char_tas=no" to the "configure" invocation is a good
workaround.  I could add a comment about that to src/test/regress/sql/lock.sql
for affected folks to see in regression.diffs.  To do better, we could make
PGAC_HAVE_GCC__SYNC_CHAR_TAS perform a runtime test where possible.  Yet
another option is to force use of generic-xlc.h on this compiler.

(3) "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)" for ppc64le,
modifying atomics.h to force use of generic-xlc.h.  While not a supported
PostgreSQL configuration, I felt this would make an interesting data point.
It worked fine after applying the patch developed for the AIX configuration.

Thanks,
nm

Attachment

Re: xlc atomics

From
Andres Freund
Date:
On 2015-07-04 18:40:41 -0400, Noah Misch wrote:
> (1) "IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)".  Getting it working
> required the attached patch.  None of my xlc configurations have an <atomic.h>
> header, and a web search turned up no evidence of one in connection with xlc
> platforms.  Did you learn of a configuration needing atomic.h under xlc?  The
> rest of the changes are hopefully self-explanatory in light of the
> documentation cited in generic-xlc.h.  (Building on AIX has regressed in other
> ways unrelated to atomics; I will write more about that in due course.)

I wrote this entirely blindly, as evidenced here by the changes you
needed. At the time somebody had promised to soon put up an aix animal,
but that apparently didn't work out.

Will you apply? Having the ability to test change seems to put you in a
much better spot then me.

> (2) "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)" for ppc64le,
> http://www-01.ibm.com/support/docview.wss?uid=swg27044056&aid=1.  This
> compiler has a Clang-derived C frontend.  It defines __GNUC__ and offers
> GCC-style __sync_* atomics.

Phew. I don't see much reason to try to support this. Why would that be
interesting?

> Therefore, PostgreSQL selects generic-gcc.h.
> test_atomic_ops() fails because __sync_lock_test_and_set() of one-byte types
> segfaults at runtime.  I have reported this to the vendor.  Adding
> "pgac_cv_gcc_sync_char_tas=no" to the "configure" invocation is a good
> workaround.  I could add a comment about that to src/test/regress/sql/lock.sql
> for affected folks to see in regression.diffs.  To do better, we could make
> PGAC_HAVE_GCC__SYNC_CHAR_TAS perform a runtime test where possible.  Yet
> another option is to force use of generic-xlc.h on this compiler.

It seems fair enough to simply add another test and include
generic-xlc.h in that case. If it's indeed xlc, why not?

> (3) "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)" for ppc64le,
> modifying atomics.h to force use of generic-xlc.h.  While not a supported
> PostgreSQL configuration, I felt this would make an interesting data point.
> It worked fine after applying the patch developed for the AIX configuration.

Cool.

Greetings,

Andres Freund



Re: xlc atomics

From
Noah Misch
Date:
On Sun, Jul 05, 2015 at 12:54:43AM +0200, Andres Freund wrote:
> On 2015-07-04 18:40:41 -0400, Noah Misch wrote:
> > (1) "IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)".  Getting it working
> > required the attached patch.

> Will you apply? Having the ability to test change seems to put you in a
> much better spot then me.

I will.

> > (2) "IBM XL C/C++ for Linux, V13.1.2 (5725-C73, 5765-J08)" for ppc64le,
> > http://www-01.ibm.com/support/docview.wss?uid=swg27044056&aid=1.  This
> > compiler has a Clang-derived C frontend.  It defines __GNUC__ and offers
> > GCC-style __sync_* atomics.
>
> Phew. I don't see much reason to try to support this. Why would that be
> interesting?
>
> > Therefore, PostgreSQL selects generic-gcc.h.
> > test_atomic_ops() fails because __sync_lock_test_and_set() of one-byte types
> > segfaults at runtime.  I have reported this to the vendor.  Adding
> > "pgac_cv_gcc_sync_char_tas=no" to the "configure" invocation is a good
> > workaround.  I could add a comment about that to src/test/regress/sql/lock.sql
> > for affected folks to see in regression.diffs.  To do better, we could make
> > PGAC_HAVE_GCC__SYNC_CHAR_TAS perform a runtime test where possible.  Yet
> > another option is to force use of generic-xlc.h on this compiler.
>
> It seems fair enough to simply add another test and include
> generic-xlc.h in that case. If it's indeed xlc, why not?

Works for me.  I'll do that as attached.

Attachment

Re: xlc atomics

From
Noah Misch
Date:
On Sat, Jul 04, 2015 at 08:07:49PM -0400, Noah Misch wrote:
> On Sun, Jul 05, 2015 at 12:54:43AM +0200, Andres Freund wrote:
> > On 2015-07-04 18:40:41 -0400, Noah Misch wrote:
> > > (1) "IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)".  Getting it working
> > > required the attached patch.
>
> > Will you apply? Having the ability to test change seems to put you in a
> > much better spot then me.
>
> I will.

That commit (0d32d2e) permitted things to compile and usually pass tests, but
I missed the synchronization bug.  Since 2015-10-01, the buildfarm has seen
sixteen duplicate-catalog-OID failures.  The compiler has always been xlC, and
the branch has been HEAD or 9.5.  Examples:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2015-12-15%2004%3A12%3A49
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hornet&dt=2015-12-08%2001%3A20%3A16
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2015-12-11%2005%3A25%3A54

These suggested OidGenLock wasn't doing its job.  I've seen similar symptoms
around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73,
5765-J08)" for ppc64le.  The problem is generic-xlc.h
pg_atomic_compare_exchange_u32_impl() issuing __isync() before
__compare_and_swap().  __isync() shall follow __compare_and_swap(); see our
own s_lock.h, its references, and other projects' usage:


https://github.com/boostorg/smart_ptr/blob/boost-1.60.0/include/boost/smart_ptr/detail/sp_counted_base_vacpp_ppc.hpp#L55

http://redmine.gromacs.org/projects/gromacs/repository/entry/src/external/thread_mpi/include/thread_mpi/atomic/xlc_ppc.h?rev=v5.1.2#L112
https://www-01.ibm.com/support/knowledgecenter/SSGH2K_13.1.2/com.ibm.xlc131.aix.doc/language_ref/asm_example.html

This patch's test case would have failed about half the time under today's
generic-xlc.h.  Fast machines run it in about 1s.  A test that detects the bug
99% of the time would run far longer, hence this compromise.


Archaeology enthusiasts may enjoy the bug's changing presentation over time.
LWLockAcquire() exposed it after commit ab5194e redesigned lwlock.c in terms
of atomics.  Assert builds were unaffected for commits in [ab5194e, bbfd7ed);
atomics.h asserts fattened code enough to mask the bug.  However, removing the
AssertPointerAlignment() from pg_atomic_fetch_or_u32() would have sufficed to
surface it in assert builds.  All builds demonstrated the bug once bbfd7ed
introduced xlC pg_attribute_noreturn(), which elicits a simpler instruction
sequence around the ExceptionalCondition() call embedded in each Assert.

nm

Attachment

Re: xlc atomics

From
Andres Freund
Date:
On 2016-02-15 12:11:29 -0500, Noah Misch wrote:
> These suggested OidGenLock wasn't doing its job.  I've seen similar symptoms
> around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73,
> 5765-J08)" for ppc64le.  The problem is generic-xlc.h
> pg_atomic_compare_exchange_u32_impl() issuing __isync() before
> __compare_and_swap().  __isync() shall follow __compare_and_swap(); see our
> own s_lock.h, its references, and other projects' usage:

Ugh. You're right! It's about not moving code before the stwcx...

Do you want to add the new test (no objection, curious), or is that more
for testing?

Greetings,

Andres



Re: xlc atomics

From
Noah Misch
Date:
On Mon, Feb 15, 2016 at 06:33:42PM +0100, Andres Freund wrote:
> On 2016-02-15 12:11:29 -0500, Noah Misch wrote:
> > These suggested OidGenLock wasn't doing its job.  I've seen similar symptoms
> > around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73,
> > 5765-J08)" for ppc64le.  The problem is generic-xlc.h
> > pg_atomic_compare_exchange_u32_impl() issuing __isync() before
> > __compare_and_swap().  __isync() shall follow __compare_and_swap(); see our
> > own s_lock.h, its references, and other projects' usage:
> 
> Ugh. You're right! It's about not moving code before the stwcx...
> 
> Do you want to add the new test (no objection, curious), or is that more
> for testing?

The patch's test would join PostgreSQL indefinitely.



Re: xlc atomics

From
Tom Lane
Date:
Noah Misch <noah@leadboat.com> writes:
> That commit (0d32d2e) permitted things to compile and usually pass tests, but
> I missed the synchronization bug.  Since 2015-10-01, the buildfarm has seen
> sixteen duplicate-catalog-OID failures.

I'd been wondering about those ...

> These suggested OidGenLock wasn't doing its job.  I've seen similar symptoms
> around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73,
> 5765-J08)" for ppc64le.  The problem is generic-xlc.h
> pg_atomic_compare_exchange_u32_impl() issuing __isync() before
> __compare_and_swap().  __isync() shall follow __compare_and_swap(); see our
> own s_lock.h, its references, and other projects' usage:

Nice catch!

> This patch's test case would have failed about half the time under today's
> generic-xlc.h.  Fast machines run it in about 1s.  A test that detects the bug
> 99% of the time would run far longer, hence this compromise.

Sounds like a reasonable compromise to me, although I wonder about the
value of it if we stick it into pgbench's TAP tests.  How many of the
slower buildfarm members are running the TAP tests?  Certainly mine are
not.
        regards, tom lane



Re: xlc atomics

From
Craig Ringer
Date:
On 5 July 2015 at 06:54, Andres Freund <andres@anarazel.de> wrote:
 
I wrote this entirely blindly, as evidenced here by the changes you
needed. At the time somebody had promised to soon put up an aix animal,
but that apparently didn't work out.

Similarly, I asked IBM for XL/C for a POWER7 Linux VM to improve test cover on the build farm but was told to just use gcc. It was with regards to testing hardware decfloat support and the folks I spoke to said that gcc's support was derived from that in xl/c anyway.

They're not desperately keen to get their products out there for testing.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: xlc atomics

From
Noah Misch
Date:
On Mon, Feb 15, 2016 at 02:02:41PM -0500, Tom Lane wrote:
> Noah Misch <noah@leadboat.com> writes:
> > That commit (0d32d2e) permitted things to compile and usually pass tests, but
> > I missed the synchronization bug.  Since 2015-10-01, the buildfarm has seen
> > sixteen duplicate-catalog-OID failures.
>
> I'd been wondering about those ...
>
> > These suggested OidGenLock wasn't doing its job.  I've seen similar symptoms
> > around WALInsertLocks with "IBM XL C/C++ for Linux, V13.1.2 (5725-C73,
> > 5765-J08)" for ppc64le.  The problem is generic-xlc.h
> > pg_atomic_compare_exchange_u32_impl() issuing __isync() before
> > __compare_and_swap().  __isync() shall follow __compare_and_swap(); see our
> > own s_lock.h, its references, and other projects' usage:
>
> Nice catch!
>
> > This patch's test case would have failed about half the time under today's
> > generic-xlc.h.  Fast machines run it in about 1s.  A test that detects the bug
> > 99% of the time would run far longer, hence this compromise.
>
> Sounds like a reasonable compromise to me, although I wonder about the
> value of it if we stick it into pgbench's TAP tests.  How many of the
> slower buildfarm members are running the TAP tests?  Certainly mine are
> not.

I missed a second synchronization bug in generic-xlc.h, but the pgbench test
suite caught it promptly after commit 008608b.  Buildfarm members mandrill and
hornet failed[1] or deadlocked within two runs.  The bug is that the
pg_atomic_compare_exchange_*() specifications grant "full barrier semantics",
but generic-xlc.h provided only the semantics of an acquire barrier.  Commit
008608b exposed that older problem; its LWLockWaitListUnlock() relies on
pg_atomic_compare_exchange_u32() (via pg_atomic_fetch_and_u32()) for release
barrier semantics.

The fix is to issue "sync" before each compare-and-swap, in addition to the
"isync" after it.  This is consistent with the corresponding GCC atomics.  The
pgbench test suite survived dozens of runs so patched.

[1] http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mandrill&dt=2016-04-11%2003%3A14%3A13

Attachment

Re: xlc atomics

From
Andres Freund
Date:
On 2016-04-23 21:54:07 -0400, Noah Misch wrote:
> I missed a second synchronization bug in generic-xlc.h, but the pgbench test
> suite caught it promptly after commit 008608b.

Nice catch.


> The bug is that the pg_atomic_compare_exchange_*() specifications
> grant "full barrier semantics", but generic-xlc.h provided only the
> semantics of an acquire barrier.

I find the docs at

http://www-01.ibm.com/support/knowledgecenter/SSGH3R_13.1.2/com.ibm.xlcpp131.aix.doc/compiler_ref/bifs_sync_atomic.html
to be be awfully silent about that matter. I guess you just looked at
the assembler code?  It's nice that one can figure out stuff like that
from an architecture manual, but it's sad that the docs for the
intrinsics is silent about that matter.


I do wonder if I shouldn't have left xlc fall by the wayside, and make
it use the fallback implementation. Given it's popularity (or rather
lack thereof) that'd probably have been a better choice.  But that ship
has sailed.


Except that I didn't verify the rs6000_pre_atomic_barrier() and
__fetch_and_add() internals about emitted sync/isync, the patch looks
good.   We've so far not referred to "sequential consistency", but given
it's rise in popularity, I don't think it hurts.

Stricly speaking generic-xlc.c shouldn't contain inline-asm, but given
xlc is only there for power...

Greetings,

Andres Freund



Re: xlc atomics

From
Noah Misch
Date:
On Mon, Apr 25, 2016 at 11:52:04AM -0700, Andres Freund wrote:
> On 2016-04-23 21:54:07 -0400, Noah Misch wrote:
> > The bug is that the pg_atomic_compare_exchange_*() specifications
> > grant "full barrier semantics", but generic-xlc.h provided only the
> > semantics of an acquire barrier.
> 
> I find the docs at
>
http://www-01.ibm.com/support/knowledgecenter/SSGH3R_13.1.2/com.ibm.xlcpp131.aix.doc/compiler_ref/bifs_sync_atomic.html
> to be be awfully silent about that matter. I guess you just looked at
> the assembler code?  It's nice that one can figure out stuff like that
> from an architecture manual, but it's sad that the docs for the
> intrinsics is silent about that matter.

Right.  The intrinsics provide little value as abstractions if one checks the
generated code to deduce how to use them.  I was tempted to replace the
intrinsics calls with inline asm.  At least these functions are unlikely to
change over time.

> Except that I didn't verify the rs6000_pre_atomic_barrier() and
> __fetch_and_add() internals about emitted sync/isync, the patch looks
> good.   We've so far not referred to "sequential consistency", but given
> it's rise in popularity, I don't think it hurts.

Thanks; committed.