Thread: TAS patch for building on armel/armhf thumb

TAS patch for building on armel/armhf thumb

From
Martin Pitt
Date:
Hello,

if you build postgresql (tested all releases from 8.4 up to trunk) for
ARM with the -mthumb instruction set (much better performance), it
fails with

gcc -g -O2 -g -Wall -O2 -fPIC -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels
-fno-strict-aliasing-fwrapv -g -I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/tcl8.5 -c
-oxlog.o xlog.c 
/tmp/cc8Wkglp.s: Assembler messages:
/tmp/cc8Wkglp.s:1456: Error: selected processor does not support `swpb r3,r3,[r0]'
/tmp/cc8Wkglp.s:1587: Error: selected processor does not support `swpb r2,r2,[r0]'

A fair while ago, Alexander Sack from Linaro applied a patch to our
packages to drop the Assembler bits and instead use gcc's atomic
builtins [1], which provide a proper implementation for thumb, too.

The original patch spectacularly failed on our slightly newer Panda
boards (our old builders were Freescale Babbage boards), but I got
that to work yesterday. Now it's working on Babbage, Panda, both with
and without hard float (armhf) enabled.

I'm not sure how appropriate it is for upstream to have GCC-isms in
the code, but even if it can't land upstream, perhaps it is useful for
other porters/packagers.

Thanks,

Martin

[1] http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
--
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)

Attachment

Re: TAS patch for building on armel/armhf thumb

From
Heikki Linnakangas
Date:
On 16.12.2011 11:36, Martin Pitt wrote:
> if you build postgresql (tested all releases from 8.4 up to trunk) for
> ARM with the -mthumb instruction set (much better performance), it
> fails with
>
> gcc -g -O2 -g -Wall -O2 -fPIC -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels
-fno-strict-aliasing-fwrapv -g -I../../../../src/include -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/tcl8.5 -c
-oxlog.o xlog.c 
> /tmp/cc8Wkglp.s: Assembler messages:
> /tmp/cc8Wkglp.s:1456: Error: selected processor does not support `swpb r3,r3,[r0]'
> /tmp/cc8Wkglp.s:1587: Error: selected processor does not support `swpb r2,r2,[r0]'
>
> A fair while ago, Alexander Sack from Linaro applied a patch to our
> packages to drop the Assembler bits and instead use gcc's atomic
> builtins [1], which provide a proper implementation for thumb, too.
>
> The original patch spectacularly failed on our slightly newer Panda
> boards (our old builders were Freescale Babbage boards), but I got
> that to work yesterday. Now it's working on Babbage, Panda, both with
> and without hard float (armhf) enabled.
>
> I'm not sure how appropriate it is for upstream to have GCC-isms in
> the code, but even if it can't land upstream, perhaps it is useful for
> other porters/packagers.

Should be ok, as long as it's within #ifdef __GNUC__.

An even better approach would be to have a configure test for
__sync_lock_test_and_set. A quick google search suggests that Intel C
Compiler version >= 11.0 also supports __sync_lock_test_and_set, for
example. It probably makes sense to use it on any platform where it's
defined. Presumably an implementation provided by the compiler is always
going to be at least as good as any magic assembler incantations we can
come up with.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Hello all,

Heikki Linnakangas wrote:
> An even better approach would be to have a configure test for
> __sync_lock_test_and_set. A quick google search suggests that Intel
> C Compiler version >= 11.0 also supports __sync_lock_test_and_set,
> for example.

Right,
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
explicitly refers to intel CC as well.

> It probably makes sense to use it on any platform where it's
> defined. Presumably an implementation provided by the compiler is
> always going to be at least as good as any magic assembler
> incantations we can come up with.

I agree. How about a patch like this? It uses builtin atomics if
available, and falls back to the custom implementations if not.

Note that a simple AC_CHECK_FUNC(__sync_lock_test_and_set) does not
work, so I used a slightly more complicated AC_TRY_LINK() which also
verifies that __sync_lock_release() is available.

The patch is against 9.1.2, but I suppose it also applies to trunk
(except perhaps the autogenerated configure part, this needs an
autoheader/autoconf run).

Thanks for considering,

Martin

--
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)

Attachment
On Sun, Dec 18, 2011 at 5:42 PM, Martin Pitt <mpitt@debian.org> wrote:
>> It probably makes sense to use it on any platform where it's
>> defined. Presumably an implementation provided by the compiler is
>> always going to be at least as good as any magic assembler
>> incantations we can come up with.
>
> I agree. How about a patch like this? It uses builtin atomics if
> available, and falls back to the custom implementations if not.

-1.  Absent some evidence that gcc's implementations are superior to
ours, I think we should not change stuff that works now.  That's
likely to lead to subtle bugs that are hard to find and perhaps
dependent on the exact compiler version used.

But I'm completely cool with doing this for platforms where we haven't
otherwise got an implementation.  Any port in a storm.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 19.12.2011 16:31, Robert Haas wrote:
> On Sun, Dec 18, 2011 at 5:42 PM, Martin Pitt<mpitt@debian.org>  wrote:
>>> It probably makes sense to use it on any platform where it's
>>> defined. Presumably an implementation provided by the compiler is
>>> always going to be at least as good as any magic assembler
>>> incantations we can come up with.
>>
>> I agree. How about a patch like this? It uses builtin atomics if
>> available, and falls back to the custom implementations if not.
>
> -1.  Absent some evidence that gcc's implementations are superior to
> ours, I think we should not change stuff that works now.  That's
> likely to lead to subtle bugs that are hard to find and perhaps
> dependent on the exact compiler version used.

Ok, we're in disagreement on that then. I don't feel very strongly about
it, let's see what others think.

One thing that caught my eye: if you use __sync_lock_and_test() to
implement S_LOCK(), you really should be using __sync_lock_release() for
S_UNLOCK().

Actually, I believe our Itanium (and possibly ARM, too) implementation
of S_UNLOCK() is wrong as it is. There is no platform-specific
S_UNLOCK() defined for Itanium, so we're using the generic implementation:

#if !defined(S_UNLOCK)
#define S_UNLOCK(lock)        (*((volatile slock_t *) (lock)) = 0)
#endif     /* S_UNLOCK */

That is not sufficient on platforms with a weak memory model, like Itanium.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> On 19.12.2011 16:31, Robert Haas wrote:
>> On Sun, Dec 18, 2011 at 5:42 PM, Martin Pitt<mpitt@debian.org>  wrote:
>>> I agree. How about a patch like this? It uses builtin atomics if
>>> available, and falls back to the custom implementations if not.

>> -1.  Absent some evidence that gcc's implementations are superior to
>> ours, I think we should not change stuff that works now.  That's
>> likely to lead to subtle bugs that are hard to find and perhaps
>> dependent on the exact compiler version used.

> Ok, we're in disagreement on that then. I don't feel very strongly about
> it, let's see what others think.

I agree with Robert.  There is no evidence whatsoever that this would
be an improvement, and unless somebody cares to provide such evidence,
we shouldn't risk changing code that's so full of portability hazards.

> Actually, I believe our Itanium (and possibly ARM, too) implementation
> of S_UNLOCK() is wrong as it is.

Hmm.  Anybody got a large itanium box we could play with?  If it is
wrong, I'd expect it would show up pretty quickly under pgbench or
similar.

            regards, tom lane
Robert Haas [2011-12-19  9:31 -0500]:
> On Sun, Dec 18, 2011 at 5:42 PM, Martin Pitt <mpitt@debian.org> wrote:
> >> It probably makes sense to use it on any platform where it's
> >> defined. Presumably an implementation provided by the compiler is
> >> always going to be at least as good as any magic assembler
> >> incantations we can come up with.
> >
> > I agree. How about a patch like this? It uses builtin atomics if
> > available, and falls back to the custom implementations if not.
>
> -1.  Absent some evidence that gcc's implementations are superior to
> ours, I think we should not change stuff that works now.  That's
> likely to lead to subtle bugs that are hard to find and perhaps
> dependent on the exact compiler version used.
>
> But I'm completely cool with doing this for platforms where we haven't
> otherwise got an implementation.  Any port in a storm.

Sure, then the other option is to stuff this at the end of s_lock.h if
we don't already have HAS_TEST_AND_SET. This would then mean that we
need to remove the armel implementation, as it doesn't really work on
anything non-ancient, and the gcc one got some fairly good test
coverage by now.

I'm happy to work out the patch for this. I'll just wait a bit if
there are more comments on this.

Thanks,

Martin
--
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)
Heikki Linnakangas [2011-12-19 17:09 +0200]:
> One thing that caught my eye: if you use __sync_lock_and_test() to
> implement S_LOCK(), you really should be using __sync_lock_release()
> for S_UNLOCK().

Right, the patch I send does that:

 #define S_UNLOCK(lock) __sync_lock_release(lock)

Martin
--
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)
Tom Lane [2011-12-19 10:25 -0500]:
> Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
> > On 19.12.2011 16:31, Robert Haas wrote:
> >> On Sun, Dec 18, 2011 at 5:42 PM, Martin Pitt<mpitt@debian.org>  wrote:
> >>> I agree. How about a patch like this? It uses builtin atomics if
> >>> available, and falls back to the custom implementations if not.
>
> >> -1.  Absent some evidence that gcc's implementations are superior to
> >> ours, I think we should not change stuff that works now.  That's
> >> likely to lead to subtle bugs that are hard to find and perhaps
> >> dependent on the exact compiler version used.
>
> > Ok, we're in disagreement on that then. I don't feel very strongly about
> > it, let's see what others think.
>
> I agree with Robert.  There is no evidence whatsoever that this would
> be an improvement, and unless somebody cares to provide such evidence,
> we shouldn't risk changing code that's so full of portability hazards.

OK, with you and Robert preferring this as a fallback instead of a
preferred way, and with Heikki's "I don't care much", I'll rework the
patch.

Thanks,

Martin
--
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)
On 19.12.2011 17:39, Martin Pitt wrote:
> Heikki Linnakangas [2011-12-19 17:09 +0200]:
>> One thing that caught my eye: if you use __sync_lock_and_test() to
>> implement S_LOCK(), you really should be using __sync_lock_release()
>> for S_UNLOCK().
>
> Right, the patch I send does that:
>
>   #define S_UNLOCK(lock) __sync_lock_release(lock)

Oh, I'm blind...

Can you comment on or test whether the existing TAS implementation is
broken on ARM, because the existing S_UNLOCK() on ARM doesn't act as a
memory barrier? If we're going to keep the existing snippet of inline
assembly for those variants of ARM where it works, we need to either fix
S_UNLOCK or convince ourselves that it's not broken.

I'll take a closer look at the same on Itanium.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
On 19.12.2011 17:25, Tom Lane wrote:
> Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:
>> On 19.12.2011 16:31, Robert Haas wrote:
>> Actually, I believe our Itanium (and possibly ARM, too) implementation
>> of S_UNLOCK() is wrong as it is.
>
> Hmm.  Anybody got a large itanium box we could play with?  If it is
> wrong, I'd expect it would show up pretty quickly under pgbench or
> similar.

We've been running pgbench heavily recently in the company on large HP
Itanium boxes with 32 and 64 cores, and haven't seen any issues. I'm not
sure how it would manifest itself, though. It's also possible that while
it's a genuine bug at the source level, it happens to be masked by some
compiler decisions. It sure looks wrong.

I'll try to construct a self-contained test program to exercise that.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

[PATCH v2] Use CC atomic builtins as a fallback

From
Martin Pitt
Date:
Robert Haas [2011-12-19  9:31 -0500]:
> -1.  Absent some evidence that gcc's implementations are superior to
> ours, I think we should not change stuff that works now.  That's
> likely to lead to subtle bugs that are hard to find and perhaps
> dependent on the exact compiler version used.
>
> But I'm completely cool with doing this for platforms where we haven't
> otherwise got an implementation.  Any port in a storm.

The updated patch only uses the gcc builtins if there is no explicit
implementation, but drops the arm one as this doesn't work on ARMv7
and newer, as stated in the original mail.

Thanks,

Martin
--
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)

Attachment
On Mon, Dec 19, 2011 at 05:09:11PM +0200, Heikki Linnakangas wrote:
> Actually, I believe our Itanium (and possibly ARM, too) implementation
> of S_UNLOCK() is wrong as it is. There is no platform-specific
> S_UNLOCK() defined for Itanium, so we're using the generic
> implementation:
>
> #if !defined(S_UNLOCK)
> #define S_UNLOCK(lock)        (*((volatile slock_t *) (lock)) = 0)
> #endif     /* S_UNLOCK */
>
> That is not sufficient on platforms with a weak memory model, like Itanium.

Other processors may observe the lock as held after its release, but there's no
correctness problem.
Noah Misch <noah@leadboat.com> writes:
> On Mon, Dec 19, 2011 at 05:09:11PM +0200, Heikki Linnakangas wrote:
>> That is not sufficient on platforms with a weak memory model, like Itanium.

> Other processors may observe the lock as held after its release, but there's no
> correctness problem.

How weak is the memory model, exactly?

A correctness problem would ensue if out-of-order stores are possible,
ie other processors could observe the lock being freed (and then acquire
it) before seeing changes to shared variables that had been made while
holding the lock.

I'm a little dubious that this applies to Itanium, because I don't see
any memory fence instruction in the TAS macro.  If we needed one in
S_UNLOCK I'd rather expect there to be one in TAS as well.

            regards, tom lane
On 19.12.2011 22:03, Noah Misch wrote:
> On Mon, Dec 19, 2011 at 05:09:11PM +0200, Heikki Linnakangas wrote:
>> Actually, I believe our Itanium (and possibly ARM, too) implementation
>> of S_UNLOCK() is wrong as it is. There is no platform-specific
>> S_UNLOCK() defined for Itanium, so we're using the generic
>> implementation:
>>
>> #if !defined(S_UNLOCK)
>> #define S_UNLOCK(lock)        (*((volatile slock_t *) (lock)) = 0)
>> #endif     /* S_UNLOCK */
>>
>> That is not sufficient on platforms with a weak memory model, like Itanium.
>
> Other processors may observe the lock as held after its release, but there's no
> correctness problem.

I thought it would also be legal for a store to become visible to other
processors, *after* the releasing of the lock. Which would be bad. For
example, if you have:

volatile bool *shared = ...
SpinLockAcquire(lock);
shared->variable = true;
SpinLockRelease(lock);
more code

The macro-expanded code would look like:

<test and set> lock
shared->variable = true;
(*((volatile slock_t *) (lock)) = 0;
more code

I believe on an architecture with a weak memory model, like Itanium,
there's no guarantee that the assignments will happen in that order. The
lock might appear as released *before* the variable is set.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com
On 19.12.2011 22:12, Tom Lane wrote:
> Noah Misch<noah@leadboat.com>  writes:
>> On Mon, Dec 19, 2011 at 05:09:11PM +0200, Heikki Linnakangas wrote:
>>> That is not sufficient on platforms with a weak memory model, like Itanium.
>
>> Other processors may observe the lock as held after its release, but there's no
>> correctness problem.
>
> How weak is the memory model, exactly?
>
> A correctness problem would ensue if out-of-order stores are possible,
> ie other processors could observe the lock being freed (and then acquire
> it) before seeing changes to shared variables that had been made while
> holding the lock.
>
> I'm a little dubious that this applies to Itanium, because I don't see
> any memory fence instruction in the TAS macro.  If we needed one in
> S_UNLOCK I'd rather expect there to be one in TAS as well.

There's a pretty good manual called "Implementing Spinlocks on Itanium
and PA-RISC" from HP at:
http://h21007.www2.hp.com/portal/download/files/unprot/itanium/spinlocks.pdf.
It says, in section "3.2.1 Try to get a spinlock":

 > Note also, that the ‘xchg’ instruction always
 > has the ‘acquire’ semantics, so it enforces the correct memory ordering.

But the current implementation seems to be safe, anyway.  In section
"3.2.3 Freeing a spinlock", that manual says:

 > Note, that a simple assignment statement into a volatile variable
will not work, as we are
 > assuming that the +Ovolatile=__unordered compile option is being used.

So +Ovolatile=__unordered would allow the kind of optimization that I
thought was possible, and we would have a problem if we used it. But the
default is more conservative, and a simple assignment does work.

I compiled the attached test program on an HP Itanium box, using the
same flags you get from PostgreSQL's configure on that box. The relevant
assembler output is:

         xchg4           r14 = [r15], r14           // M [slocktest.c: 66/3]
//file/line/col slocktest.c/67/3
         ld1.acq         r16 = [r11]                // M [slocktest.c: 67/3]
         nop.i           0                          // I
//file/line/col slocktest.c/68/3
         st1.rel         [r11] = r10             ;; // M [slocktest.c: 68/3]
//file/line/col slocktest.c/69/3
         st4.rel         [r15] = r0                 // M [slocktest.c: 69/3]
//file/line/col slocktest.c/70/1


The trick I missed is that the compiler attaches .rel to all the stores
and .acq to all the loads through a volatile pointer. gcc seems to do
the same. So we're safe.


It would be interesting to test whether using +Ovolatile=__unordered
would make a difference to performance. Tacking those memory fence
modifiers to all the volatile loads and stores might be expensive.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment
On Mon, Dec 19, 2011 at 11:25:06PM +0200, Heikki Linnakangas wrote:
> I compiled the attached test program on an HP Itanium box, using the
> same flags you get from PostgreSQL's configure on that box. The relevant
> assembler output is:
>
>         xchg4           r14 = [r15], r14           // M [slocktest.c: 66/3]
> //file/line/col slocktest.c/67/3
>         ld1.acq         r16 = [r11]                // M [slocktest.c: 67/3]
>         nop.i           0                          // I
> //file/line/col slocktest.c/68/3
>         st1.rel         [r11] = r10             ;; // M [slocktest.c: 68/3]
> //file/line/col slocktest.c/69/3
>         st4.rel         [r15] = r0                 // M [slocktest.c: 69/3]
> //file/line/col slocktest.c/70/1
>
>
> The trick I missed is that the compiler attaches .rel to all the stores
> and .acq to all the loads through a volatile pointer. gcc seems to do
> the same. So we're safe.

The Intel compiler appears not to follow this convention:

http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/cpp/lin/compiler_c/copts/ccpp_options/option_qserialize-volatile.htm

If you have that compiler installed, could you see which opcode it generates?

Thanks,
nm

Re: [PATCH v2] Use CC atomic builtins as a fallback

From
Tom Lane
Date:
Martin Pitt <mpitt@debian.org> writes:
> The updated patch only uses the gcc builtins if there is no explicit
> implementation, but drops the arm one as this doesn't work on ARMv7
> and newer, as stated in the original mail.

Getting this thread back to the original patch ... I'm afraid that if we
apply this as-is, what will happen is that we fix ARMv7 and break older
versions.  Some googling found this, for instance:

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=33413

which suggests that (1) ARM gcc hasn't had __sync_lock_test_and_set for
very long, and (2) what it generates doesn't work pre-ARMv6.

So I'm thinking that removing the swpb ASM option is not such a good
idea.  We could possibly test for __sync_lock_test_and_set first, and
only use swpb if we're on ARM and don't have the builtin.

Another thing that is bothering me is that according to the gcc manual,
eg here,
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
__sync_lock_test_and_set is nominally provided for datatypes 1, 2, 4,
or 8 bytes in length, but the underlying hardware doesn't necessarily
support all those widths natively.  If you pick the wrong width then
you don't get an inline operation at all, but a call to some possibly
inefficient library subroutine.  I see that your patch just assumes that
"int" will be a good width for the lock type, but it's unclear to me
what that choice is based on and whether or not it might be a really bad
choice on some platforms.  A look through s_lock.h suggests that only a
minority of platforms prefer int-width locks ... but I have no idea
how many of those assembly snippets could have been coded to use a
different lock datatype without penalty.  Some other evidence that
4-byte __sync_lock_test_and_set isn't universal is here:
https://svn.boost.org/trac/boost/ticket/2525

Google is also finding some rather worrisome suggestions that
__sync_lock_test_and_set might involve a kernel call on some flavors of
ARM.  That would be pretty disastrous from a performance standpoint.

            regards, tom lane

Re: [PATCH v2] Use CC atomic builtins as a fallback

From
Martin Pitt
Date:
Hello Tom, all,

Tom Lane [2011-12-20 21:14 -0500]:
> Getting this thread back to the original patch ... I'm afraid that if we
> apply this as-is, what will happen is that we fix ARMv7 and break older
> versions.

Right, I guess it's dependent on the compiler version, too. That's why
my original patch tested for this first and used it if it was
available, then we can have any code in the #else part further down.

> So I'm thinking that removing the swpb ASM option is not such a good
> idea.  We could possibly test for __sync_lock_test_and_set first, and
> only use swpb if we're on ARM and don't have the builtin.

OK, that would be kind of a hybrid solution between the first and
second version of the patch. Can work on this (but only next year,
need to do some christmas prep, and then holidays/AFK).

> Another thing that is bothering me is that according to the gcc
> manual, eg here,
> http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
> __sync_lock_test_and_set is nominally provided for datatypes 1, 2,
> 4, or 8 bytes in length, but the underlying hardware doesn't
> necessarily support all those widths natively.  If you pick the
> wrong width then you don't get an inline operation at all, but a
> call to some possibly inefficient library subroutine.

It's even worse. Our original version of the patch used a char, and
while that worked fine on the older Babbage board, it completely broke
at runtime on e. g. a Panda board. It wasn't just slow, it caused
hangs and test failures all over the place. With int it works
flawlessly.

> I see that your patch just assumes that "int" will be a good width
> for the lock type, but it's unclear to me what that choice is based
> on and whether or not it might be a really bad choice on some
> platforms.

Unfortunately the gcc documentation doesn't give any further
conditions. As an "int" is usually meant to fit the standard word size
of a processor, if that function call/machine op code supports just
one datatype, it will most likely be "int", not something which you
have to mask on read/write (char) or potentially involves multiple
words (long/long long).

The intel definition [1] (and also in above gcc doc) says "32 or 64
bit integer", which also matches "int" (unless we are looking at some
really small architectures like the old Atmels with their 16 bit
words, but these are three magnitudes too small for PostgreSQL anyway
:) ) So with the intel compiler "char" is clearly forbidden.

Beyond that I don't have any further data.

So, I'm happy with keeping this patch in Debian/Ubuntu only for the
time being, as there we have a pretty clear idea of the supported ARM
platforms, and can actually test the patches on all the platforms. I
just found it prudent to at least bring it up here.

Thanks, and happy end-of-year holidays!

Martin

[1] http://www.cs.fsu.edu/~engelen/courses/HPC-adv/intref_cls.pdf p.  198

--
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)

Re: [PATCH v2] Use CC atomic builtins as a fallback

From
Tom Lane
Date:
Martin Pitt <mpitt@debian.org> writes:
> Tom Lane [2011-12-20 21:14 -0500]:
>> Another thing that is bothering me is that according to the gcc
>> manual, eg here,
>> http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
>> __sync_lock_test_and_set is nominally provided for datatypes 1, 2,
>> 4, or 8 bytes in length, but the underlying hardware doesn't
>> necessarily support all those widths natively.  If you pick the
>> wrong width then you don't get an inline operation at all, but a
>> call to some possibly inefficient library subroutine.

> It's even worse. Our original version of the patch used a char, and
> while that worked fine on the older Babbage board, it completely broke
> at runtime on e. g. a Panda board. It wasn't just slow, it caused
> hangs and test failures all over the place. With int it works
> flawlessly.

Yeah, that was another thing I found worrisome while googling: there
were a disturbingly large number of claims that __sync_lock_test_and_set
and/or __sync_lock_release were flat-out broken on various combinations
of gcc version and platform.  After reading that, there is no way at all
that I'd accept your original patch to use these functions everywhere.

For the moment I'm inclined to consider using these functions *only* on
ARM, so as to limit our exposure to such bugs.  That would also limit
the risk of using an inappropriate choice of lock width.

            regards, tom lane

[PATCH v3] Use CC atomic builtins on ARM

From
Martin Pitt
Date:
Hello Tom, all,

happy new year everyone!

Tom Lane [2011-12-20 21:14 -0500]:
> So I'm thinking that removing the swpb ASM option is not such a good
> idea.  We could possibly test for __sync_lock_test_and_set first, and
> only use swpb if we're on ARM and don't have the builtin.

Tom Lane [2011-12-21 10:55 -0500]:
> Yeah, that was another thing I found worrisome while googling: there
> were a disturbingly large number of claims that __sync_lock_test_and_set
> and/or __sync_lock_release were flat-out broken on various combinations
> of gcc version and platform.  After reading that, there is no way at all
> that I'd accept your original patch to use these functions everywhere.
>
> For the moment I'm inclined to consider using these functions *only* on
> ARM, so as to limit our exposure to such bugs.  That would also limit
> the risk of using an inappropriate choice of lock width.

OK, fair enough. New patch attached, which does exactly this now.
Third time is the charm!

Thanks,

Martin

--
Martin Pitt                        | http://www.piware.de
Ubuntu Developer (www.ubuntu.com)  | Debian Developer  (www.debian.org)

Attachment

Re: [PATCH v3] Use CC atomic builtins on ARM

From
Tom Lane
Date:
Martin Pitt <mpitt@debian.org> writes:
> Tom Lane [2011-12-21 10:55 -0500]:
>> For the moment I'm inclined to consider using these functions *only* on
>> ARM, so as to limit our exposure to such bugs.  That would also limit
>> the risk of using an inappropriate choice of lock width.

> OK, fair enough. New patch attached, which does exactly this now.
> Third time is the charm!

I've got some minor cosmetic complaints about this, but unless anyone's
got something substantive I will fix those, apply and back-patch it.
There are some folks trying to get a Fedora ARM port running and
I think this might help them with Postgres ...
http://arm.koji.fedoraproject.org/koji/getfile?taskID=246852&name=build.log

            regards, tom lane
Sorry for the late reply, but Heikki, can you get this Itanium
information into s_lock.h as a comment, particularly the information
about the +Ovolatile=__unordered flag?

---------------------------------------------------------------------------

On Mon, Dec 19, 2011 at 11:25:06PM +0200, Heikki Linnakangas wrote:
> On 19.12.2011 22:12, Tom Lane wrote:
> >Noah Misch<noah@leadboat.com>  writes:
> >>On Mon, Dec 19, 2011 at 05:09:11PM +0200, Heikki Linnakangas wrote:
> >>>That is not sufficient on platforms with a weak memory model, like Itanium.
> >
> >>Other processors may observe the lock as held after its release, but there's no
> >>correctness problem.
> >
> >How weak is the memory model, exactly?
> >
> >A correctness problem would ensue if out-of-order stores are possible,
> >ie other processors could observe the lock being freed (and then acquire
> >it) before seeing changes to shared variables that had been made while
> >holding the lock.
> >
> >I'm a little dubious that this applies to Itanium, because I don't see
> >any memory fence instruction in the TAS macro.  If we needed one in
> >S_UNLOCK I'd rather expect there to be one in TAS as well.
>
> There's a pretty good manual called "Implementing Spinlocks on
> Itanium and PA-RISC" from HP at: http://h21007.www2.hp.com/portal/download/files/unprot/itanium/spinlocks.pdf.
> It says, in section "3.2.1 Try to get a spinlock":
>
> > Note also, that the ‘xchg’ instruction always
> > has the ‘acquire’ semantics, so it enforces the correct memory ordering.
>
> But the current implementation seems to be safe, anyway.  In section
> "3.2.3 Freeing a spinlock", that manual says:
>
> > Note, that a simple assignment statement into a volatile variable
> will not work, as we are
> > assuming that the +Ovolatile=__unordered compile option is being used.
>
> So +Ovolatile=__unordered would allow the kind of optimization that
> I thought was possible, and we would have a problem if we used it.
> But the default is more conservative, and a simple assignment does
> work.
>
> I compiled the attached test program on an HP Itanium box, using the
> same flags you get from PostgreSQL's configure on that box. The
> relevant assembler output is:
>
>         xchg4           r14 = [r15], r14           // M [slocktest.c: 66/3]
> //file/line/col slocktest.c/67/3
>         ld1.acq         r16 = [r11]                // M [slocktest.c: 67/3]
>         nop.i           0                          // I
> //file/line/col slocktest.c/68/3
>         st1.rel         [r11] = r10             ;; // M [slocktest.c: 68/3]
> //file/line/col slocktest.c/69/3
>         st4.rel         [r15] = r0                 // M [slocktest.c: 69/3]
> //file/line/col slocktest.c/70/1
>
>
> The trick I missed is that the compiler attaches .rel to all the
> stores and .acq to all the loads through a volatile pointer. gcc
> seems to do the same. So we're safe.
>
>
> It would be interesting to test whether using +Ovolatile=__unordered
> would make a difference to performance. Tacking those memory fence
> modifiers to all the volatile loads and stores might be expensive.
>
> --
>   Heikki Linnakangas
>   EnterpriseDB   http://www.enterprisedb.com

> #include <stdio.h>
>
> #if defined(__GNUC__) || defined(__INTEL_COMPILER)
> #if defined(__ia64__) || defined(__ia64)    /* Intel Itanium */
> #define HAS_TEST_AND_SET
>
> typedef unsigned int slock_t;
>
> #define TAS(lock) tas(lock)
>
> /* On IA64, it's a win to use a non-locking test before the xchg proper */
> #define TAS_SPIN(lock)    (*(lock) ? 1 : TAS(lock))
>
> #ifndef __INTEL_COMPILER
>
> static __inline__ int
> tas(volatile slock_t *lock)
> {
>     long int    ret;
>
>     __asm__ __volatile__(
>         "    xchg4     %0=%1,%2    \n"
> :        "=r"(ret), "+m"(*lock)
> :        "r"(1)
> :        "memory");
>     return (int) ret;
> }
>
> #else /* __INTEL_COMPILER */
>
> static __inline__ int
> tas(volatile slock_t *lock)
> {
>     int        ret;
>
>     ret = _InterlockedExchange(lock,1);    /* this is a xchg asm macro */
>
>     return ret;
> }
>
> #endif /* __INTEL_COMPILER */
> #endif     /* __ia64__ || __ia64 */
> #endif    /* defined(__GNUC__) || defined(__INTEL_COMPILER) */
>
> #if defined(__hpux) && defined(__ia64) && !defined(__GNUC__)
>
> #define HAS_TEST_AND_SET
>
> typedef unsigned int slock_t;
>
> #include <ia64/sys/inline.h>
> #define TAS(lock) _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE)
> /* On IA64, it's a win to use a non-locking test before the xchg proper */
> #define TAS_SPIN(lock)    (*(lock) ? 1 : TAS(lock))
>
> #endif    /* HPUX on IA64, non gcc */
>
> slock_t lock;
> char shared2;
>
> int main(int argc, char **argv)
> {
>   volatile char *p = &shared2;
>   char local;
>
>   TAS(&lock);
>   local = *p;
>   *p = 123;
>   *((volatile slock_t *) &lock) = 0;
> }

>
> --
> Sent via pgsql-bugs mailing list (pgsql-bugs@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-bugs


--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
On 03.02.2012 02:48, Bruce Momjian wrote:
> Sorry for the late reply, but Heikki, can you get this Itanium
> information into s_lock.h as a comment, particularly the information
> about the +Ovolatile=__unordered flag?

Good idea. I came up with the attached, hope that explains it.

Looking back at the discussions, we concluded that the current code is
safe on gcc, because it implicitly adds the .rel/.acq opcodes to
volatile accesses, and HP's compiler does the same as long as you don't
explicitly disable it with +Ovolatile=__unordered. But what about
Intel's icc compiler? Presumably it's also safe, but looking at Intel's
manuals that I found, I'm not completely sure about it. There's an
option, -m[no-]serialize-volatile, that controls it, but I couldn't
figure out which is the default. Looking at the docs on that from Intel
that I found [1], it seems to me that on Linux, the default is *not*
safe, but on Windows it is.

Sergey, you have dugong in the buildfarm that uses Intel's compiler on
Itanium. Could you verify whether the -mno-serialize-volatile is the
default? If you could for example extract the assembler code generated
by icc for xlog.c, and send it over. Whether it's generating the
.rel/.acq opcodes should be easy to see in the generated code of the
XLogGetLastRemoved() function, for example, which doesn't do much else
than grab a spinlock. On gcc, the -s flag generates the assembly files,
I presume it's the same on icc.

Perhaps we should set those compiler flags explicitly in configure,
regardless of the defaults.

[1]

http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/cpp/lin/compiler_c/copts/ccpp_options/option_qserialize-volatile.htm

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

Attachment
On 06.02.2012 13:24, Heikki Linnakangas wrote:
> On 03.02.2012 02:48, Bruce Momjian wrote:
>> Sorry for the late reply, but Heikki, can you get this Itanium
>> information into s_lock.h as a comment, particularly the information
>> about the +Ovolatile=__unordered flag?
>
> Good idea. I came up with the attached, hope that explains it.

Committed these additional comments.

> Looking back at the discussions, we concluded that the current code is
> safe on gcc, because it implicitly adds the .rel/.acq opcodes to
> volatile accesses, and HP's compiler does the same as long as you don't
> explicitly disable it with +Ovolatile=__unordered. But what about
> Intel's icc compiler? Presumably it's also safe, but looking at Intel's
> manuals that I found, I'm not completely sure about it. There's an
> option, -m[no-]serialize-volatile, that controls it, but I couldn't
> figure out which is the default. Looking at the docs on that from Intel
> that I found [1], it seems to me that on Linux, the default is *not*
> safe, but on Windows it is.
>
> Sergey, you have dugong in the buildfarm that uses Intel's compiler on
> Itanium. Could you verify whether the -mno-serialize-volatile is the
> default? If you could for example extract the assembler code generated
> by icc for xlog.c, and send it over. Whether it's generating the
> .rel/.acq opcodes should be easy to see in the generated code of the
> XLogGetLastRemoved() function, for example, which doesn't do much else
> than grab a spinlock. On gcc, the -s flag generates the assembly files,
> I presume it's the same on icc.

Sergey confirmed off-list that icc defaults to -mserialize-volatile, so
we're safe.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com