Thread: spinlocks on HP-UX
I was able to obtain access to a 32-core HP-UX server.  I repeated the
pgbench -S testing that I have previously done on Linux, and found that
the results were not too good.  Here are the results at scale factor 100,
on 9.2devel, with various numbers of clients.  Five minute runs,
shared_buffers=8GB.

1:tps = 5590.070816 (including connections establishing)
8:tps = 37660.233932 (including connections establishing)
16:tps = 67366.099286 (including connections establishing)
32:tps = 82781.624665 (including connections establishing)
48:tps = 18589.995074 (including connections establishing)
64:tps = 16424.661371 (including connections establishing)

And just for comparison, here are the numbers at scale factor 1000:

1:tps = 4751.768608 (including connections establishing)
8:tps = 33621.474490 (including connections establishing)
16:tps = 58959.043171 (including connections establishing)
32:tps = 78801.265189 (including connections establishing)
48:tps = 21635.234969 (including connections establishing)
64:tps = 18611.863567 (including connections establishing)

After mulling over the vmstat output for a bit, I began to suspect
spinlock contention.  I took a look at a document called "Implementing
Spinlocks on the Intel Itanium Architecture and PA-RISC", by Tor Ekqvist
and David Graves, available via the HP web site, which states that when
spinning on a spinlock on these machines, you should use a regular,
unlocked test first and use the atomic test only when the unlocked test
looks OK.

I tried implementing this in two ways, and both produced results which
are FAR superior to our current implementation.  First, I did this:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -726,7 +726,7 @@ tas(volatile slock_t *lock)
 typedef unsigned int slock_t;
 
 #include <ia64/sys/inline.h>
-#define TAS(lock) _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE)
+#define TAS(lock) (*(lock) ? 1 : _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE))
 
 #endif   /* HPUX on IA64, non gcc */

That resulted in these numbers.  Scale factor 100:

1:tps = 5569.911714 (including connections establishing)
8:tps = 37365.364468 (including connections establishing)
16:tps = 63596.261875 (including connections establishing)
32:tps = 95948.157678 (including connections establishing)
48:tps = 90708.253920 (including connections establishing)
64:tps = 100109.065744 (including connections establishing)

Scale factor 1000:

1:tps = 4878.332996 (including connections establishing)
8:tps = 33245.469907 (including connections establishing)
16:tps = 56708.424880 (including connections establishing)
48:tps = 69652.232635 (including connections establishing)
64:tps = 70593.208637 (including connections establishing)

Then, I did this:

--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -96,7 +96,7 @@ s_lock(volatile slock_t *lock, const char *file, int line)
 	int			delays = 0;
 	int			cur_delay = 0;
 
-	while (TAS(lock))
+	while (*lock ? 1 : TAS(lock))
 	{
 		/* CPU-specific delay each time through the loop */
 		SPIN_DELAY();

That resulted in these numbers, at scale factor 100:

1:tps = 5564.059494 (including connections establishing)
8:tps = 37487.090798 (including connections establishing)
16:tps = 66061.524760 (including connections establishing)
32:tps = 96535.523905 (including connections establishing)
48:tps = 92031.618360 (including connections establishing)
64:tps = 106813.631701 (including connections establishing)

And at scale factor 1000:

1:tps = 4980.338246 (including connections establishing)
8:tps = 33576.680072 (including connections establishing)
16:tps = 55618.677975 (including connections establishing)
32:tps = 73589.442746 (including connections establishing)
48:tps = 70987.026228 (including connections establishing)

Not sure why I am missing the 64-client results for that last set of
tests, but no matter.  Of course, we can't apply the second patch as it
stands, because I tested it on x86 and it loses.  But it seems pretty
clear we need to do it at least for this architecture...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
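The technique being described here is the classic "test and test-and-set"
spinlock: spin on an ordinary read, which can be satisfied from the local
cache, and attempt the bus-locking atomic exchange only when the lock
appears to be free.  The following is a minimal, self-contained sketch of
that idea, added for illustration; it uses GCC's __sync builtins as a
stand-in for the platform-specific TAS primitive and is not the actual
PostgreSQL code.

/* Build with: cc -O2 -pthread ttas_demo.c */
#include <pthread.h>
#include <stdio.h>

typedef unsigned int slock_t;

static slock_t lock;
static long counter;

/* Atomic exchange; a nonzero result means the lock was already held. */
static int
tas(volatile slock_t *l)
{
	return __sync_lock_test_and_set(l, 1) != 0;
}

static void
spin_lock(volatile slock_t *l)
{
	/*
	 * The unlocked test (*l) keeps the spin loop reading from cache; the
	 * bus-locking exchange is attempted only when the lock looks free.
	 * A real implementation would also delay or back off in this loop.
	 */
	while (*l ? 1 : tas(l))
		;
}

static void
spin_unlock(volatile slock_t *l)
{
	__sync_lock_release(l);		/* store 0 with release semantics */
}

static void *
worker(void *arg)
{
	(void) arg;
	for (int i = 0; i < 1000000; i++)
	{
		spin_lock(&lock);
		counter++;
		spin_unlock(&lock);
	}
	return NULL;
}

int
main(void)
{
	pthread_t	threads[4];

	for (int i = 0; i < 4; i++)
		pthread_create(&threads[i], NULL, worker, NULL);
	for (int i = 0; i < 4; i++)
		pthread_join(threads[i], NULL);
	printf("counter = %ld (expected 4000000)\n", counter);
	return 0;
}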
Pity that this patch works only on HP-UX :(.

But I have an idea: maybe when the executor stops at a locked row, it
should process the next row instead of waiting.  Of course, only if the
query does not contain "order by" or windowing functions.

-- 
------------
pasman
Robert Haas <robertmhaas@gmail.com> writes:
> First, I did this:

> -#define TAS(lock) _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE)
> +#define TAS(lock) (*(lock) ? 1 : _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE))

Seems reasonable, and similar to x86 logic.

> Then, I did this:

> -       while (TAS(lock))
> +       while (*lock ? 1 : TAS(lock))

Er, what?  That sure looks like a manual application of what you'd
already done in the TAS macro.

> Of course, we can't apply the second patch as it stands, because I
> tested it on x86 and it loses.  But it seems pretty clear we need to
> do it at least for this architecture...

Please clarify: when you say "this architecture", are you talking about
IA64 or PA-RISC?  Is there any reason to think that this is specific to
HP-UX rather than any other system on the same architecture?  (I'm sure
I can get access to some IA64 clusters at Red Hat, though maybe not
64-core ones.)

I don't have an objection to the TAS macro change, but I do object to
fooling with the hardware-independent code in s_lock.c ... especially
when the additional improvement seems barely above the noise threshold.
You ought to be able to do whatever you need inside the TAS macro.

			regards, tom lane
On Sun, Aug 28, 2011 at 11:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> First, I did this:
>
>> -#define TAS(lock) _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE)
>> +#define TAS(lock) (*(lock) ? 1 : _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE))
>
> Seems reasonable, and similar to x86 logic.
>
>> Then, I did this:
>
>> -       while (TAS(lock))
>> +       while (*lock ? 1 : TAS(lock))
>
> Er, what?  That sure looks like a manual application of what you'd
> already done in the TAS macro.

Sorry, I blew through that a little too blithely.  If you change TAS()
itself, then even the very first attempt to acquire the lock will try
the unlocked instruction first, whereas changing s_lock() allows you to
do something different in the contended case than you do in the
uncontended case.  We COULD just change the TAS() macro since, in this
case, it seems to make only a minor difference, but what I was thinking
is that we could change s_lock.h to define two macros, TAS() and
TAS_SPIN().  If a particular architecture defines TAS() but not
TAS_SPIN(), then we define TAS_SPIN(x) to be TAS(x).  Then, S_LOCK() can
stay as-is - calling TAS() - but s_lock() can call TAS_SPIN(), which
will normally be the same as TAS() but can be made different on any
architecture where the retry loop should do something different than
the initial attempt.

> Please clarify: when you say "this architecture", are you talking about
> IA64 or PA-RISC?  Is there any reason to think that this is specific to
> HP-UX rather than any other system on the same architecture?  (I'm sure
> I can get access to some IA64 clusters at Red Hat, though maybe not
> 64-core ones.)

I tested on IA64; I don't currently have access to a PA-RISC box.  The
documentation I'm looking at implies that the same approach would be
desirable there, but that's just an unsubstantiated rumor at this
point....

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
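In outline, the TAS()/TAS_SPIN() arrangement Robert describes could look
something like the sketch below.  The macro names TAS, TAS_SPIN,
SPIN_DELAY and the slock_t typedef follow the proposal; the __sync-based
placeholder body and the demo function are assumptions added here for
illustration, not the code as committed.

typedef unsigned int slock_t;

/* Placeholder for the platform's atomic exchange primitive. */
#define TAS(lock)		(__sync_lock_test_and_set((lock), 1) != 0)

/*
 * A port that wants the unlocked test only in the contended path defines
 * TAS_SPIN itself, e.g.:
 */
#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))

/* Ports that define only TAS() fall back to the old behavior: */
#ifndef TAS_SPIN
#define TAS_SPIN(lock)	TAS(lock)
#endif

#define SPIN_DELAY()	/* the real macro issues a CPU-specific delay hint */

/*
 * The initial, usually uncontended attempt (S_LOCK) keeps using TAS();
 * only the retry loop in s_lock() switches to TAS_SPIN().
 */
static void
s_lock_sketch(volatile slock_t *lock)
{
	if (!TAS(lock))
		return;					/* got the lock on the first try */

	while (TAS_SPIN(lock))
		SPIN_DELAY();			/* spin, mostly on the cheap unlocked read */
}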
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Aug 28, 2011 at 11:35 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> Then, I did this:
>>> -       while (TAS(lock))
>>> +       while (*lock ? 1 : TAS(lock))

>> Er, what?  That sure looks like a manual application of what you'd
>> already done in the TAS macro.

> Sorry, I blew through that a little too blithely.  If you change TAS()
> itself, then even the very first attempt to acquire the lock will try
> the unlocked instruction first, whereas changing s_lock() allows you
> to do something different in the contended case than you do in the
> uncontended case.

Yeah, I figured out that was probably what you meant a little while
later.  I found a 64-CPU IA64 machine in Red Hat's test labs and am
currently trying to replicate your results; report to follow.

> We COULD just change the TAS() macro since, in this
> case, it seems to make only a minor difference, but what I was
> thinking is that we could change s_lock.h to define two macros, TAS()
> and TAS_SPIN().

Yeah, I was thinking along the same lines, though perhaps the name of
the new macro could use a little bikeshedding.

The comments in s_lock.h note that the unlocked test in x86 TAS is of
uncertain usefulness.  It seems entirely possible to me that we ought to
use a similar design on x86, ie, use the unlocked test only once we've
entered the delay loop.

>> Please clarify: when you say "this architecture", are you talking about
>> IA64 or PA-RISC?  Is there any reason to think that this is specific to
>> HP-UX rather than any other system on the same architecture?  (I'm sure
>> I can get access to some IA64 clusters at Red Hat, though maybe not
>> 64-core ones.)

> I tested on IA64; I don't currently have access to a PA-RISC box.  The
> documentation I'm looking at implies that the same approach would be
> desirable there, but that's just an unsubstantiated rumor at this
> point....

Well, I've got a PA-RISC box, but it's only a single processor so it's
not gonna prove much.  Anybody?

			regards, tom lane
I wrote:
> Yeah, I figured out that was probably what you meant a little while
> later.  I found a 64-CPU IA64 machine in Red Hat's test labs and am
> currently trying to replicate your results; report to follow.

OK, these results are on a 64-processor SGI IA64 machine (AFAICT, 64
independent sockets, no hyperthreading or any funny business); 124GB in
32 NUMA nodes; running RHEL5.7, gcc 4.1.2.  I built today's git head with
--enable-debug (but not --enable-cassert) and ran with all default
configuration settings except shared_buffers = 8GB and
max_connections = 200.  The test database is initialized at -s 100.
I did not change the database between runs, but restarted the postmaster
and then did this to warm the caches a tad:

pgbench -c 1 -j 1 -S -T 30 bench

Per-run pgbench parameters are as shown below --- note in particular
that I assigned one pgbench thread per 8 backends.  The numbers are
fairly variable even with 5-minute runs; I did each series twice so you
could get a feeling for how much.

Today's git head:

pgbench -c 1 -j 1 -S -T 300 bench     tps = 5835.213934 (including ...
pgbench -c 2 -j 1 -S -T 300 bench     tps = 8499.223161 (including ...
pgbench -c 8 -j 1 -S -T 300 bench     tps = 15197.126952 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 30803.255561 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 65795.356797 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 81644.914241 (including ...
pgbench -c 96 -j 12 -S -T 300 bench   tps = 40059.202836 (including ...
pgbench -c 128 -j 16 -S -T 300 bench  tps = 21309.615001 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench     tps = 5787.310115 (including ...
pgbench -c 2 -j 1 -S -T 300 bench     tps = 8747.104236 (including ...
pgbench -c 8 -j 1 -S -T 300 bench     tps = 14655.369995 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 28287.254924 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 61614.715187 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 79754.640518 (including ...
pgbench -c 96 -j 12 -S -T 300 bench   tps = 40334.994324 (including ...
pgbench -c 128 -j 16 -S -T 300 bench  tps = 23285.271257 (including ...

With modified TAS macro (see patch 1 below):

pgbench -c 1 -j 1 -S -T 300 bench     tps = 6171.454468 (including ...
pgbench -c 2 -j 1 -S -T 300 bench     tps = 8709.003728 (including ...
pgbench -c 8 -j 1 -S -T 300 bench     tps = 14902.731035 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 29789.744482 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 59991.549128 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 117369.287466 (including ...
pgbench -c 96 -j 12 -S -T 300 bench   tps = 112583.144495 (including ...
pgbench -c 128 -j 16 -S -T 300 bench  tps = 110231.305282 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench     tps = 5670.097936 (including ...
pgbench -c 2 -j 1 -S -T 300 bench     tps = 8230.786940 (including ...
pgbench -c 8 -j 1 -S -T 300 bench     tps = 14785.952481 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 29335.875139 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 59605.433837 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 108884.294519 (including ...
pgbench -c 96 -j 12 -S -T 300 bench   tps = 110387.439978 (including ...
pgbench -c 128 -j 16 -S -T 300 bench  tps = 109046.121191 (including ...

With unlocked test in s_lock.c delay loop only (patch 2 below):

pgbench -c 1 -j 1 -S -T 300 bench     tps = 5426.491088 (including ...
pgbench -c 2 -j 1 -S -T 300 bench     tps = 8787.939425 (including ...
pgbench -c 8 -j 1 -S -T 300 bench     tps = 15720.801359 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 33711.102718 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 61829.180234 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 109781.655020 (including ...
pgbench -c 96 -j 12 -S -T 300 bench   tps = 107132.848280 (including ...
pgbench -c 128 -j 16 -S -T 300 bench  tps = 106533.630986 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench     tps = 5705.283316 (including ...
pgbench -c 2 -j 1 -S -T 300 bench     tps = 8442.798662 (including ...
pgbench -c 8 -j 1 -S -T 300 bench     tps = 14423.723837 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 29112.751995 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 62258.984033 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 107741.988800 (including ...
pgbench -c 96 -j 12 -S -T 300 bench   tps = 107138.968981 (including ...
pgbench -c 128 -j 16 -S -T 300 bench  tps = 106110.215138 (including ...

So this pretty well confirms Robert's results, in particular that all of
the win from an unlocked test comes from using it in the delay loop.
Given the lack of evidence that a general change in TAS() is beneficial,
I'm inclined to vote against it, on the grounds that the extra test is
surely a loss at some level when there is not contention.  (IOW, +1 for
inventing a second macro to use in the delay loop only.)

We ought to do similar tests on other architectures.  I found some
lots-o-processors x86_64 machines at Red Hat, but they don't seem to own
any PPC systems with more than 8 processors.  Anybody have big iron with
other non-Intel chips?

			regards, tom lane

Patch 1: change TAS globally, non-HPUX code:

*** src/include/storage/s_lock.h.orig	Sat Jan  1 13:27:24 2011
--- src/include/storage/s_lock.h	Sun Aug 28 13:32:47 2011
***************
*** 228,233 ****
--- 228,240 ----
  {
  	long int	ret;
  
+ 	/*
+ 	 * Use a non-locking test before the locking instruction proper.  This
+ 	 * appears to be a very significant win on many-core IA64.
+ 	 */
+ 	if (*lock)
+ 		return 1;
+ 
  	__asm__ __volatile__(
  		"	xchg4 	%0=%1,%2	\n"
  :		"=r"(ret), "+m"(*lock)
***************
*** 243,248 ****
--- 250,262 ----
  {
  	int		ret;
  
+ 	/*
+ 	 * Use a non-locking test before the locking instruction proper.  This
+ 	 * appears to be a very significant win on many-core IA64.
+ 	 */
+ 	if (*lock)
+ 		return 1;
+ 
  	ret = _InterlockedExchange(lock,1);	/* this is a xchg asm macro */
  
  	return ret;

Patch 2: change s_lock only (same as Robert's quick hack):

*** src/backend/storage/lmgr/s_lock.c.orig	Sat Jan  1 13:27:09 2011
--- src/backend/storage/lmgr/s_lock.c	Sun Aug 28 14:02:29 2011
***************
*** 96,102 ****
  	int			delays = 0;
  	int			cur_delay = 0;
  
! 	while (TAS(lock))
  	{
  		/* CPU-specific delay each time through the loop */
  		SPIN_DELAY();
--- 96,102 ----
  	int			delays = 0;
  	int			cur_delay = 0;
  
! 	while (*lock ? 1 : TAS(lock))
  	{
  		/* CPU-specific delay each time through the loop */
  		SPIN_DELAY();
On Sun, Aug 28, 2011 at 7:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > So this pretty well confirms Robert's results, in particular that all of > the win from an unlocked test comes from using it in the delay loop. > Given the lack of evidence that a general change in TAS() is beneficial, > I'm inclined to vote against it, on the grounds that the extra test is > surely a loss at some level when there is not contention. > (IOW, +1 for inventing a second macro to use in the delay loop only.) Beautiful. Got a naming preference for that second macro? I suggested TAS_SPIN() because it's what you use when you spin, as opposed to what you use in the uncontended case, but I'm not attached to that. > We ought to do similar tests on other architectures. I found some > lots-o-processors x86_64 machines at Red Hat, but they don't seem to > own any PPC systems with more than 8 processors. Anybody have big > iron with other non-Intel chips? Aside from PPC, it would probably be worth testing SPARC and ARM if we can find machines. Anything else is probably too old or too marginal to get excited about. AFAIK these effects don't manifest with <32 cores, so I suspect that a lot of what's in s_lock.h is irrelevant just because many of those architectures are too old to exist in 32-core versions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Sun, Aug 28, 2011 at 7:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> (IOW, +1 for inventing a second macro to use in the delay loop only.) > Beautiful. Got a naming preference for that second macro? I > suggested TAS_SPIN() because it's what you use when you spin, as > opposed to what you use in the uncontended case, but I'm not attached > to that. I had been thinking TAS_CONTENDED, but on reflection there's not a strong argument for that over TAS_SPIN. Do what you will. regards, tom lane
2011/8/28 pasman pasmański <pasman.p@gmail.com>:
> Pity that this patch works only on HP-UX :(.

Well, not really.  x86 is already well-behaved.  On a 32-core x86 box
running Linux, performance seems to plateau and level off, and then fall
off gradually.  But on ia64, performance just collapses after about 24
cores.  The fact that we don't have that problem everywhere is a good
thing, not a bad thing...

> But I have an idea: maybe when the executor stops at a locked row, it
> should process the next row instead of waiting.
>
> Of course, only if the query does not contain "order by" or windowing
> functions.

That wouldn't really help, first of all because you'd then have to
remember to go back to that row (and chances are it would still be
contended then), and second because these aren't row-level locks anyway.
They're locks on various global data structures, such as the ProcArray.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Aug 28, 2011 at 8:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Aug 28, 2011 at 7:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> (IOW, +1 for inventing a second macro to use in the delay loop only.)
>
>> Beautiful.  Got a naming preference for that second macro?  I
>> suggested TAS_SPIN() because it's what you use when you spin, as
>> opposed to what you use in the uncontended case, but I'm not attached
>> to that.
>
> I had been thinking TAS_CONTENDED, but on reflection there's not a
> strong argument for that over TAS_SPIN.  Do what you will.

OK, done.  I think while we're tidying up here we ought to do something
about this comment:

 * ANOTHER CAUTION: be sure that TAS(), TAS_SPIN(), and S_UNLOCK() represent
 * sequence points, ie, loads and stores of other values must not be moved
 * across a lock or unlock.  In most cases it suffices to make the operation
 * be done through a "volatile" pointer.

IIUC, this is basically total nonsense.  volatile only constrains the
optimizer, not the CPU; and only with regard to the ordering of one
volatile access vs. another, NOT the ordering of volatile accesses with
any non-volatile accesses that may be floating around.  So its practical
utility for synchronization purposes seems to be nearly zero.  I think
what this should say is that we expect these operations to act as a full
memory fence.  Note that in some other worlds (e.g. Linux) a spinlock
acquisition or release is only required to act as a half-fence; that is,
other loads and stores are allowed to bleed into the critical section
but not out.  However, I think we have been assuming full-fence behavior.
In either case, claiming that the use of a volatile pointer is all that's
needed here strikes me as pretty misleading.

In the department of loose ends, there are a bunch of other things that
maybe need cleanup here: (1) the gcc/intel compiler cases on ia64, (2)
PA-RISC, (3) ARM, (4) PowerPC... and we should also perhaps reconsider
the 32-bit x86 case.  Right now TAS() says:

	/*
	 * Use a non-locking test before asserting the bus lock.  Note that the
	 * extra test appears to be a small loss on some x86 platforms and a small
	 * win on others; it's by no means clear that we should keep it.
	 */

I can't get too excited about spending a lot of effort optimizing 32-bit
PostgreSQL on any architecture at this point, but if someone else is, we
might want to check whether it makes more sense to do the non-locking
test only in TAS_SPIN().

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > OK, done. I think while we're tidying up here we ought to do > something about this comment: > * ANOTHER CAUTION: be sure that TAS(), TAS_SPIN(), and > S_UNLOCK() represent > * sequence points, ie, loads and stores of other values must not be moved > * across a lock or unlock. In most cases it suffices to make > the operation > * be done through a "volatile" pointer. > IIUC, this is basically total nonsense. It could maybe be rewritten for more clarity, but it's far from being nonsense. The responsibility for having an actual hardware memory fence instruction lies with the author of the TAS macro. But the responsibility for keeping the compiler from reordering stuff around the TAS call is that of the *user* of the TAS macro (or spinlock), and in most cases the way to do that is to make sure that both the spinlock and the shared data structure are referenced through volatile pointers. This isn't academic; we've seen bugs from failure to do that. (BTW, the reason for not being equivalently tense about LWLock-protected structures is that the compiler generally won't try to move operations around an out-of-line function call. It's the fact that spinlocks are inline-able that creates the risk here.) > In the department of loose ends, there are a bunch of other things > that maybe need cleanup here: (1) the gcc/intel compiler cases on > ia64, (2) PA-RISC, (3) ARM, (4) PowerPC... and we should also perhaps > reconsider the 32-bit x86 case. The results I got yesterday seem sufficient basis to change the gcc/intel cases for IA64, so I will go do that if you didn't already. I am also currently running tests on x86_64 and PPC using Red Hat test machines --- expect results later today. Red Hat doesn't seem to own any many-CPU machines that are 32-bit-only, and I rather wonder if there are any. It might be that it only makes sense to optimize the x86 paths for a few CPUs, in which case this test methodology may not be very helpful. regards, tom lane
On Mon, Aug 29, 2011 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> OK, done.  I think while we're tidying up here we ought to do
>> something about this comment:
>
>> * ANOTHER CAUTION: be sure that TAS(), TAS_SPIN(), and S_UNLOCK() represent
>> * sequence points, ie, loads and stores of other values must not be moved
>> * across a lock or unlock.  In most cases it suffices to make the operation
>> * be done through a "volatile" pointer.
>
>> IIUC, this is basically total nonsense.
>
> It could maybe be rewritten for more clarity, but it's far from being
> nonsense.  The responsibility for having an actual hardware memory fence
> instruction lies with the author of the TAS macro.

Right... but the comment implies that you probably don't need one, and
doesn't even mention that you MIGHT need one.

>> In the department of loose ends, there are a bunch of other things
>> that maybe need cleanup here: (1) the gcc/intel compiler cases on
>> ia64, (2) PA-RISC, (3) ARM, (4) PowerPC... and we should also perhaps
>> reconsider the 32-bit x86 case.
>
> The results I got yesterday seem sufficient basis to change the
> gcc/intel cases for IA64, so I will go do that if you didn't already.

I did not; please go ahead.  I wasn't relishing the idea of trying to
figure out how to install gcc to test that case.

> I am also currently running tests on x86_64 and PPC using Red Hat test
> machines --- expect results later today.  Red Hat doesn't seem to own
> any many-CPU machines that are 32-bit-only, and I rather wonder if there
> are any.  It might be that it only makes sense to optimize the x86 paths
> for a few CPUs, in which case this test methodology may not be very
> helpful.

FWIW, I tried spinning unlocked on x86_64 at 32 cores and got a
regression.  Don't have any PPC gear at present.

I think optimizing spinlocks for machines with only a few CPUs is
probably pointless.  Based on what I've seen so far, spinlock contention
even at 16 CPUs is negligible pretty much no matter what you do.
Whether your implementation is fast or slow isn't going to matter,
because even an inefficient implementation will account for only a
negligible percentage of the total CPU time - much less than 1% - as
opposed to a 64-core machine, where it's not that hard to find cases
where spin-waits consume the *majority* of available CPU time (recall
previous discussion of lseek).  So I'm disinclined to spend the time it
would take to tinker with the 32-bit code, because it will not matter;
for that platform we're better off spending our time installing a hash
table in ScanKeywordLookup().  And there's always a possibility that AMD
and Intel chips could be different, or there could even be differences
between different chip generations from the same manufacturer, so all in
all it seems like a pretty unrewarding exercise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Aug 29, 2011 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Robert Haas <robertmhaas@gmail.com> writes:
>>> IIUC, this is basically total nonsense.

>> It could maybe be rewritten for more clarity, but it's far from being
>> nonsense.  The responsibility for having an actual hardware memory fence
>> instruction lies with the author of the TAS macro.

> Right... but the comment implies that you probably don't need one, and
> doesn't even mention that you MIGHT need one.

I think maybe we need to split it into two paragraphs, one addressing
the TAS author and the other for the TAS user.  I'll have a go at that.

> I think optimizing spinlocks for machines with only a few CPUs is
> probably pointless.  Based on what I've seen so far, spinlock
> contention even at 16 CPUs is negligible pretty much no matter what
> you do.

We did find significant differences several years ago, testing on
machines that probably had no more than four cores; that's where the
existing comments in s_lock.h came from.  Whether those tests are still
relevant for today's source code is not obvious though.

			regards, tom lane
On Mon, Aug 29, 2011 at 11:42 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Aug 29, 2011 at 11:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> Robert Haas <robertmhaas@gmail.com> writes: >>>> IIUC, this is basically total nonsense. > >>> It could maybe be rewritten for more clarity, but it's far from being >>> nonsense. The responsibility for having an actual hardware memory fence >>> instruction lies with the author of the TAS macro. > >> Right... but the comment implies that you probably don't need one, and >> doesn't even mention that you MIGHT need one. > > I think maybe we need to split it into two paragraphs, one addressing > the TAS author and the other for the TAS user. I'll have a go at that. OK. >> I think optimizing spinlocks for machines with only a few CPUs is >> probably pointless. Based on what I've seen so far, spinlock >> contention even at 16 CPUs is negligible pretty much no matter what >> you do. > > We did find significant differences several years ago, testing on > machines that probably had no more than four cores; that's where the > existing comments in s_lock.h came from. Whether those tests are > still relevant for today's source code is not obvious though. Hmm, OK. I guess if you want to put energy into it, I'm not going to complain too much... just not sure it's the best use of time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Aug 29, 2011 at 4:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> * ANOTHER CAUTION: be sure that TAS(), TAS_SPIN(), and S_UNLOCK() represent
>> * sequence points, ie, loads and stores of other values must not be moved
>> * across a lock or unlock.  In most cases it suffices to make the operation
>> * be done through a "volatile" pointer.
>
>> IIUC, this is basically total nonsense.
>
> It could maybe be rewritten for more clarity, but it's far from being
> nonsense.

The confusion for me is that it's talking about sequence points and
volatile pointers in the same breath as if one implies the other.
Making something a volatile pointer does not create a sequence point.
It requires that the compiler not move the access or store across any
sequence points that are already there.

It might be helpful to include the actual bug that the comment is
trying to warn against, because IIRC it was a real case that caused you
to add the volatile modifiers.

-- 
greg
Greg Stark <stark@mit.edu> writes: > The confusion for me is that it's talking about sequence points and > volatile pointers in the same breath as if one implies the other. > Making something a volatile pointer dose not create a sequence point. > It requires that the compiler not move the access or store across any > sequence points that are already there. Well, no, I don't think that's the useful way to think about it. Modern compilers seem to only honor sequence points in terms of the semantics seen by the executing program; they don't believe that they can't reorder loads/stores freely. And as long as the memory in question is only used by the given program, they're right. But for memory locations shared with other threads of execution, you have to be careful about the order of accesses to those locations. My understanding of "volatile" is that the compiler is forbidden from altering the order of volatile-qualified loads and stores *relative to each other*, or from optimizing away a load or store that seems redundant in the context of the given program. That's got nothing to do with sequence points per se. > It might be helpful to include the actual bug that the comment is > trying to warn against because iirc it was a real case that caused you > to add the volatile modifiers. Well, if there were one and only one bug involved here, it wouldn't be such a critical problem ... regards, tom lane
On Mon, Aug 29, 2011 at 12:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Greg Stark <stark@mit.edu> writes:
>> The confusion for me is that it's talking about sequence points and
>> volatile pointers in the same breath as if one implies the other.
>> Making something a volatile pointer does not create a sequence point.
>> It requires that the compiler not move the access or store across any
>> sequence points that are already there.
>
> Well, no, I don't think that's the useful way to think about it.  Modern
> compilers seem to only honor sequence points in terms of the semantics
> seen by the executing program; they don't believe that they can't
> reorder loads/stores freely.

This discussion seems to miss the fact that there are two levels of
reordering that can happen.  First, the compiler can move things around.
Second, the CPU can move things around.  Even on x86, for example, a
sequence like this can be reordered:

LOAD A
STORE A
LOAD B

Even though the compiler may emit those instructions in exactly that
order, an x86 CPU can, IIUC, decide to load B before it finishes storing
A, so that the actual apparent execution order as seen by other CPUs
will be either the above, or the above with the last two instructions
reversed.  On a weakly ordered CPU, the load of B could be moved back
even further, before the LOAD of A.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
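The reordering Robert describes can actually be observed with a small
test program.  The sketch below is the classic "store buffering" litmus
test, written with plain pthreads and GCC builtins; it is an illustration
added here, not something from the thread or from PostgreSQL.  On an x86
machine it may report a nonzero number of runs in which both threads read
0, which can only happen if a store and a subsequent load of a different
location were reordered; uncommenting the __sync_synchronize() full
fences makes that outcome impossible.

/* Build with: cc -O2 -pthread sb_litmus.c */
#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 1000000

static volatile int X, Y;
static volatile int r1, r2;
static pthread_barrier_t start_barrier, stop_barrier;

static void *
thread1(void *arg)
{
	(void) arg;
	for (int i = 0; i < ITERATIONS; i++)
	{
		pthread_barrier_wait(&start_barrier);
		X = 1;
		/* __sync_synchronize(); */		/* full fence forbids r1 == r2 == 0 */
		r1 = Y;
		pthread_barrier_wait(&stop_barrier);
	}
	return NULL;
}

static void *
thread2(void *arg)
{
	(void) arg;
	for (int i = 0; i < ITERATIONS; i++)
	{
		pthread_barrier_wait(&start_barrier);
		Y = 1;
		/* __sync_synchronize(); */
		r2 = X;
		pthread_barrier_wait(&stop_barrier);
	}
	return NULL;
}

int
main(void)
{
	pthread_t	t1, t2;
	long		reorders = 0;

	pthread_barrier_init(&start_barrier, NULL, 3);
	pthread_barrier_init(&stop_barrier, NULL, 3);
	pthread_create(&t1, NULL, thread1, NULL);
	pthread_create(&t2, NULL, thread2, NULL);

	for (int i = 0; i < ITERATIONS; i++)
	{
		X = Y = 0;
		pthread_barrier_wait(&start_barrier);	/* release both threads */
		pthread_barrier_wait(&stop_barrier);	/* wait for them to finish */
		if (r1 == 0 && r2 == 0)
			reorders++;							/* both loads ran "too early" */
	}
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("store/load reorderings observed: %ld of %d\n", reorders, ITERATIONS);
	return 0;
}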
Robert Haas <robertmhaas@gmail.com> writes:
> This discussion seems to miss the fact that there are two levels of
> reordering that can happen.  First, the compiler can move things
> around.  Second, the CPU can move things around.

Right, I think that's exactly the problem with the previous wording of
that comment; it doesn't address the two logical levels involved.
I've rewritten it, see what you think.

 *	Another caution for users of these macros is that it is the caller's
 *	responsibility to ensure that the compiler doesn't re-order accesses
 *	to shared memory to precede the actual lock acquisition, or follow the
 *	lock release.  Typically we handle this by using volatile-qualified
 *	pointers to refer to both the spinlock itself and the shared data
 *	structure being accessed within the spinlocked critical section.
 *	That fixes it because compilers are not allowed to re-order accesses
 *	to volatile objects relative to other such accesses.
 *
 *	On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and
 *	S_UNLOCK() macros must further include hardware-level memory fence
 *	instructions to prevent similar re-ordering at the hardware level.
 *	TAS() and TAS_SPIN() must guarantee that loads and stores issued after
 *	the macro are not executed until the lock has been obtained.  Conversely,
 *	S_UNLOCK() must guarantee that loads and stores issued before the macro
 *	have been executed before the lock is released.

			regards, tom lane
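As a concrete illustration of the caller-side convention described in
the first paragraph of that comment, a spinlocked critical section
touches the shared structure only through a volatile-qualified pointer.
The sketch below is illustrative: the SharedCounter structure and
bump_counter() are invented names, and the SpinLockAcquire/SpinLockRelease
stand-ins are supplied only so the fragment is self-contained.

#include <stdio.h>

/* Stand-ins so this sketch compiles outside the PostgreSQL tree. */
typedef unsigned int slock_t;
#define SpinLockAcquire(lock) \
	do { while (__sync_lock_test_and_set((lock), 1)) ; } while (0)
#define SpinLockRelease(lock)	__sync_lock_release(lock)

typedef struct SharedCounter
{
	slock_t		mutex;			/* protects the field below */
	int			value;
} SharedCounter;

static int
bump_counter(SharedCounter *counter)
{
	/*
	 * Access the shared structure only through a volatile-qualified
	 * pointer, so the compiler cannot hoist the load/store of ->value
	 * above the acquire or sink it below the release.
	 */
	volatile SharedCounter *c = counter;
	int			result;

	SpinLockAcquire(&c->mutex);
	result = ++c->value;
	SpinLockRelease(&c->mutex);

	return result;
}

int
main(void)
{
	SharedCounter	sc = {0, 0};

	printf("%d\n", bump_counter(&sc));
	return 0;
}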
On Mon, Aug 29, 2011 at 5:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: > Even though the compiler may emit those instructions in exactly that > order, an x86 CPU can, IIUC, decide to load B before it finishes > storing A, so that the actual apparent execution order as seen by > other CPUs will be either the above, or the above with the last two > instructions reversed. On a weakly ordered CPU, the load of B could > be moved back even further, before the LOAD of A. My understanding of what the comment meant is that there is already a full memory barrier as far as the CPU is concerned due to the TAS or whatever, but it's important that there also be a sequence point there so that the volatile memory access isn't reordered by the compiler to occur before the memory barrier. I was going to say the same thing as Tom that sequence points and volatile pointers have nothing at all to do with each other. However my brief searching online actually seemed to indicate that in fact the compiler isn't supposed to reorder volatile memory accesses across sequence points. That seemed to make sense since I couldn't think of any other way to rigorously describe the constraints the compiler should operate under. -- greg
On Mon, Aug 29, 2011 at 1:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> This discussion seems to miss the fact that there are two levels of >> reordering that can happen. First, the compiler can move things >> around. Second, the CPU can move things around. > > Right, I think that's exactly the problem with the previous wording of > that comment; it doesn't address the two logical levels involved. > I've rewritten it, see what you think. > > * Another caution for users of these macros is that it is the caller's > * responsibility to ensure that the compiler doesn't re-order accesses > * to shared memory to precede the actual lock acquisition, or follow the > * lock release. Typically we handle this by using volatile-qualified > * pointers to refer to both the spinlock itself and the shared data > * structure being accessed within the spinlocked critical section. > * That fixes it because compilers are not allowed to re-order accesses > * to volatile objects relative to other such accesses. > * > * On platforms with weak memory ordering, the TAS(), TAS_SPIN(), and > * S_UNLOCK() macros must further include hardware-level memory fence > * instructions to prevent similar re-ordering at the hardware level. > * TAS() and TAS_SPIN() must guarantee that loads and stores issued after > * the macro are not executed until the lock has been obtained. Conversely, > * S_UNLOCK() must guarantee that loads and stores issued before the macro > * have been executed before the lock is released. That's definitely an improvement. I'm actually not convinced that we're entirely consistent here about what we require the semantics of acquiring and releasing a spinlock to be. For example, on x86 and x86_64, we acquire the lock using xchgb, which acts a full memory barrier. But when we release the lock, we just zero out the memory address, which is NOT a full memory barrier. Stores can't cross it, but non-dependent loads of different locations can back up over it. That's pretty close to a full barrier, but it isn't, quite. Now, I don't see why that should really cause any problem, at least for common cases like LWLockAcquire(). If the CPU prefetches the data protected by the lwlock after we know we've got the lock before we've actually released the spinlock and returned from LWLockAcquire(), that should be fine, even good (for performance). The real problem with being squiffy here is that it's not clear how weak we can make the fence instruction on weakly ordered architectures that support multiple types. Right now we're pretty conservative, but I think that may be costing us. I might be wrong; more research is needed here; but I think that we should at least start to get our head about what semantics we actually need. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
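To make the acquire/release asymmetry concrete, here is a small sketch of
the two unlock flavors being contrasted, using GCC builtins as stand-ins
rather than PostgreSQL's actual macros: an x86-style release that is just
a plain store (earlier stores cannot move past it, but a later,
independent load can be executed early), versus a conservative release
preceded by a full memory fence.

typedef unsigned int slock_t;

/*
 * x86-style release: just store zero.  Earlier stores stay before it,
 * but a later, non-dependent load may be satisfied before this store
 * becomes globally visible -- "almost" a full barrier, as noted above.
 */
static inline void
s_unlock_plain(volatile slock_t *lock)
{
	*lock = 0;
}

/*
 * Conservative release: a full memory fence first, so neither loads nor
 * stores can cross the unlock in either direction.
 */
static inline void
s_unlock_fenced(volatile slock_t *lock)
{
	__sync_synchronize();		/* full hardware and compiler barrier */
	*lock = 0;
}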
I wrote:
> I am also currently running tests on x86_64 and PPC using Red Hat test
> machines --- expect results later today.

OK, I ran some more tests.  These are not directly comparable to my
previous results with IA64, because (a) I used RHEL6.2 and gcc 4.4.6;
(b) I used half as many pgbench threads as backends, rather than one
thread per eight backends.  Testing showed that pgbench cannot saturate
more than two backends per thread in this test environment, as shown for
example by this series:

pgbench -c 8 -j 1 -S -T 300 bench     tps = 22091.461409 (including ...
pgbench -c 8 -j 2 -S -T 300 bench     tps = 42587.661755 (including ...
pgbench -c 8 -j 4 -S -T 300 bench     tps = 77515.057885 (including ...
pgbench -c 8 -j 8 -S -T 300 bench     tps = 75830.463821 (including ...

I find this entirely astonishing, BTW; the backend is surely doing far
more than twice as much work per query as pgbench.  We need to look into
why pgbench is apparently still such a dog.  However, that's not
tremendously relevant to the question of whether we need an unlocked
test in spinlocks.

These tests were run on a 32-CPU Opteron machine (Sun Fire X4600 M2,
8 quad-core sockets).  Test conditions the same as my IA64 set, except
for the OS and the -j switches:

Stock git head:

pgbench -c 1 -j 1 -S -T 300 bench     tps = 9515.435401 (including ...
pgbench -c 2 -j 1 -S -T 300 bench     tps = 20239.289880 (including ...
pgbench -c 8 -j 4 -S -T 300 bench     tps = 78628.371372 (including ...
pgbench -c 16 -j 8 -S -T 300 bench    tps = 143065.596555 (including ...
pgbench -c 32 -j 16 -S -T 300 bench   tps = 227349.424654 (including ...
pgbench -c 64 -j 32 -S -T 300 bench   tps = 269016.946095 (including ...
pgbench -c 96 -j 48 -S -T 300 bench   tps = 253884.095190 (including ...
pgbench -c 128 -j 64 -S -T 300 bench  tps = 269235.253012 (including ...

Non-locked test in TAS():

pgbench -c 1 -j 1 -S -T 300 bench     tps = 9316.195621 (including ...
pgbench -c 2 -j 1 -S -T 300 bench     tps = 19852.444846 (including ...
pgbench -c 8 -j 4 -S -T 300 bench     tps = 77701.546927 (including ...
pgbench -c 16 -j 8 -S -T 300 bench    tps = 138926.775553 (including ...
pgbench -c 32 -j 16 -S -T 300 bench   tps = 188485.669320 (including ...
pgbench -c 64 -j 32 -S -T 300 bench   tps = 253602.490286 (including ...
pgbench -c 96 -j 48 -S -T 300 bench   tps = 251181.310600 (including ...
pgbench -c 128 -j 64 -S -T 300 bench  tps = 260812.933702 (including ...

Non-locked test in TAS_SPIN() only:

pgbench -c 1 -j 1 -S -T 300 bench     tps = 9283.944739 (including ...
pgbench -c 2 -j 1 -S -T 300 bench     tps = 20213.208443 (including ...
pgbench -c 8 -j 4 -S -T 300 bench     tps = 78824.247744 (including ...
pgbench -c 16 -j 8 -S -T 300 bench    tps = 141027.072774 (including ...
pgbench -c 32 -j 16 -S -T 300 bench   tps = 201658.416366 (including ...
pgbench -c 64 -j 32 -S -T 300 bench   tps = 271035.843105 (including ...
pgbench -c 96 -j 48 -S -T 300 bench   tps = 261337.324585 (including ...
pgbench -c 128 -j 64 -S -T 300 bench  tps = 271272.921058 (including ...

So basically there is no benefit to the unlocked test on this hardware.
But it doesn't cost much either, which is odd because the last time we
did this type of testing, adding an unlocked test was a "huge loss" on
Opteron.  Apparently AMD improved their handling of the case, and/or the
other changes we've made change the usage pattern completely.

I am hoping to do a similar test on another machine with $bignum Xeon
processors, to see if Intel hardware reacts any differently.  But that
machine is in the Westford office which is currently without power, so
it will have to wait a few days.  (I can no longer get at either of the
machines cited in this mail, either, so if you want to see more test
cases it'll have to wait.)

These tests were run on a 32-processor PPC64 machine (IBM 8406-71Y,
POWER7 architecture; I think it might be 16 cores with hyperthreading,
not sure).  The machine has "only" 6GB of RAM so I set shared_buffers to
4GB, other test conditions the same:

Stock git head:

pgbench -c 1 -j 1 -S -T 300 bench     tps = 8746.076443 (including ...
pgbench -c 2 -j 1 -S -T 300 bench     tps = 12297.297308 (including ...
pgbench -c 8 -j 4 -S -T 300 bench     tps = 48697.392492 (including ...
pgbench -c 16 -j 8 -S -T 300 bench    tps = 94133.227472 (including ...
pgbench -c 32 -j 16 -S -T 300 bench   tps = 126822.857978 (including ...
pgbench -c 64 -j 32 -S -T 300 bench   tps = 129364.417801 (including ...
pgbench -c 96 -j 48 -S -T 300 bench   tps = 125728.697772 (including ...
pgbench -c 128 -j 64 -S -T 300 bench  tps = 131566.394880 (including ...

Non-locked test in TAS():

pgbench -c 1 -j 1 -S -T 300 bench     tps = 8810.484890 (including ...
pgbench -c 2 -j 1 -S -T 300 bench     tps = 12336.612804 (including ...
pgbench -c 8 -j 4 -S -T 300 bench     tps = 49023.435650 (including ...
pgbench -c 16 -j 8 -S -T 300 bench    tps = 96306.706556 (including ...
pgbench -c 32 -j 16 -S -T 300 bench   tps = 131731.475778 (including ...
pgbench -c 64 -j 32 -S -T 300 bench   tps = 133451.416612 (including ...
pgbench -c 96 -j 48 -S -T 300 bench   tps = 110076.269474 (including ...
pgbench -c 128 -j 64 -S -T 300 bench  tps = 111339.797242 (including ...

Non-locked test in TAS_SPIN() only:

pgbench -c 1 -j 1 -S -T 300 bench     tps = 8726.269726 (including ...
pgbench -c 2 -j 1 -S -T 300 bench     tps = 12228.415466 (including ...
pgbench -c 8 -j 4 -S -T 300 bench     tps = 48227.623829 (including ...
pgbench -c 16 -j 8 -S -T 300 bench    tps = 93302.510254 (including ...
pgbench -c 32 -j 16 -S -T 300 bench   tps = 130661.097475 (including ...
pgbench -c 64 -j 32 -S -T 300 bench   tps = 133009.181697 (including ...
pgbench -c 96 -j 48 -S -T 300 bench   tps = 128710.757986 (including ...
pgbench -c 128 -j 64 -S -T 300 bench  tps = 133063.460934 (including ...

So basically no value to an unlocked test on this platform either.

			regards, tom lane
Robert Haas <robertmhaas@gmail.com> writes: > I'm actually not convinced that we're entirely consistent here about > what we require the semantics of acquiring and releasing a spinlock to > be. For example, on x86 and x86_64, we acquire the lock using xchgb, > which acts a full memory barrier. But when we release the lock, we > just zero out the memory address, which is NOT a full memory barrier. > Stores can't cross it, but non-dependent loads of different locations > can back up over it. That's pretty close to a full barrier, but it > isn't, quite. Right. That's why I wrote the comment as I did; it says what the actual requirement is. There probably are cases where our implementations are more restrictive than necessary (I hope none where they are weaker). regards, tom lane
On Mon, Aug 29, 2011 at 2:15 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > These tests were run on a 32-CPU Opteron machine (Sun Fire X4600 M2, > 8 quad-core sockets). Test conditions the same as my IA64 set, except > for the OS and the -j switches: > > Stock git head: > > pgbench -c 1 -j 1 -S -T 300 bench tps = 9515.435401 (including ... > pgbench -c 32 -j 16 -S -T 300 bench tps = 227349.424654 (including ... > > These tests were run on a 32-processor PPC64 machine (IBM 8406-71Y, > POWER7 architecture; I think it might be 16 cores with hyperthreading, > not sure). The machine has "only" 6GB of RAM so I set shared_buffers to > 4GB, other test conditions the same: > > Stock git head: > > pgbench -c 1 -j 1 -S -T 300 bench tps = 8746.076443 (including ... > pgbench -c 32 -j 16 -S -T 300 bench tps = 126822.857978 (including ... Stepping beyond the immediate issue of whether we want an unlocked test in there or not (and I agree that based on these numbers we don't), there's a clear and puzzling difference between those sets of numbers. The Opteron test is showing 32 clients getting about 23.9 times the throughput of a single client, which is not exactly linear but is at least respectable, whereas the PPC64 test is showing 32 clients getting just 14.5 times the throughput of a single client, which is pretty significantly less good. Moreover, cranking it up to 64 clients is squeezing a significant amount of additional work out on Opteron, but not on PPC64. The HP-UX/Itanium numbers in my OP give a ratio of 17.3x - a little better than your PPC64 numbers, but certainly not great. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Greg Stark <stark@mit.edu> writes:
> I was going to say the same thing as Tom that sequence points and
> volatile pointers have nothing at all to do with each other.  However
> my brief searching online actually seemed to indicate that in fact the
> compiler isn't supposed to reorder volatile memory accesses across
> sequence points.  That seemed to make sense since I couldn't think of
> any other way to rigorously describe the constraints the compiler
> should operate under.

It's a bit confusing.  I agree that if the code is written such that
there are two volatile accesses with no intervening sequence point, the
compiler is at liberty to do them in either order; for instance in

	foo(*x, *y);

there are no guarantees about which value is fetched first, even if x
and y are volatile-qualified.  What's bothering me is that in, say,

	*x = 0;
	*y = 1;
	*z = 2;

if x and z are volatile-qualified but y isn't, I believe the compilers
think they are only required to store into *x before *z, and can reorder
the store to *y around either of the others.  So this makes the notion
of a sequence point pretty squishy in itself.

			regards, tom lane
Robert Haas <robertmhaas@gmail.com> wrote:
> Stepping beyond the immediate issue of whether we want an unlocked
> test in there or not (and I agree that based on these numbers we
> don't), there's a clear and puzzling difference between those sets
> of numbers.  The Opteron test is showing 32 clients getting about
> 23.9 times the throughput of a single client, which is not exactly
> linear but is at least respectable, whereas the PPC64 test is
> showing 32 clients getting just 14.5 times the throughput of a
> single client, which is pretty significantly less good.  Moreover,
> cranking it up to 64 clients is squeezing a significant amount of
> additional work out on Opteron, but not on PPC64.  The
> HP-UX/Itanium numbers in my OP give a ratio of 17.3x - a little
> better than your PPC64 numbers, but certainly not great.

I wouldn't make too much of that without comparing to a STREAM test
(properly configured -- the default array size is likely not to be
large enough for these machines).  On a recently delivered 32 core
machine with 256 GB RAM, I saw numbers like this for just RAM access:

Threads        Copy       Scale         Add       Triad
      1   3332.3721   3374.8146   4500.1954   4309.7392
      2   5822.8107   6158.4621   8563.3236   7756.9050
      4  12474.9646  12282.3401  16960.7216  15399.2406
      8  22353.6013  23502.4389  31911.5206  29574.8124
     16  35760.8782  40946.6710  49108.4386  49264.6576
     32  47567.3882  44935.4608  52983.9355  52278.1373
     64  48428.9939  47851.7320  54141.8830  54560.0520
    128  49354.4303  49586.6092  55663.2606  57489.5798
    256  45081.3601  44303.1032  49561.3815  50073.3530
    512  42271.9688  41689.8609  47362.4190  46388.9720

Graph attached for those who are visually inclined and have support for
the display of JPEG files.

Note that this is a machine which is *not* configured to be blazingly
fast for a single connection, but to scale up well for a large number
of concurrent processes:

http://www.redbooks.ibm.com/redpapers/pdfs/redp4650.pdf

Unless your benchmarks are falling off a lot faster than the STREAM
test on that hardware, I wouldn't worry.

-Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Robert Haas <robertmhaas@gmail.com> wrote: >> Stepping beyond the immediate issue of whether we want an unlocked >> test in there or not (and I agree that based on these numbers we >> don't), there's a clear and puzzling difference between those sets >> of numbers. The Opteron test is showing 32 clients getting about >> 23.9 times the throughput of a single client, which is not exactly >> linear but is at least respectable, whereas the PPC64 test is >> showing 32 clients getting just 14.5 times the throughput of a >> single client, which is pretty significantly less good. > I wouldn't make too much of that without comparing to a STREAM test > (properly configured -- the default array size is likely not to be > large enough for these machines). Yeah. One point I didn't mention is that the Opteron machine's memory is split across 8 NUMA nodes, whereas the PPC machine's isn't. I would bet there's a significant difference in aggregate available memory bandwidth. Also, if the PPC machine really is hyperthreaded (the internal webpage for it says "Hyper? True" but /proc/cpuinfo doesn't provide any clear indications), that might mean it's not going to scale too well past 16x the single-thread case. regards, tom lane
Sorry, forgot to cc the list. On Mon, Aug 29, 2011 at 10:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Also, if the PPC machine really is hyperthreaded (the internal webpage > for it says "Hyper? True" but /proc/cpuinfo doesn't provide any clear > indications), that might mean it's not going to scale too well past 16x > the single-thread case. According to IBM docs [1], 8406-71Y contains one 8 core POWER7 chip that is 4-way multi-threaded and has 4 memory channels. X4600M2 should have 16 memory channels, although at 2/3 the transfer rate. 6GB of memory is a strange amount for the IBM, according to specs it should take 4 or 8GB DIMMs in pairs. Sounds like the server is split into multiple partitions. -- Ants Aasma [1] http://www.redbooks.ibm.com/redpapers/pdfs/redp4655.pdf
Ants Aasma <ants.aasma@eesti.ee> writes: > On Mon, Aug 29, 2011 at 10:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Also, if the PPC machine really is hyperthreaded (the internal webpage >> for it says "Hyper? True" but /proc/cpuinfo doesn't provide any clear >> indications), that might mean it's not going to scale too well past 16x >> the single-thread case. > According to IBM docs [1], 8406-71Y contains one 8 core POWER7 chip > that is 4-way multi-threaded and has 4 memory channels. Yeah, I looked at the docs. "Multi threading" is IBM's term for hyperthreading, that is several instruction streams competing for use of a single processor core's pipelines. So the 32 virtual processors on the machine only really represent 8 physically independent cores, squaring with the hardware designation. I found an IBM doc http://www-03.ibm.com/systems/resources/pwrsysperf_SMT4OnP7.pdf suggesting that the throughput benefit of 4-way SMT is typically 1.5 to 2X, that is you max out at 1.5 to 2X as much work as you'd get with just 8 virtual processors on the same 8 cores. So I'd say we're really doing quite well to get the numbers I got. (The paper also implies that you get more benefit from SMT with workloads that incur more memory-access stalls, so the relatively large working set of this test case is helping it look good.) > 6GB of > memory is a strange amount for the IBM, according to specs it should take > 4 or 8GB DIMMs in pairs. Sounds like the server is split into multiple > partitions. I'm confused about that too. There definitely seemed to be only 6GB of available RAM, but there's no way I can see that memory might be partitioned across different blades. The blades look pretty independent ... regards, tom lane
I wrote:
> I am hoping to do a similar test on another machine with $bignum Xeon
> processors, to see if Intel hardware reacts any differently.  But that
> machine is in the Westford office which is currently without power,
> so it will have to wait a few days.

OK, the lights are on again in Westford, so here are some results from
an 8-socket Fujitsu PRIMEQUEST 1800 with 10-core Xeon E7-8870
processors, hyperthreading enabled for a total of 160 virtual
processors.  All test conditions the same as from my Opteron runs
yesterday, except just for the heck of it I ran it up to 160 backends.

Stock git head (of a couple of days ago now):

pgbench -c 1 -j 1 -S -T 300 bench     tps = 4401.589257 (including ...
pgbench -c 2 -j 1 -S -T 300 bench     tps = 8585.789827 (including ...
pgbench -c 8 -j 4 -S -T 300 bench     tps = 36315.227334 (including ...
pgbench -c 16 -j 8 -S -T 300 bench    tps = 73841.195884 (including ...
pgbench -c 32 -j 16 -S -T 300 bench   tps = 155309.526039 (including ...
pgbench -c 64 -j 32 -S -T 300 bench   tps = 77477.101725 (including ...
pgbench -c 96 -j 48 -S -T 300 bench   tps = 41301.481915 (including ...
pgbench -c 128 -j 64 -S -T 300 bench  tps = 30443.815506 (including ...
pgbench -c 160 -j 80 -S -T 300 bench  tps = 24600.584202 (including ...

Non-locked test in TAS():

pgbench -c 1 -j 1 -S -T 300 bench     tps = 4412.336573 (including ...
pgbench -c 2 -j 1 -S -T 300 bench     tps = 8739.900806 (including ...
pgbench -c 8 -j 4 -S -T 300 bench     tps = 32957.710818 (including ...
pgbench -c 16 -j 8 -S -T 300 bench    tps = 71538.032629 (including ...
pgbench -c 32 -j 16 -S -T 300 bench   tps = 153892.469308 (including ...
pgbench -c 64 -j 32 -S -T 300 bench   tps = 127786.277182 (including ...
pgbench -c 96 -j 48 -S -T 300 bench   tps = 92108.895423 (including ...
pgbench -c 128 -j 64 -S -T 300 bench  tps = 75382.131814 (including ...
pgbench -c 160 -j 80 -S -T 300 bench  tps = 67277.057981 (including ...

Non-locked test in TAS_SPIN() only:

pgbench -c 1 -j 1 -S -T 300 bench     tps = 4006.626861 (including ...
pgbench -c 2 -j 1 -S -T 300 bench     tps = 9020.124850 (including ...
pgbench -c 8 -j 4 -S -T 300 bench     tps = 36507.582318 (including ...
pgbench -c 16 -j 8 -S -T 300 bench    tps = 69668.921550 (including ...
pgbench -c 32 -j 16 -S -T 300 bench   tps = 150886.395754 (including ...
pgbench -c 64 -j 32 -S -T 300 bench   tps = 216697.745497 (including ...
pgbench -c 96 -j 48 -S -T 300 bench   tps = 171013.266643 (including ...
pgbench -c 128 -j 64 -S -T 300 bench  tps = 115205.718495 (including ...
pgbench -c 160 -j 80 -S -T 300 bench  tps = 92073.704665 (including ...

This suggests that (1) an unlocked test in TAS_SPIN might be a good idea
on x86_64 after all, and (2) this test scenario may not be pushing the
system hard enough to expose limitations of the spinlock implementation.

I am now thinking that the reason we saw clear differences in spinlock
implementations years ago, and now are not seeing them except on insane
hardware, is mainly that we've managed to reduce contention at higher
levels of the system.  That doesn't mean spinlocks have become
uninteresting, just that "pgbench -S" isn't the ideal test case for
stressing them.  I'm thinking maybe we need a test scenario that
generates sinval traffic, for example, or forces snapshots to be taken
more often.  Ideas anyone?

			regards, tom lane
On Tue, Aug 30, 2011 at 4:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > This suggests that (1) an unlocked test in TAS_SPIN might be a good idea > on x86_64 after all, and (2) this test scenario may not be pushing the > system hard enough to expose limitations of the spinlock implementation. > > I am now thinking that the reason we saw clear differences in spinlock > implementations years ago, and now are not seeing them except on insane > hardware, is mainly that we've managed to reduce contention at higher > levels of the system. That doesn't mean spinlocks have become > uninteresting, just that "pgbench -S" isn't the ideal test case for > stressing them. I'm thinking maybe we need a test scenario that > generates sinval traffic, for example, or forces snapshots to be taken > more often. Ideas anyone? On current sources, with a workload that fits into shared_buffers, pgbench -S hammers the spinlock protecting ProcArrayLock extremely hard. I'm sure it's possible to come up with a test case that hammers them harder, but using a real workload can expose issues (like aggregate memory bandwidth) that you might not see otherwise. I am a bit surprised by your test results, because I also tried x86_64 with an unlocked test, also on pgbench -S, and I am pretty sure I got a regression. Maybe I'll try rerunning that. It seems possible that the x86_64 results depend on the particular sub-architecture and/or whether HT is in use, which would be kind of a nuisance. Also, did you happen to measure the amount of user time vs. system time that your test runs used? If this is on Linux, I am surprised that you didn't get killed by the lseek() contention problem on a machine with that many cores. I found it to be visible at 32 and crippling at 64, so I can't even imagine what it would be like at 160. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I am a bit surprised by your test results, because I also tried x86_64 > with an unlocked test, also on pgbench -S, and I am pretty sure I got > a regression. Maybe I'll try rerunning that. It seems possible that > the x86_64 results depend on the particular sub-architecture and/or > whether HT is in use, which would be kind of a nuisance. Well, if you consider Opteron as a sub-architecture of x86_64, that was already true the last time we did this. So far there have not been cases where something really good for one implementation was really bad for another, but someday we'll probably hit that. > Also, did you happen to measure the amount of user time vs. system > time that your test runs used? Did not think about that. I was considering how to measure the average context swap rate over each run, so that we could keep an eye out for the "context swap storm" behavior that's the usual visible-in-top symptom of these sorts of problems. But it'd have to be automated; I'm not going to keep my eyes glued to "top" output for several hours. I'd be happy to re-run these tests with any RHEL-compatible measurement scaffolding somebody else provides, but if I have to write it, it probably won't happen very soon. > If this is on Linux, I am surprised > that you didn't get killed by the lseek() contention problem on a > machine with that many cores. Hm ... now that you mention it, all of these tests have been using the latest-and-greatest unreleased RHEL kernels. Maybe Red Hat already fixed that contention problem in their kernel? Have you got a RH bugzilla number for the issue? regards, tom lane
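A minimal piece of scaffolding for the context-swap-rate idea mentioned above, assuming a Linux box where /proc/stat exposes the cumulative "ctxt" counter, could simply snapshot that counter around each run. This is only a sketch, not something anyone has run, and the pgbench command line is a placeholder:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    /* Return total context switches since boot, from the "ctxt" line of /proc/stat. */
    static long long
    read_ctxt(void)
    {
        FILE       *f = fopen("/proc/stat", "r");
        char        line[256];
        long long   ctxt = -1;

        if (f == NULL)
            return -1;
        while (fgets(line, sizeof(line), f) != NULL)
        {
            if (strncmp(line, "ctxt ", 5) == 0)
            {
                ctxt = atoll(line + 5);
                break;
            }
        }
        fclose(f);
        return ctxt;
    }

    int
    main(void)
    {
        long long   before = read_ctxt();
        time_t      start = time(NULL);

        /* run one benchmark iteration; any of the pgbench invocations above would do */
        system("pgbench -c 64 -j 32 -S -T 300 bench");

        printf("avg context switches/sec: %.0f\n",
               (double) (read_ctxt() - before) / (double) (time(NULL) - start));
        return 0;
    }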
On Tue, Aug 30, 2011 at 4:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> If this is on Linux, I am surprised >> that you didn't get killed by the lseek() contention problem on a >> machine with that many cores. > > Hm ... now that you mention it, all of these tests have been using > the latest-and-greatest unreleased RHEL kernels. Maybe Red Hat already > fixed that contention problem in their kernel? Have you got a RH > bugzilla number for the issue? No, I haven't had much luck filing bugs against Red Hat releases, so I've sort of given up on that. I did have some off-list correspondence with a Red Hat engineer who read my blog post, though. In any case, it should be pretty easy to figure out. Just fire up pgbench with lots of clients (say, 160) and run vmstat in another window. If the machine reports 10% system time, it's fixed. If it reports 90% system time, it's not. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Aug 30, 2011 at 4:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> If this is on Linux, I am surprised >>> that you didn't get killed by the lseek() contention problem on a >>> machine with that many cores. >> Hm ... now that you mention it, all of these tests have been using >> the latest-and-greatest unreleased RHEL kernels. > It should be pretty easy to figure it out, though. Just fire up > pgbench with lots of clients (say, 160) and run vmstat in another > window. If the machine reports 10% system time, it's fixed. If it > reports 90% system time, it's not. I ran it up to "pgbench -c 200 -j 200 -S -T 300 bench" and still see vmstat numbers around 50% user time, 12% system time, 38% idle. So no lseek problem here, boss. Kernel calls itself 2.6.32-192.el6.x86_64. regards, tom lane
On Tue, Aug 30, 2011 at 6:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Aug 30, 2011 at 4:37 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>>> If this is on Linux, I am surprised >>>> that you didn't get killed by the lseek() contention problem on a >>>> machine with that many cores. > >>> Hm ... now that you mention it, all of these tests have been using >>> the latest-and-greatest unreleased RHEL kernels. > >> It should be pretty easy to figure it out, though. Just fire up >> pgbench with lots of clients (say, 160) and run vmstat in another >> window. If the machine reports 10% system time, it's fixed. If it >> reports 90% system time, it's not. > > I ran it up to "pgbench -c 200 -j 200 -S -T 300 bench" and still see > vmstat numbers around 50% user time, 12% system time, 38% idle. > So no lseek problem here, boss. Kernel calls itself 2.6.32-192.el6.x86_64. Eh, wait a minute. 38% idle time? Did you use a scale factor that doesn't fit in shared_buffers? If so you're probably testing how fast you pass BufFreelistLock around... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Aug 30, 2011 at 6:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I ran it up to "pgbench -c 200 -j 200 -S -T 300 bench" and still see >> vmstat numbers around 50% user time, 12% system time, 38% idle. >> So no lseek problem here, boss. Kernel calls itself 2.6.32-192.el6.x86_64. > Eh, wait a minute. 38% idle time? Did you use a scale factor that > doesn't fit in shared_buffers? Nope: -s 100, 8GB shared_buffers, same as all the other tests. Typical strace of one backend looks like recvfrom(9, "Q\0\0\0?SELECT abalance FROM pgbenc"..., 8192, 0, NULL, NULL) = 64 lseek(10, 0, SEEK_END) = 269213696 lseek(11, 0, SEEK_END) = 224641024 sendto(9, "T\0\0\0!\0\1abalance\0\0\0\241\267\0\3\0\0\0\27\0\4\377\377\377\377"..., 66, 0, NULL, 0) = 66 recvfrom(9, "Q\0\0\0?SELECT abalance FROM pgbenc"..., 8192, 0, NULL, NULL) = 64 lseek(10, 0, SEEK_END) = 269213696 lseek(11, 0, SEEK_END) = 224641024 sendto(9, "T\0\0\0!\0\1abalance\0\0\0\241\267\0\3\0\0\0\27\0\4\377\377\377\377"..., 66, 0, NULL, 0) = 66 recvfrom(9, "Q\0\0\0?SELECT abalance FROM pgbenc"..., 8192, 0, NULL, NULL) = 64 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) lseek(10, 0, SEEK_END) = 269213696 lseek(11, 0, SEEK_END) = 224641024 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) sendto(9, "T\0\0\0!\0\1abalance\0\0\0\241\267\0\3\0\0\0\27\0\4\377\377\377\377"..., 66, 0, NULL, 0) = 66 recvfrom(9, "Q\0\0\0?SELECT abalance FROM pgbenc"..., 8192, 0, NULL, NULL) = 64 lseek(10, 0, SEEK_END) = 269213696 lseek(11, 0, SEEK_END) = 224641024 select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) sendto(9, "T\0\0\0!\0\1abalance\0\0\0\241\267\0\3\0\0\0\27\0\4\377\377\377\377"..., 66, 0, NULL, 0) = 66 No I/O anywhere. I'm thinking the reported idle time must correspond to spinlock delays that are long enough to reach the select() calls in s_lock. If so, 38% is depressingly high, but it's not out of line with what we've seen in the past in tests designed to provoke spinlock contention. (BTW, this is with the unlocked test added to TAS_SPIN.) regards, tom lane
On Tue, Aug 30, 2011 at 7:21 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Tue, Aug 30, 2011 at 6:33 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> I ran it up to "pgbench -c 200 -j 200 -S -T 300 bench" and still see >>> vmstat numbers around 50% user time, 12% system time, 38% idle. >>> So no lseek problem here, boss. Kernel calls itself 2.6.32-192.el6.x86_64. > >> Eh, wait a minute. 38% idle time? Did you use a scale factor that >> doesn't fit in shared_buffers? > > Nope: -s 100, 8GB shared_buffers, same as all the other tests. > > Typical strace of one backend looks like > > recvfrom(9, "Q\0\0\0?SELECT abalance FROM pgbenc"..., 8192, 0, NULL, NULL) = 64 > lseek(10, 0, SEEK_END) = 269213696 > lseek(11, 0, SEEK_END) = 224641024 > sendto(9, "T\0\0\0!\0\1abalance\0\0\0\241\267\0\3\0\0\0\27\0\4\377\377\377\377"..., 66, 0, NULL, 0) = 66 > recvfrom(9, "Q\0\0\0?SELECT abalance FROM pgbenc"..., 8192, 0, NULL, NULL) = 64 > lseek(10, 0, SEEK_END) = 269213696 > lseek(11, 0, SEEK_END) = 224641024 > sendto(9, "T\0\0\0!\0\1abalance\0\0\0\241\267\0\3\0\0\0\27\0\4\377\377\377\377"..., 66, 0, NULL, 0) = 66 > recvfrom(9, "Q\0\0\0?SELECT abalance FROM pgbenc"..., 8192, 0, NULL, NULL) = 64 > select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) > lseek(10, 0, SEEK_END) = 269213696 > lseek(11, 0, SEEK_END) = 224641024 > select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) > select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) > sendto(9, "T\0\0\0!\0\1abalance\0\0\0\241\267\0\3\0\0\0\27\0\4\377\377\377\377"..., 66, 0, NULL, 0) = 66 > recvfrom(9, "Q\0\0\0?SELECT abalance FROM pgbenc"..., 8192, 0, NULL, NULL) = 64 > lseek(10, 0, SEEK_END) = 269213696 > lseek(11, 0, SEEK_END) = 224641024 > select(0, NULL, NULL, NULL, {0, 1000}) = 0 (Timeout) > sendto(9, "T\0\0\0!\0\1abalance\0\0\0\241\267\0\3\0\0\0\27\0\4\377\377\377\377"..., 66, 0, NULL, 0) = 66 > > No I/O anywhere. I'm thinking the reported idle time must correspond to > spinlock delays that are long enough to reach the select() calls in > s_lock. If so, 38% is depressingly high, but it's not out of line with > what we've seen in the past in tests designed to provoke spinlock > contention. > > (BTW, this is with the unlocked test added to TAS_SPIN.) Well, that is mighty interesting. That strace looks familiar, but I have never seen a case where the idle time was more than a few percentage points on this test (well, assuming you're using 9.2 sources, anyway). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I wrote: > No I/O anywhere. I'm thinking the reported idle time must correspond to > spinlock delays that are long enough to reach the select() calls in > s_lock. If so, 38% is depressingly high, but it's not out of line with > what we've seen in the past in tests designed to provoke spinlock > contention. I tried increasing MAX_SPINS_PER_DELAY from 1000 to 10000. (Again, this is with the unlocked test added to TAS_SPIN.) This resulted in a very significant drop in the reported idle-time percentage, down to 10% or so at full load; but unfortunately the TPS numbers got worse for the higher end of the curve: pgbench -c 1 -j 1 -S -T 300 bench tps = 4526.914824 (including ... pgbench -c 2 -j 1 -S -T 300 bench tps = 8183.815526 (including ... pgbench -c 8 -j 4 -S -T 300 bench tps = 34637.075173 (including ... pgbench -c 16 -j 8 -S -T 300 bench tps = 68792.550304 (including ... pgbench -c 32 -j 16 -S -T 300 bench tps = 159195.038317 (including ... pgbench -c 64 -j 32 -S -T 300 bench tps = 220544.912947 (including ... pgbench -c 96 -j 48 -S -T 300 bench tps = 147367.793544 (including ... pgbench -c 128 -j 64 -S -T 300 bench tps = 79187.042252 (including ... pgbench -c 160 -j 80 -S -T 300 bench tps = 43957.912879 (including ... So that confirms the idea that the reported idle time corresponds to s_lock select() sleeps. Unfortunately, it doesn't appear to lead to anything that would result in increasing performance. I suppose the reason that performance gets worse, even though we've presumably eliminated some process context swaps, is that we have more cache line contention for whichever spinlock(s) they're all fighting over. regards, tom lane
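For reference, the knob being twiddled here sits in a loop of roughly this shape; this is a simplified sketch of the contended path in s_lock.c, not the verbatim source:

    while (TAS_SPIN(lock))
    {
        SPIN_DELAY();
        if (++spins >= spins_per_delay)   /* adaptive, capped by MAX_SPINS_PER_DELAY */
        {
            pg_usleep(cur_delay);         /* these sleeps are the select() timeouts in
                                           * the strace above, i.e. the reported idle time */
            spins = 0;
            /* cur_delay then grows for the next sleep */
        }
    }

Raising MAX_SPINS_PER_DELAY just lets each backend burn more cycles (and more traffic on the contended cache line) before it reaches the sleep, which fits the numbers above: less idle time, but no more useful work done.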
Hi, I am interested in this thread because I may be able to borrow a big IBM machine and might be able to do some tests on it if it somewhat contributes enhancing PostgreSQL. Is there anything I can do for this? -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
On Tue, Sep 6, 2011 at 4:33 AM, Tatsuo Ishii <ishii@postgresql.org> wrote: > I am interested in this thread because I may be able to borrow a big > IBM machine and might be able to do some tests on it if it somewhat > contributes enhancing PostgreSQL. Is there anything I can do for this? That would be great. What I've been using as a test case is pgbench -S -c $NUM_CPU_CORES -j $NUM_CPU_CORES with scale factor 100 and shared_buffers=8GB. I think what you'd want to compare is the performance of unpatched master, vs. the performance with this line added to s_lock.h for your architecture: #define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock)) We've now added that line for ia64 (the line is present in two different places in the file, one for GCC and the other for HP's compiler). So the question is whether we need it for any other architectures. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> That would be great. What I've been using as a test case is pgbench > -S -c $NUM_CPU_CORES -j $NUM_CPU_CORES with scale factor 100 and > shared_buffers=8GB. > > I think what you'd want to compare is the performance of unpatched > master, vs. the performance with this line added to s_lock.h for your > architecture: > > #define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock)) > > We've now added that line for ia64 (the line is present in two > different places in the file, one for GCC and the other for HP's > compiler). So the question is whether we need it for any other > architectures. Ok. Let me talk to IBM guys... -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
>> That would be great. What I've been using as a test case is pgbench >> -S -c $NUM_CPU_CORES -j $NUM_CPU_CORES with scale factor 100 and >> shared_buffers=8GB. >> >> I think what you'd want to compare is the performance of unpatched >> master, vs. the performance with this line added to s_lock.h for your >> architecture: >> >> #define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock)) >> >> We've now added that line for ia64 (the line is present in two >> different places in the file, one for GCC and the other for HP's >> compiler). So the question is whether we need it for any other >> architectures. > > Ok. Let me talk to IBM guys... With help from IBM Japan Ltd. we did some tests on a larger IBM machine than the one Tom Lane used for his test (http://archives.postgresql.org/message-id/8292.1314641721@sss.pgh.pa.us). In his case it was an IBM 8406-71Y, which has 8 physical cores and 4-way SMT (32 threads). Ours is an IBM Power 750 Express, which has 32 physical cores and 4-way SMT (128 threads), plus 256GB RAM. The test method was the same as the one in the article above. The differences are the OS (RHEL 6.1), gcc version (4.4.5) and shared buffer size (8GB). We tested 3 methods to reduce spin lock contention: 1) Add the "hint" parameter to the lwarx op, which is usable on POWER6 or later architectures. 2) Add a non-locked test in TAS() 3) #1 + #2 We saw a small performance enhancement with #1, a larger one with #2, and even better results with #1+#2. Stock git head: pgbench -c 1 -j 1 -S -T 300 bench tps = 10356.306513 (including ... pgbench -c 2 -j 1 -S -T 300 bench tps = 21841.10285 (including ... pgbench -c 8 -j 4 -S -T 300 bench tps = 63800.868529 (including ... pgbench -c 16 -j 8 -S -T 300 bench tps = 144872.64726 (including ... pgbench -c 32 -j 16 -S -T 300 bench tps = 120943.238461 (including ... pgbench -c 64 -j 32 -S -T 300 bench tps = 108144.933981 (including ... pgbench -c 128 -j 64 -S -T 300 bench tps = 92202.782791 (including ... With hint (method #1): pgbench -c 1 -j 1 -S -T 300 bench tps = 11198.1872 (including ... pgbench -c 2 -j 1 -S -T 300 bench tps = 21390.592014 (including ... pgbench -c 8 -j 4 -S -T 300 bench tps = 74423.488089 (including ... pgbench -c 16 -j 8 -S -T 300 bench tps = 153766.351105 (including ... pgbench -c 32 -j 16 -S -T 300 bench tps = 134313.758113 (including ... pgbench -c 64 -j 32 -S -T 300 bench tps = 129392.154047 (including ... pgbench -c 128 -j 64 -S -T 300 bench tps = 105506.948058 (including ... Non-locked test in TAS() (method #2): pgbench -c 1 -j 1 -S -T 300 bench tps = 10537.893154 (including ... pgbench -c 2 -j 1 -S -T 300 bench tps = 22019.388666 (including ... pgbench -c 8 -j 4 -S -T 300 bench tps = 78763.930379 (including ... pgbench -c 16 -j 8 -S -T 300 bench tps = 142791.99724 (including ... pgbench -c 32 -j 16 -S -T 300 bench tps = 222008.903675 (including ... pgbench -c 64 -j 32 -S -T 300 bench tps = 209912.691058 (including ... pgbench -c 128 -j 64 -S -T 300 bench tps = 199437.23965 (including ... With hint and non-locked test in TAS (#1 + #2): pgbench -c 1 -j 1 -S -T 300 bench tps = 11419.881375 (including ... pgbench -c 2 -j 1 -S -T 300 bench tps = 21919.530209 (including ... pgbench -c 8 -j 4 -S -T 300 bench tps = 74788.242876 (including ... pgbench -c 16 -j 8 -S -T 300 bench tps = 156354.988564 (including ... pgbench -c 32 -j 16 -S -T 300 bench tps = 240521.495 (including ... pgbench -c 64 -j 32 -S -T 300 bench tps = 235709.272642 (including ... pgbench -c 128 -j 64 -S -T 300 bench tps = 220135.729663 (including ...
Since each core's usage is around 50% in the benchmark, there is room for further performance improvement by eliminating other contention, tuning compiler options, etc. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
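For anyone unfamiliar with method #1: the "hint" is the EH (exclusive access, or mutex, hint) bit that newer PowerISA revisions define for lwarx, telling the processor that the reservation is being taken in order to acquire a lock. In the inline assembly in s_lock.h it amounts to one extra operand on the lwarx line, roughly along these lines (operand numbering follows the existing ppc tas(), and may not match the actual patch exactly):

    -"	lwarx   %0,0,%3		\n"
    +"	lwarx   %0,0,%3,1	\n"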
On Tue, Oct 18, 2011 at 12:11 AM, Tatsuo Ishii <ishii@postgresql.org> wrote: >>> That would be great. What I've been using as a test case is pgbench >>> -S -c $NUM_CPU_CORES -j $NUM_CPU_CORES with scale factor 100 and >>> shared_buffers=8GB. >>> >>> I think what you'd want to compare is the performance of unpatched >>> master, vs. the performance with this line added to s_lock.h for your >>> architecture: >>> >>> #define TAS_SPIN(lock) (*(lock) ? 1 : TAS(lock)) >>> >>> We've now added that line for ia64 (the line is present in two >>> different places in the file, one for GCC and the other for HP's >>> compiler). So the question is whether we need it for any other >>> architectures. >> >> Ok. Let me talk to IBM guys... > > With help from IBM Japan Ltd. we did some tests on a larger IBM > machine than Tom Lane has used for his > test(http://archives.postgresql.org/message-id/8292.1314641721@sss.pgh.pa.us). > In his case it was IBM 8406-71Y, which has 8 physical cores and > 4SMT(32 threadings). Ours is IBM Power 750 Express, which has 32 > physical cores and 4SMT(128 threadings), 256GB RAM. > > The test method was same as the one in the article above. The > differences are OS(RHEL 6.1), gcc version (4.4.5) and shared buffer > size(8GB). > > We tested 3 methods to enhance spin lock contention: > > 1) Add "hint" parameter to lwarx op which is usable POWER6 or later > architecure. > > 2) Add non-locked test in TAS() > > 3) #1 + #2 > > We saw small performance enhancement with #1, larger one with #2 and > even better with #1+#2. Hmm, so you added the non-locked test in TAS()? Did you try adding it just to TAS_SPIN()? On Itanium, I found that it was slightly better to do it only in TAS_SPIN() - i.e. in the contended case. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
>> With help from IBM Japan Ltd. we did some tests on a larger IBM >> machine than Tom Lane has used for his >> test(http://archives.postgresql.org/message-id/8292.1314641721@sss.pgh.pa.us). >> In his case it was IBM 8406-71Y, which has 8 physical cores and >> 4SMT(32 threadings). Ours is IBM Power 750 Express, which has 32 >> physical cores and 4SMT(128 threadings), 256GB RAM. >> >> The test method was same as the one in the article above. The >> differences are OS(RHEL 6.1), gcc version (4.4.5) and shared buffer >> size(8GB). >> >> We tested 3 methods to enhance spin lock contention: >> >> 1) Add "hint" parameter to lwarx op which is usable POWER6 or later >> architecure. >> >> 2) Add non-locked test in TAS() >> >> 3) #1 + #2 >> >> We saw small performance enhancement with #1, larger one with #2 and >> even better with #1+#2. > > Hmm, so you added the non-locked test in TAS()? Did you try adding it > just to TAS_SPIN()? On Itanium, I found that it was slightly better > to do it only in TAS_SPIN() - i.e. in the contended case. The actual test was performed by one of the engineers at my company (Toshihiro Kitagawa). I think the answer to your question is yes, but let me talk to him to make sure. -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
On Tue, Oct 18, 2011 at 10:04 AM, Robert Haas <robertmhaas@gmail.com> wrote: > Hmm, so you added the non-locked test in TAS()? Did you try adding it > just to TAS_SPIN()? On Itanium, I found that it was slightly better > to do it only in TAS_SPIN() - i.e. in the contended case. > Would it be a good change for S_LOCK() to use TAS_SPIN() as well ? Thanks, Pavan -- Pavan Deolasee EnterpriseDB http://www.enterprisedb.com
On Tue, Oct 18, 2011 at 2:20 AM, Pavan Deolasee <pavan.deolasee@gmail.com> wrote: > On Tue, Oct 18, 2011 at 10:04 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> Hmm, so you added the non-locked test in TAS()? Did you try adding it >> just to TAS_SPIN()? On Itanium, I found that it was slightly better >> to do it only in TAS_SPIN() - i.e. in the contended case. > > Would it be a good change for S_LOCK() to use TAS_SPIN() as well ? Well, that would be sort of missing the point of why we invented TAS_SPIN() in the first place. What we found on Itanium is that using the unlocked test always was better than never doing it, but what was even slightly better was to use the unlocked first test *only when spinning*. In other words, on the very first go-around, we use the atomic instruction right away. Only if that fails do we switch to using the unlocked test first. Now it's possible that on some other architecture it's better to do the unlocked test first every time. But it seems somewhat unlikely, because in the hopefully-common case where the spinlock is uncontended, it's just a waste. If you're having enough spinlock contention that the first TAS() is failing frequently, you need to fix the underlying cause of the spinlock contention... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
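In code form, the division of labor being described here is roughly the following (a sketch only; the real S_LOCK/s_lock definitions carry more bookkeeping, and the TAS_SPIN line is the one quoted earlier in the thread):

    /* uncontended path: the very first attempt goes straight for the atomic op */
    #define S_LOCK(lock) \
        do { \
            if (TAS(lock)) \
                s_lock((lock), __FILE__, __LINE__); \
        } while (0)

    /* contended path, inside s_lock(): spin with the cheap unlocked pre-check */
    while (TAS_SPIN(lock))
    {
        SPIN_DELAY();
        /* ... back off and eventually sleep ... */
    }

So making S_LOCK() itself use TAS_SPIN() would put the unlocked read in front of every acquisition, which is essentially the "test in TAS()" variant that measured slightly worse on Itanium.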
>> With help from IBM Japan Ltd. we did some tests on a larger IBM >> machine than Tom Lane has used for his >> test(http://archives.postgresql.org/message-id/8292.1314641721@sss.pgh.pa.us). >> In his case it was IBM 8406-71Y, which has 8 physical cores and >> 4SMT(32 threadings). Ours is IBM Power 750 Express, which has 32 >> physical cores and 4SMT(128 threadings), 256GB RAM. >> >> The test method was same as the one in the article above. The >> differences are OS(RHEL 6.1), gcc version (4.4.5) and shared buffer >> size(8GB). >> >> We tested 3 methods to enhance spin lock contention: >> >> 1) Add "hint" parameter to lwarx op which is usable POWER6 or later >> architecure. >> >> 2) Add non-locked test in TAS() >> >> 3) #1 + #2 >> >> We saw small performance enhancement with #1, larger one with #2 and >> even better with #1+#2. > > Hmm, so you added the non-locked test in TAS()? Did you try adding it > just to TAS_SPIN()? On Itanium, I found that it was slightly better > to do it only in TAS_SPIN() - i.e. in the contended case. Here is a new patch using TAS_SPIN(), created by Manabu Ori from IBM Japan. This patch also deals with older Power architectures which do not have the "hint" argument to the lwarx opcode. According to him, the patch resulted in much better performance than stock git head. Stock git head without patch: pgbench -c 1 -j 1 -S -T 300 tps = 11360.472691 (including ... pgbench -c 2 -j 1 -S -T 300 tps = 22173.943133 (including ... pgbench -c 4 -j 2 -S -T 300 tps = 43397.331641 (including ... pgbench -c 8 -j 4 -S -T 300 tps = 73469.073714 (including ... pgbench -c 16 -j 8 -S -T 300 tps = 151094.270443 (including ... pgbench -c 32 -j 16 -S -T 300 tps = 166752.637452 (including ... pgbench -c 64 -j 32 -S -T 300 tps = 148139.338204 (including ... pgbench -c 128 -j 64 -S -T 300 tps = 115412.622895 (including ... Git head with patch: pgbench -c 1 -j 1 -S -T 300 tps = 11103.370854 (including ... pgbench -c 2 -j 1 -S -T 300 tps = 22118.907582 (including ... pgbench -c 4 -j 2 -S -T 300 tps = 42608.641820 (including ... pgbench -c 8 -j 4 -S -T 300 tps = 77592.862639 (including ... pgbench -c 16 -j 8 -S -T 300 tps = 150469.841892 (including ... pgbench -c 32 -j 16 -S -T 300 tps = 267726.082168 (including ... pgbench -c 64 -j 32 -S -T 300 tps = 322582.271713 (including ... pgbench -c 128 -j 64 -S -T 300 tps = 273071.683663 (including ... (Graph is attached) Test environment: Power 750 (32 physical cores, virtually 128 cores using SMT4) mem: 256GB OS: RHEL6.1 kernel 2.6.32-131.0.15.el6.ppc64 gcc version 4.4.5 20110214 (Red Hat 4.4.5-6) PostgreSQL Git head (0510b62d91151b9d8c1fe1aa15c9cf3ffe9bf25b) -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
On 28.12.2011 14:03, Tatsuo Ishii wrote: >>> With help from IBM Japan Ltd. we did some tests on a larger IBM >>> machine than Tom Lane has used for his >>> test(http://archives.postgresql.org/message-id/8292.1314641721@sss.pgh.pa.us). >>> In his case it was IBM 8406-71Y, which has 8 physical cores and >>> 4SMT(32 threadings). Ours is IBM Power 750 Express, which has 32 >>> physical cores and 4SMT(128 threadings), 256GB RAM. >>> >>> The test method was same as the one in the article above. The >>> differences are OS(RHEL 6.1), gcc version (4.4.5) and shared buffer >>> size(8GB). >>> >>> We tested 3 methods to enhance spin lock contention: >>> >>> 1) Add "hint" parameter to lwarx op which is usable POWER6 or later >>> architecure. >>> >>> 2) Add non-locked test in TAS() >>> >>> 3) #1 + #2 >>> >>> We saw small performance enhancement with #1, larger one with #2 and >>> even better with #1+#2. >> >> Hmm, so you added the non-locked test in TAS()? Did you try adding it >> just to TAS_SPIN()? On Itanium, I found that it was slightly better >> to do it only in TAS_SPIN() - i.e. in the contended case. > > Here is new patch using TAS_SPIN(), created by Manabu Ori from IBM > Japan. Also this patch deal with older Power architectures which do > not have "hint" argument of lwarx opcode. > > According to him, the patch resulted in much better performance stock > git head. Impressive results. config/c-compiler.m4 doesn't seem like the right place for the configure test. Would there be any harm in setting the lwarx hint always; what would happen on older ppc processors that don't support it? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
> On 28.12.2011 14:03, Tatsuo Ishii wrote: >>>> With help from IBM Japan Ltd. we did some tests on a larger IBM >>>> machine than Tom Lane has used for his >>>> test(http://archives.postgresql.org/message-id/8292.1314641721@sss.pgh.pa.us). >>>> In his case it was IBM 8406-71Y, which has 8 physical cores and >>>> 4SMT(32 threadings). Ours is IBM Power 750 Express, which has 32 >>>> physical cores and 4SMT(128 threadings), 256GB RAM. >>>> >>>> The test method was same as the one in the article above. The >>>> differences are OS(RHEL 6.1), gcc version (4.4.5) and shared buffer >>>> size(8GB). >>>> >>>> We tested 3 methods to enhance spin lock contention: >>>> >>>> 1) Add "hint" parameter to lwarx op which is usable POWER6 or later >>>> architecure. >>>> >>>> 2) Add non-locked test in TAS() >>>> >>>> 3) #1 + #2 >>>> >>>> We saw small performance enhancement with #1, larger one with #2 and >>>> even better with #1+#2. >>> >>> Hmm, so you added the non-locked test in TAS()? Did you try adding it >>> just to TAS_SPIN()? On Itanium, I found that it was slightly better >>> to do it only in TAS_SPIN() - i.e. in the contended case. >> >> Here is new patch using TAS_SPIN(), created by Manabu Ori from IBM >> Japan. Also this patch deal with older Power architectures which do >> not have "hint" argument of lwarx opcode. >> >> According to him, the patch resulted in much better performance stock >> git head. > > Impressive results. > > config/c-compiler.m4 doesn't seem like the right place for the > configure test. Would there be any harm in setting the lwarx hint > always; what would happen on older ppc processors that don't support > it? I think the load module just fails to run in this case, but I'd like to confirm. Ori-san? -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > config/c-compiler.m4 doesn't seem like the right place for the configure > test. Would there be any harm in setting the lwarx hint always; what > would happen on older ppc processors that don't support it? More to the point, a configure test only proves whether the build machine can deal with the flag, not whether the machine the executables will ultimately run on knows what the flag means. We cannot assume that the build and execution boxes are the same. (In general, AC_TRY_RUN tests are best avoided because of this.) regards, tom lane
2011/12/29 Tatsuo Ishii <ishii@postgresql.org> > > Impressive results. > > > > config/c-compiler.m4 doesn't seem like the right place for the > > configure test. Would there be any harm in setting the lwarx hint > > always; what would happen on older ppc processors that don't support > > it? > > I think the load module just fails to run in this case, but I'd like > to confirm. Ori-san? I don't know where is the right config/*.m4 to place this kind of configure test. Do you have any idea? I believe lwarx hint would be no harm for recent PowerPC processors. What I tested are: (1) Built postgres on POWER6 + RHEL5, which got lwarx hint included. Then copy these src tree to POWER5 + RHEL4 and run "make test", finished successfully. (2) Lwarx test in configure failed on POWER5 + RHEL4. Note that POWER6 understands lwarx hint and POWER5 doesn't. RHEL5 binutils supports lwarx hint and RHEL4 binutils doesn't. The only concern is for very old PowerPC. Referring to Power Instruction Set Architecture manual(*1), on some processors that precede PowerISA v2.00, executing lwarx with hint will cause the illegal instruction error. Lwarx test in configure should fail on these kind of processors, guessing from my test(2). (*1) p.689 of https://www.power.org/resources/downloads/PowerISA_V2.06B_V2_PUBLIC.pdf Regards, Manabu Ori
On 29.12.2011 04:36, Manabu Ori wrote: > I believe lwarx hint would be no harm for recent PowerPC processors. > What I tested are: > > (1) Built postgres on POWER6 + RHEL5, which got lwarx hint > included. Then copy these src tree to POWER5 + RHEL4 and > run "make test", finished successfully. > > (2) Lwarx test in configure failed on POWER5 + RHEL4. > > Note that POWER6 understands lwarx hint and POWER5 doesn't. > RHEL5 binutils supports lwarx hint and RHEL4 binutils doesn't. > > The only concern is for very old PowerPC. > Referring to Power Instruction Set Architecture manual(*1), on > some processors that precede PowerISA v2.00, executing lwarx with > hint will cause the illegal instruction error. > > Lwarx test in configure should fail on these kind of processors, > guessing from my test(2). The Linux kernel does this (arch/powerpc/include/asm/ppc-opcode.h): > 127 /* > 128 * Only use the larx hint bit on 64bit CPUs. e500v1/v2 based CPUs will treat a > 129 * larx with EH set as an illegal instruction. > 130 */ > 131 #ifdef CONFIG_PPC64 > 132 #define __PPC_EH(eh) (((eh) & 0x1) << 0) > 133 #else > 134 #define __PPC_EH(eh) 0 > 135 #endif We can't copy-paste code from Linux directly, and I'm not sure I like that particular phrasing of the macro, but perhaps we should steal the idea and only use the hint on 64-bit PowerPC processors? I presume all the processors that support the hint are 64-bit, so the question is, are there any 64-bit PowerPC processors that would get upset about it? It's quite arbitrary to tie it to the word length, but if it works as a dividing line in practice, I'm fine with it. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes: > The Linux kernel does this (arch/powerpc/include/asm/ppc-opcode.h): Yeah, I was looking at that too. > We can't copy-paste code from Linux directly, and I'm not sure I like > that particular phrasing of the macro, but perhaps we should steal the > idea and only use the hint on 64-bit PowerPC processors? The info that I've found says that the hint exists beginning in POWER6, and there were certainly 64-bit Power machines before that. However, it might be that the only machines that actually spit up on the hint bit (rather than ignore it) were 32-bit, in which case this would be a usable heuristic. Not sure how we can research that ... do we want to just assume the kernel guys know what they're doing? regards, tom lane
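If the 64-bit dividing line does hold up, one hypothetical way to phrase it on the PostgreSQL side (the macro name and the __powerpc64__ test are placeholders here, not anything committed) would be a compile-time string spliced into the lwarx template only on 64-bit PowerPC:

    /* Append the EH hint to lwarx only where we believe it is safe. */
    #if defined(__powerpc64__)
    #define PPC_LWARX_MUTEX_HINT	",1"	/* hint: the reservation is for a lock */
    #else
    #define PPC_LWARX_MUTEX_HINT	""	/* e.g. e500v1/v2 would trap on the EH bit */
    #endif

    /* ... and in tas():  "lwarx   %0,0,%3" PPC_LWARX_MUTEX_HINT "	\n" ... */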