
From: Robert Haas
Subject: spinlocks on HP-UX
Msg-id: CA+TgmoZvATZV+eLh3U35jaNnwwzLL5ewUU_-t0X=T0Qwas+ZdA@mail.gmail.com
List: pgsql-hackers

I was able to obtain access to a 32-core HP-UX server.  I repeated the
pgbench -S testing that I had previously done on Linux, and found
that the results were not too good.  Here are the results at scale
factor 100, on 9.2devel, with various numbers of clients.  Five-minute
runs, shared_buffers=8GB.

1:tps = 5590.070816 (including connections establishing)
8:tps = 37660.233932 (including connections establishing)
16:tps = 67366.099286 (including connections establishing)
32:tps = 82781.624665 (including connections establishing)
48:tps = 18589.995074 (including connections establishing)
64:tps = 16424.661371 (including connections establishing)

And just for comparison, here are the numbers at scale factor 1000:

1:tps = 4751.768608 (including connections establishing)
8:tps = 33621.474490 (including connections establishing)
16:tps = 58959.043171 (including connections establishing)
32:tps = 78801.265189 (including connections establishing)
48:tps = 21635.234969 (including connections establishing)
64:tps = 18611.863567 (including connections establishing)

After mulling over the vmstat output for a bit, I began to suspect
spinlock contention.  I took a look at a document called "Implementing
Spinlocks on the Intel Itanium Architecture and PA-RISC", by Tor
Ekqvist and David Graves, available via the HP web site, which
states that when spinning on a spinlock on these machines, you should
use a regular, unlocked test first and attempt the atomic test-and-set
only when the unlocked test shows the lock as apparently free.  I tried
implementing this in two ways, and both produced results which are FAR
superior to our current implementation.  First, I did this:

--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -726,7 +726,7 @@ tas(volatile slock_t *lock)
 typedef unsigned int slock_t;
 
 #include <ia64/sys/inline.h>
-#define TAS(lock) _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE)
+#define TAS(lock) (*(lock) ? 1 : _Asm_xchg(_SZ_W, lock, 1, _LDHINT_NONE))
 
 #endif	/* HPUX on IA64, non gcc */
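
For reference, the patched macro is just the "test and test-and-set"
idiom from that paper, spelled out.  Here is a minimal generic sketch
using C11 atomics (my illustration only; the real macro uses the HP
compiler's _Asm_xchg intrinsic directly, and slock_sketch_t and
tas_sketch are invented names):

#include <stdatomic.h>

typedef atomic_uint slock_sketch_t;	/* illustrative stand-in for slock_t */

/*
 * Returns nonzero if the lock was already held (acquisition failed),
 * matching the TAS() convention.
 */
static int
tas_sketch(volatile slock_sketch_t *lock)
{
	/*
	 * Plain read first: this spins in the local cache line and avoids
	 * bouncing the line between processors with atomic exchanges while
	 * the lock is visibly held by somebody else.
	 */
	if (atomic_load_explicit(lock, memory_order_relaxed) != 0)
		return 1;

	/* The lock looks free: now pay for the real atomic exchange. */
	return atomic_exchange(lock, 1) != 0;
}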

With that patch in place, the numbers looked like this.  Scale factor 100:

1:tps = 5569.911714 (including connections establishing)
8:tps = 37365.364468 (including connections establishing)
16:tps = 63596.261875 (including connections establishing)
32:tps = 95948.157678 (including connections establishing)
48:tps = 90708.253920 (including connections establishing)
64:tps = 100109.065744 (including connections establishing)

Scale factor 1000:

1:tps = 4878.332996 (including connections establishing)
8:tps = 33245.469907 (including connections establishing)
16:tps = 56708.424880 (including connections establishing)
48:tps = 69652.232635 (including connections establishing)
64:tps = 70593.208637 (including connections establishing)

Then, I did this:

--- a/src/backend/storage/lmgr/s_lock.c
+++ b/src/backend/storage/lmgr/s_lock.c
@@ -96,7 +96,7 @@ s_lock(volatile slock_t *lock, const char *file, int line)
 	int			delays = 0;
 	int			cur_delay = 0;
 
-	while (TAS(lock))
+	while (*lock ? 1 : TAS(lock))
 	{
 		/* CPU-specific delay each time through the loop */
 		SPIN_DELAY();

The second patch produced these numbers, at scale factor 100:

1:tps = 5564.059494 (including connections establishing)
8:tps = 37487.090798 (including connections establishing)
16:tps = 66061.524760 (including connections establishing)
32:tps = 96535.523905 (including connections establishing)
48:tps = 92031.618360 (including connections establishing)
64:tps = 106813.631701 (including connections establishing)

And at scale factor 1000:

1:tps = 4980.338246 (including connections establishing)
8:tps = 33576.680072 (including connections establishing)
16:tps = 55618.677975 (including connections establishing)
32:tps = 73589.442746 (including connections establishing)
48:tps = 70987.026228 (including connections establishing)

Not sure why I am missing the 64-client results for that last set of
tests, but no matter.

Of course, we can't apply the second patch as it stands, because I
tested it on x86 and it loses.  But it seems pretty clear we need to
do it at least for this architecture...
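
One way to confine the unlocked test to the architectures where it
wins would be an extra macro layer in s_lock.h, along these lines (a
sketch only; TAS_SPIN is a name I am inventing here, not existing
code):

/*
 * Per-architecture hook for the contended retry loop.  Ports that
 * benefit from an unlocked pre-test (IA64 HP-UX, per the numbers
 * above) define TAS_SPIN specially; everyone else keeps the current
 * behavior.
 */
#if defined(__hpux) && defined(__ia64)
#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))
#endif

#ifndef TAS_SPIN
#define TAS_SPIN(lock)	TAS(lock)
#endif

s_lock() would then spin on "while (TAS_SPIN(lock))" instead of
"while (TAS(lock))", leaving the fast path in S_LOCK() untouched.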

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

