From: Tom Lane
Subject: Re: spinlocks on HP-UX
Msg-id: 22039.1314573597@sss.pgh.pa.us
In response to: Re: spinlocks on HP-UX (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
I wrote:
> Yeah, I figured out that was probably what you meant a little while
> later.  I found a 64-CPU IA64 machine in Red Hat's test labs and am
> currently trying to replicate your results; report to follow.

OK, these results are from a 64-processor SGI IA64 machine (AFAICT, 64
independent sockets, no hyperthreading or any funny business), with
124GB of RAM spread across 32 NUMA nodes, running RHEL5.7 with gcc
4.1.2.  I built today's git head with --enable-debug (but not
--enable-cassert) and ran with all default configuration settings
except shared_buffers = 8GB and max_connections = 200.  The test
database was initialized at pgbench scale factor 100 (-s 100).  I did
not change the database between runs, but restarted the postmaster and
then did this to warm the caches a tad:

pgbench -c 1 -j 1 -S -T 30 bench
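
(For completeness, "initialized at -s 100" just means the usual pgbench
initialization step, roughly

pgbench -i -s 100 bench

where bench is the database name used in all the runs below.)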

Per-run pgbench parameters are as shown below --- note in particular
that I assigned one pgbench thread per 8 backends.

The numbers are fairly variable even with 5-minute runs; I did each
series twice so you can get a feeling for how much they vary.

Today's git head:

pgbench -c 1 -j 1 -S -T 300 bench    tps = 5835.213934 (including ...
pgbench -c 2 -j 1 -S -T 300 bench    tps = 8499.223161 (including ...
pgbench -c 8 -j 1 -S -T 300 bench    tps = 15197.126952 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 30803.255561 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 65795.356797 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 81644.914241 (including ...
pgbench -c 96 -j 12 -S -T 300 bench    tps = 40059.202836 (including ...
pgbench -c 128 -j 16 -S -T 300 bench    tps = 21309.615001 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench    tps = 5787.310115 (including ...
pgbench -c 2 -j 1 -S -T 300 bench    tps = 8747.104236 (including ...
pgbench -c 8 -j 1 -S -T 300 bench    tps = 14655.369995 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 28287.254924 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 61614.715187 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 79754.640518 (including ...
pgbench -c 96 -j 12 -S -T 300 bench    tps = 40334.994324 (including ...
pgbench -c 128 -j 16 -S -T 300 bench    tps = 23285.271257 (including ...

With modified TAS macro (see patch 1 below):

pgbench -c 1 -j 1 -S -T 300 bench    tps = 6171.454468 (including ...
pgbench -c 2 -j 1 -S -T 300 bench    tps = 8709.003728 (including ...
pgbench -c 8 -j 1 -S -T 300 bench    tps = 14902.731035 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 29789.744482 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 59991.549128 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 117369.287466 (including ...
pgbench -c 96 -j 12 -S -T 300 bench    tps = 112583.144495 (including ...
pgbench -c 128 -j 16 -S -T 300 bench    tps = 110231.305282 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench    tps = 5670.097936 (including ...
pgbench -c 2 -j 1 -S -T 300 bench    tps = 8230.786940 (including ...
pgbench -c 8 -j 1 -S -T 300 bench    tps = 14785.952481 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 29335.875139 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 59605.433837 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 108884.294519 (including ...
pgbench -c 96 -j 12 -S -T 300 bench    tps = 110387.439978 (including ...
pgbench -c 128 -j 16 -S -T 300 bench    tps = 109046.121191 (including ...

With unlocked test in s_lock.c delay loop only (patch 2 below):

pgbench -c 1 -j 1 -S -T 300 bench    tps = 5426.491088 (including ...
pgbench -c 2 -j 1 -S -T 300 bench    tps = 8787.939425 (including ...
pgbench -c 8 -j 1 -S -T 300 bench    tps = 15720.801359 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 33711.102718 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 61829.180234 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 109781.655020 (including ...
pgbench -c 96 -j 12 -S -T 300 bench    tps = 107132.848280 (including ...
pgbench -c 128 -j 16 -S -T 300 bench    tps = 106533.630986 (including ...

run 2:

pgbench -c 1 -j 1 -S -T 300 bench    tps = 5705.283316 (including ...
pgbench -c 2 -j 1 -S -T 300 bench    tps = 8442.798662 (including ...
pgbench -c 8 -j 1 -S -T 300 bench    tps = 14423.723837 (including ...
pgbench -c 16 -j 2 -S -T 300 bench    tps = 29112.751995 (including ...
pgbench -c 32 -j 4 -S -T 300 bench    tps = 62258.984033 (including ...
pgbench -c 64 -j 8 -S -T 300 bench    tps = 107741.988800 (including ...
pgbench -c 96 -j 12 -S -T 300 bench    tps = 107138.968981 (including ...
pgbench -c 128 -j 16 -S -T 300 bench    tps = 106110.215138 (including ...

So this pretty well confirms Robert's results, in particular that all of
the win from an unlocked test comes from using it in the delay loop.
Given the lack of evidence that a general change in TAS() is beneficial,
I'm inclined to vote against it, on the grounds that the extra test is
surely a loss at some level when there is no contention.
(IOW, +1 for inventing a second macro to use in the delay loop only.)
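
To make the second-macro idea concrete, here is a minimal sketch of what
it might look like (the name TAS_SPIN and the fallback definition are
only illustrative; platforms could override it the same way they
override TAS):

/*
 * TAS_SPIN() is what the delay loop would use: do a plain, non-locking
 * read first, and attempt the atomic test-and-set only once the lock
 * appears to be free.  This avoids hammering the cache line with locked
 * bus cycles while some other processor holds the lock, which is where
 * all the win in the numbers above comes from.
 */
#ifndef TAS_SPIN
#define TAS_SPIN(lock)	(*(lock) ? 1 : TAS(lock))
#endif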

We ought to do similar tests on other architectures.  I found some
lots-o-processors x86_64 machines at Red Hat, but they don't seem to
own any PPC systems with more than 8 processors.  Anybody have big
iron with other non-Intel chips?
        regards, tom lane


Patch 1: change TAS globally, non-HPUX code:

*** src/include/storage/s_lock.h.orig	Sat Jan  1 13:27:24 2011
--- src/include/storage/s_lock.h	Sun Aug 28 13:32:47 2011
***************
*** 228,233 ****
--- 228,240 ----
  {
  	long int	ret;
  
+ 	/*
+ 	 * Use a non-locking test before the locking instruction proper.  This
+ 	 * appears to be a very significant win on many-core IA64.
+ 	 */
+ 	if (*lock)
+ 		return 1;
+ 
  	__asm__ __volatile__(
  		"	xchg4 	%0=%1,%2	\n"
  :		"=r"(ret), "+m"(*lock)
***************
*** 243,248 ****
--- 250,262 ----
  {
  	int		ret;
  
+ 	/*
+ 	 * Use a non-locking test before the locking instruction proper.  This
+ 	 * appears to be a very significant win on many-core IA64.
+ 	 */
+ 	if (*lock)
+ 		return 1;
+ 
  	ret = _InterlockedExchange(lock,1);	/* this is a xchg asm macro */
  
  	return ret;
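
For readability, here is the first hunk applied, i.e. the patched
gcc-flavor IA64 tas() as a whole.  This is reassembled from the diff
context; the asm input/clobber lines beyond what the hunk shows are
filled in from the surrounding s_lock.h code, so treat them as a
reconstruction rather than a quote:

static __inline__ int
tas(volatile slock_t *lock)
{
	long int	ret;

	/*
	 * Use a non-locking test before the locking instruction proper.  This
	 * appears to be a very significant win on many-core IA64.
	 */
	if (*lock)
		return 1;

	__asm__ __volatile__(
		"	xchg4 	%0=%1,%2	\n"
:		"=r"(ret), "+m"(*lock)
:		"r"(1)
:		"memory");
	return (int) ret;
}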

Patch 2: change s_lock only (same as Robert's quick hack):

*** src/backend/storage/lmgr/s_lock.c.orig	Sat Jan  1 13:27:09 2011
--- src/backend/storage/lmgr/s_lock.c	Sun Aug 28 14:02:29 2011
***************
*** 96,102 ****
  	int			delays = 0;
  	int			cur_delay = 0;
  
! 	while (TAS(lock))
  	{
  		/* CPU-specific delay each time through the loop */
  		SPIN_DELAY();
--- 96,102 ----
  	int			delays = 0;
  	int			cur_delay = 0;
  
! 	while (*lock ? 1 : TAS(lock))
  	{
  		/* CPU-specific delay each time through the loop */
  		SPIN_DELAY();
