Spinlocks, yet again: analysis and proposed patches - Mailing list pgsql-hackers

The test case I just posted shows that our spinlock code, which we had
thought largely done, is once again becoming a performance bottleneck.
It's time to resurrect some of the ideas we kicked around in early
2002, and didn't pursue because we decided spinlocks weren't our major
performance problem.

I started investigating this by trying to understand why the "futex
spinlock" patch had helped the Red Hat customer I mentioned, when it
hadn't done anything much in last year's experiments.  It turns out
that this customer was using an 8-way Opteron, and the futex patch
helps a lot more on that architecture than others.  Some random
experimentation eventually turned up part of the reason: the x86 TAS
assembly code we are using is pessimal for x86_64.  Specifically,
s_lock.h uses this for both architectures:

    /* Use a non-locking test before asserting the bus lock */
    __asm__ __volatile__(
        "    cmpb    $0,%1    \n"
        "    jne        1f    \n"
        "    lock             \n"
        "    xchgb    %0,%1   \n"
        "1: \n"

and it turns out that deleting the cmpb and jne makes things go
significantly faster on Opterons.  So the "non-locking test" is
not a win at all on x86_64.  As best I can tell from testing,
removing it wins because it causes the rate of spinlock delays
to drop significantly.  Why this should be is not completely
obvious.  I speculate that the Opterons are capable of pulling
in a copy of the cache line that contains the spinlock and then
running 100 iterations of the cmpb test without ever noticing
that the processor holding the lock has now released it.  Without
the cmpb, they are forced by the locking xchgb test to actually
look at the real state of the spinlock each time.

I kinda suspect that the cmpb test is a no-op or loss on all
Intelish processors: it can only be a win if you expect a lot
of contention for the spin lock, but in percentage terms we still
have a very low conflict rate, so in most executions of the TAS
macro, the cmpb is just wasted cycles.  Bottom line: we definitely
don't want it for x86_64, and maybe not at all, but we need more
research to decide the latter.

The second reason that the futex patch is helping is that when
a spinlock delay does occur, it allows the delaying process to be
awoken almost immediately, rather than delaying 10 msec or more
as the existing code does.  However, given that we are only expecting
the spinlock to be held for a couple dozen instructions, using the
kernel futex mechanism is huge overkill --- the in-kernel overhead
to manage the futex state is almost certainly several orders of
magnitude more than the delay we actually want.

I looked into several other methods of doing the spinlock delay
instead.  I think all of these were suggested at one point or
another in our earlier discussions of spinlocks:

1. Use sched_yield() if available: it does just what we want,
ie, yield the processor without forcing any useless time delay
before we can be rescheduled.  This doesn't exist everywhere
but it exists in recent Linuxen, so I tried it.  It made for a
beautiful improvement in my test case numbers: CPU utilization
went to 100% and the context swap rate to almost nil.  Unfortunately,
I also saw fairly frequent "stuck spinlock" panics when running
more queries than there were processors --- this despite increasing
NUM_DELAYS to 10000 in s_lock.c.  So I don't trust sched_yield
anymore.  Whatever it's doing in Linux 2.6 isn't what you'd expect.
(I speculate that it's set up to only yield the processor to other
processes already affiliated to that processor.  In any case, it
is definitely capable of getting through 10000 yields without
running the guy who's holding the spinlock.)

2. Reduce the length of the select() delays in s_lock.  The current code
delays in quanta of 10msec, but on recent Linuxen (and I think other
platforms too) the scheduling quantum is 1msec, so we can get the
processor back faster if we ask for a 1msec delay.  I tried this and it
is definitely a win on Linux; it doesn't seem to hurt anything on older
systems either, they just round the delay up to 10msec like before.
So I think we should do this, even though it's only a partial answer.

3. Modify the spin loop to do a little more work between TAS() tests.
In the existing code there are only about half a dozen instructions
total in the normal spin loop, most of them very fast, with the result
that the spinning processor does its best to monopolize the system bus
with locked probe instructions.  This is obviously not good, as it'll
interfere with the ability of the spinlock holder to complete its work
and release the lock.  (The bulk of the spinlock uses are for LWLocks,
and with the current data structures the LWLock's own state is usually
going to be in the same cache line as the spinlock proper, so that
cache grabs for the spinlock hurt the LWLock state updating as well
as the eventual spinlock release.)  I modified the code to use an
integer modulo operation as part of the test for SPINS_PER_DELAY;
on most processors this should keep the processor's attention and
keep it away from the system bus for several times longer than an
average instruction.  (It's relevant here that the "rep;nop" used
to slow the spin loop on recent Intel processors appears to be
completely ignored by Opterons.  Basically we need a non-hardware-
specific trick to slow the loop.)

4. Modify SPINS_PER_DELAY.  Looping more times instead of entering the
kernel is a win for SMP boxes: whenever you give up and call the kernel
to delay, it's likely that the spinlock holder will have released the
lock before you even get all the way into the kernel, let alone get
rescheduled again.  I found that pushing SPINS_PER_DELAY above 200
caused the select delay rate to fall to almost nil in this test case.
However, that would be counterproductive for uniprocessors where the
optimal strategy is normally to give up the processor immediately.
We've talked about making the spin count user configurable, but I never
cared for that answer.  Instead, I've worked up some code to
automatically adjust the spin count depending on whether we seem to be
in a uniprocessor or multiprocessor.  The idea is to decrease the spin
count (down to a lower limit) whenever s_lock has to delay in order to
get a lock, and to increase the count (up to a limit) whenever we see
the lock come free before we've had to delay.  The former condition
suggests that we are in a uniprocessor and the latter suggests we are in
a multiprocessor.  I recall having complained that this idea would
probably be unstable, but in testing it seems to do pretty nicely at
converging to the minimum spin count on a 1-CPU machine and the maximum
limit on an SMP box.  It could do with more testing though to see if it
works the same for other people.

My proposal therefore is to do #2, #3, and #4, and to modify the TAS
assembly code at least on x86_64.  Together, these changes bring
my test case on a 4-way Opteron from

N, runtime:    1 36s    2 61s    4 105s    8 198s

to

N, runtime:    1 35s    2 48s    4 58s    8 105s

At 4 queries, CPU utilization is 100% and the context swap rate nearly
nil, which seems to mean we've gained another (temporary no doubt)
victory over the spinlock problem.

I attach two proposed patches: the first removes the cmpb/jne from
the x86 TAS assembly code, and the second one does the s_lock changes
enumerated as points #2, #3, #4.  The first one in particular needs
more testing to see if it hurts performance on any non-Opteron x86
chips.  (If so, we'd just make it conditional to x86_64.)

Comments and testing invited.

            regards, tom lane

*** REL8_0/src/include/storage/s_lock.h    Sun Aug 28 20:41:44 2005
--- new/s_lock.h    Fri Sep  9 14:58:44 2005
***************
*** 120,132 ****
  {
      register slock_t _res = 1;

!     /* Use a non-locking test before asserting the bus lock */
      __asm__ __volatile__(
-         "    cmpb    $0,%1    \n"
-         "    jne        1f        \n"
-         "    lock            \n"
          "    xchgb    %0,%1    \n"
-         "1: \n"
  :        "+q"(_res), "+m"(*lock)
  :
  :        "memory", "cc");
--- 120,128 ----
  {
      register slock_t _res = 1;

!     /* xchg implies a LOCK prefix, so no need to say LOCK explicitly */
      __asm__ __volatile__(
          "    xchgb    %0,%1    \n"
  :        "+q"(_res), "+m"(*lock)
  :
  :        "memory", "cc");
*** /home/postgres/pgsql/src/backend/storage/lmgr/proc.c.orig    Sat Aug 20 19:26:24 2005
--- /home/postgres/pgsql/src/backend/storage/lmgr/proc.c    Sun Sep 11 15:07:20 2005
***************
*** 170,175 ****
--- 170,177 ----

          ProcGlobal->freeProcs = INVALID_OFFSET;

+         ProcGlobal->spins_per_delay = DEFAULT_SPINS_PER_DELAY;
+
          /*
           * Pre-create the PGPROC structures and create a semaphore for
           * each.
***************
*** 224,232 ****
--- 226,239 ----
      /*
       * Try to get a proc struct from the free list.  If this fails, we
       * must be out of PGPROC structures (not to mention semaphores).
+      *
+      * While we are holding the ProcStructLock, also copy the current
+      * shared estimate of spins_per_delay to local storage.
       */
      SpinLockAcquire(ProcStructLock);

+     set_spins_per_delay(procglobal->spins_per_delay);
+
      myOffset = procglobal->freeProcs;

      if (myOffset != INVALID_OFFSET)
***************
*** 318,338 ****

      Assert(proctype >= 0 && proctype < NUM_DUMMY_PROCS);

      dummyproc = &DummyProcs[proctype];

      /*
       * dummyproc should not presently be in use by anyone else
       */
      if (dummyproc->pid != 0)
          elog(FATAL, "DummyProc[%d] is in use by PID %d",
               proctype, dummyproc->pid);
      MyProc = dummyproc;

      /*
       * Initialize all fields of MyProc, except MyProc->sem which was set
       * up by InitProcGlobal.
       */
-     MyProc->pid = MyProcPid;    /* marks dummy proc as in use by me */
      SHMQueueElemInit(&(MyProc->links));
      MyProc->waitStatus = STATUS_OK;
      MyProc->xid = InvalidTransactionId;
--- 325,362 ----

      Assert(proctype >= 0 && proctype < NUM_DUMMY_PROCS);

+     /*
+      * Just for paranoia's sake, we use the ProcStructLock to protect
+      * assignment and releasing of DummyProcs entries.
+      *
+      * While we are holding the ProcStructLock, also copy the current
+      * shared estimate of spins_per_delay to local storage.
+      */
+     SpinLockAcquire(ProcStructLock);
+
+     set_spins_per_delay(ProcGlobal->spins_per_delay);
+
      dummyproc = &DummyProcs[proctype];

      /*
       * dummyproc should not presently be in use by anyone else
       */
      if (dummyproc->pid != 0)
+     {
+         SpinLockRelease(ProcStructLock);
          elog(FATAL, "DummyProc[%d] is in use by PID %d",
               proctype, dummyproc->pid);
+     }
      MyProc = dummyproc;

+     MyProc->pid = MyProcPid;    /* marks dummy proc as in use by me */
+
+     SpinLockRelease(ProcStructLock);
+
      /*
       * Initialize all fields of MyProc, except MyProc->sem which was set
       * up by InitProcGlobal.
       */
      SHMQueueElemInit(&(MyProc->links));
      MyProc->waitStatus = STATUS_OK;
      MyProc->xid = InvalidTransactionId;
***************
*** 509,514 ****
--- 533,541 ----
      /* PGPROC struct isn't mine anymore */
      MyProc = NULL;

+     /* Update shared estimate of spins_per_delay */
+     procglobal->spins_per_delay = update_spins_per_delay(procglobal->spins_per_delay);
+
      SpinLockRelease(ProcStructLock);
  }

***************
*** 532,542 ****
--- 559,576 ----
      /* Release any LW locks I am holding (see notes above) */
      LWLockReleaseAll();

+     SpinLockAcquire(ProcStructLock);
+
      /* Mark dummy proc no longer in use */
      MyProc->pid = 0;

      /* PGPROC struct isn't mine anymore */
      MyProc = NULL;
+
+     /* Update shared estimate of spins_per_delay */
+     ProcGlobal->spins_per_delay = update_spins_per_delay(ProcGlobal->spins_per_delay);
+
+     SpinLockRelease(ProcStructLock);
  }


*** /home/postgres/pgsql/src/backend/storage/lmgr/s_lock.c.orig    Fri Aug 26 10:47:35 2005
--- /home/postgres/pgsql/src/backend/storage/lmgr/s_lock.c    Sun Sep 11 17:43:20 2005
***************
*** 21,26 ****
--- 21,30 ----
  #include "storage/s_lock.h"
  #include "miscadmin.h"

+
+ static int    spins_per_delay = DEFAULT_SPINS_PER_DELAY;
+
+
  /*
   * s_lock_stuck() - complain about a stuck spinlock
   */
***************
*** 49,102 ****
       * We loop tightly for awhile, then delay using pg_usleep() and try
       * again. Preferably, "awhile" should be a small multiple of the
       * maximum time we expect a spinlock to be held.  100 iterations seems
!      * about right.  In most multi-CPU scenarios, the spinlock is probably
!      * held by a process on another CPU and will be released before we
!      * finish 100 iterations.  However, on a uniprocessor, the tight loop
!      * is just a waste of cycles, so don't iterate thousands of times.
       *
       * Once we do decide to block, we use randomly increasing pg_usleep()
!      * delays. The first delay is 10 msec, then the delay randomly
!      * increases to about one second, after which we reset to 10 msec and
       * start again.  The idea here is that in the presence of heavy
       * contention we need to increase the delay, else the spinlock holder
       * may never get to run and release the lock.  (Consider situation
       * where spinlock holder has been nice'd down in priority by the
       * scheduler --- it will not get scheduled until all would-be
!      * acquirers are sleeping, so if we always use a 10-msec sleep, there
       * is a real possibility of starvation.)  But we can't just clamp the
       * delay to an upper bound, else it would take a long time to make a
       * reasonable number of tries.
       *
       * We time out and declare error after NUM_DELAYS delays (thus, exactly
       * that many tries).  With the given settings, this will usually take
!      * 3 or so minutes.  It seems better to fix the total number of tries
       * (and thus the probability of unintended failure) than to fix the
       * total time spent.
       *
!      * The pg_usleep() delays are measured in centiseconds (0.01 sec) because
!      * 10 msec is a common resolution limit at the OS level.
       */
! #define SPINS_PER_DELAY        100
  #define NUM_DELAYS            1000
! #define MIN_DELAY_CSEC        1
! #define MAX_DELAY_CSEC        100

!     int            spins = 0;
      int            delays = 0;
!     int            cur_delay = MIN_DELAY_CSEC;

      while (TAS(lock))
      {
          /* CPU-specific delay each time through the loop */
          SPIN_DELAY();

!         /* Block the process every SPINS_PER_DELAY tries */
!         if (++spins > SPINS_PER_DELAY)
          {
              if (++delays > NUM_DELAYS)
                  s_lock_stuck(lock, file, line);

!             pg_usleep(cur_delay * 10000L);

  #if defined(S_LOCK_TEST)
              fprintf(stdout, "*");
--- 53,121 ----
       * We loop tightly for awhile, then delay using pg_usleep() and try
       * again. Preferably, "awhile" should be a small multiple of the
       * maximum time we expect a spinlock to be held.  100 iterations seems
!      * about right as an initial guess.  However, on a uniprocessor the
!      * loop is a waste of cycles, while in a multi-CPU scenario it's usually
!      * better to spin a bit longer than to call the kernel, so we try to
!      * adapt the spin loop count depending on whether we seem to be in
!      * a uniprocessor or multiprocessor.
       *
       * Once we do decide to block, we use randomly increasing pg_usleep()
!      * delays. The first delay is 1 msec, then the delay randomly
!      * increases to about one second, after which we reset to 1 msec and
       * start again.  The idea here is that in the presence of heavy
       * contention we need to increase the delay, else the spinlock holder
       * may never get to run and release the lock.  (Consider situation
       * where spinlock holder has been nice'd down in priority by the
       * scheduler --- it will not get scheduled until all would-be
!      * acquirers are sleeping, so if we always use a 1-msec sleep, there
       * is a real possibility of starvation.)  But we can't just clamp the
       * delay to an upper bound, else it would take a long time to make a
       * reasonable number of tries.
       *
       * We time out and declare error after NUM_DELAYS delays (thus, exactly
       * that many tries).  With the given settings, this will usually take
!      * 2 or so minutes.  It seems better to fix the total number of tries
       * (and thus the probability of unintended failure) than to fix the
       * total time spent.
       *
!      * The pg_usleep() delays are measured in milliseconds because 1 msec
!      * is a common resolution limit at the OS level for newer platforms.
!      * On older platforms the resolution limit is usually 10 msec, in
!      * which case the total delay before timeout will be a bit more.
       */
! #define MIN_SPINS_PER_DELAY    10
! #define MAX_SPINS_PER_DELAY    1000
  #define NUM_DELAYS            1000
! #define MIN_DELAY_MSEC        1
! #define MAX_DELAY_MSEC        1000

!     int            spins = spins_per_delay;
      int            delays = 0;
!     int            cur_delay = 0;

      while (TAS(lock))
      {
          /* CPU-specific delay each time through the loop */
          SPIN_DELAY();

!         /*
!          * Block the process every spins_per_delay tries.
!          *
!          * What we are really testing for here is spins being decremented
!          * to zero.  We insert an unnecessary integer modulo operation
!          * into the test because we'd like this loop to run longer than
!          * just two or three instructions: ideally, the processor should
!          * not be contending for the system bus for a little while here.
!          */
!         if ((--spins % MAX_SPINS_PER_DELAY) == 0)
          {
              if (++delays > NUM_DELAYS)
                  s_lock_stuck(lock, file, line);

!             if (cur_delay == 0)    /* first time to delay? */
!                 cur_delay = MIN_DELAY_MSEC;
!
!             pg_usleep(cur_delay * 1000L);

  #if defined(S_LOCK_TEST)
              fprintf(stdout, "*");
***************
*** 107,119 ****
              cur_delay += (int) (cur_delay *
                (((double) random()) / ((double) MAX_RANDOM_VALUE)) + 0.5);
              /* wrap back to minimum delay when max is exceeded */
!             if (cur_delay > MAX_DELAY_CSEC)
!                 cur_delay = MIN_DELAY_CSEC;

!             spins = 0;
          }
      }
  }

  /*
   * Various TAS implementations that cannot live in s_lock.h as no inline
--- 126,200 ----
              cur_delay += (int) (cur_delay *
                (((double) random()) / ((double) MAX_RANDOM_VALUE)) + 0.5);
              /* wrap back to minimum delay when max is exceeded */
!             if (cur_delay > MAX_DELAY_MSEC)
!                 cur_delay = MIN_DELAY_MSEC;

!             spins = spins_per_delay;
          }
      }
+
+     /*
+      * If we were able to acquire the lock without delaying, it's a good
+      * indication we are in a multiprocessor.  If we had to delay, it's
+      * a sign (but not a sure thing) that we are in a uniprocessor.
+      * Hence, we decrement spins_per_delay slowly when we had to delay,
+      * and increase it rapidly when we didn't.  It's expected that
+      * spins_per_delay will converge to the minimum value on a uniprocessor
+      * and to the maximum value on a multiprocessor.
+      *
+      * Note: spins_per_delay is local within our current process.
+      * We want to average these observations across multiple backends,
+      * since it's relatively rare for this function to even get entered,
+      * and so a single backend might not live long enough to converge on
+      * a good value.  That is handled by the two routines below.
+      */
+     if (cur_delay == 0)
+     {
+         /* we never had to delay */
+         spins_per_delay += 100;
+         spins_per_delay = Min(spins_per_delay, MAX_SPINS_PER_DELAY);
+     }
+     else
+     {
+         spins_per_delay -= 1;
+         spins_per_delay = Max(spins_per_delay, MIN_SPINS_PER_DELAY);
+     }
+ }
+
+
+ /*
+  * Set local copy of spins_per_delay during backend startup.
+  *
+  * NB: this has to be pretty fast as it is called while holding a spinlock
+  */
+ void
+ set_spins_per_delay(int shared_spins_per_delay)
+ {
+     spins_per_delay = shared_spins_per_delay;
+ }
+
+ /*
+  * Update shared estimate of spins_per_delay during backend exit.
+  *
+  * NB: this has to be pretty fast as it is called while holding a spinlock
+  */
+ int
+ update_spins_per_delay(int shared_spins_per_delay)
+ {
+     /*
+      * We use an exponential moving average with a relatively slow
+      * adaption rate, so that noise in any one backend's result won't
+      * affect the shared value too much.  As long as both inputs are
+      * within the allowed range, the result must be too, so we need not
+      * worry about clamping the result.
+      *
+      * We deliberately truncate rather than rounding; this is so that
+      * single adjustments inside a backend can affect the shared estimate
+      * (see the asymmetric adjustment rules above).
+      */
+     return (shared_spins_per_delay * 15 + spins_per_delay) / 16;
  }
+

  /*
   * Various TAS implementations that cannot live in s_lock.h as no inline
*** /home/postgres/pgsql/src/include/storage/proc.h.orig    Sat Aug 20 19:26:34 2005
--- /home/postgres/pgsql/src/include/storage/proc.h    Sun Sep 11 15:07:09 2005
***************
*** 105,110 ****
--- 105,112 ----
  {
      /* Head of list of free PGPROC structures */
      SHMEM_OFFSET freeProcs;
+     /* Current shared estimate of appropriate spins_per_delay value */
+     int            spins_per_delay;
  } PROC_HDR;


*** /home/postgres/pgsql/src/include/storage/s_lock.h.orig    Sun Aug 28 20:41:34 2005
--- /home/postgres/pgsql/src/include/storage/s_lock.h    Sun Sep 11 15:07:09 2005
***************
*** 826,829 ****
--- 826,835 ----
   */
  extern void s_lock(volatile slock_t *lock, const char *file, int line);

+ /* Support for dynamic adjustment of spins_per_delay */
+ #define DEFAULT_SPINS_PER_DELAY  100
+
+ extern void set_spins_per_delay(int shared_spins_per_delay);
+ extern int    update_spins_per_delay(int shared_spins_per_delay);
+
  #endif     /* S_LOCK_H */

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Spinlocks, yet again: a new test case
Next
From: Kurt Roeckx
Date:
Subject: Re: Spinlocks, yet again: analysis and proposed patches