Thread: Spinlocks, yet again: analysis and proposed patches

Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
The test case I just posted shows that our spinlock code, which we had
thought largely done, is once again becoming a performance bottleneck.
It's time to resurrect some of the ideas we kicked around in early
2002, and didn't pursue because we decided spinlocks weren't our major
performance problem.

I started investigating this by trying to understand why the "futex
spinlock" patch had helped the Red Hat customer I mentioned, when it
hadn't done anything much in last year's experiments.  It turns out
that this customer was using an 8-way Opteron, and the futex patch
helps a lot more on that architecture than others.  Some random
experimentation eventually turned up part of the reason: the x86 TAS
assembly code we are using is pessimal for x86_64.  Specifically,
s_lock.h uses this for both architectures:

    /* Use a non-locking test before asserting the bus lock */
    __asm__ __volatile__(
        "    cmpb    $0,%1    \n"
        "    jne        1f    \n"
        "    lock             \n"
        "    xchgb    %0,%1   \n"
        "1: \n"

and it turns out that deleting the cmpb and jne makes things go
significantly faster on Opterons.  So the "non-locking test" is
not a win at all on x86_64.  As best I can tell from testing,
removing it wins because it causes the rate of spinlock delays
to drop significantly.  Why this should be is not completely
obvious.  I speculate that the Opterons are capable of pulling
in a copy of the cache line that contains the spinlock and then
running 100 iterations of the cmpb test without ever noticing
that the processor holding the lock has now released it.  Without
the cmpb, they are forced by the locking xchgb test to actually
look at the real state of the spinlock each time.
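
For reference, here is what the simplified TAS boils down to (a sketch
of what the first patch attached below leaves in s_lock.h; slock_t is a
single byte on these platforms):

    static __inline__ int
    tas(volatile slock_t *lock)
    {
        register slock_t _res = 1;

        /* xchg with a memory operand implies LOCK; no explicit prefix needed */
        __asm__ __volatile__(
            "    xchgb    %0,%1    \n"
    :        "+q"(_res), "+m"(*lock)
    :
    :        "memory", "cc");
        return (int) _res;        /* zero means we acquired the lock */
    }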

I kinda suspect that the cmpb test is a no-op or loss on all
Intelish processors: it can only be a win if you expect a lot
of contention for the spin lock, but in percentage terms we still
have a very low conflict rate, so in most executions of the TAS
macro, the cmpb is just wasted cycles.  Bottom line: we definitely
don't want it for x86_64, and maybe not at all, but we need more
research to decide the latter.

The second reason that the futex patch is helping is that when
a spinlock delay does occur, it allows the delaying process to be
awoken almost immediately, rather than delaying 10 msec or more
as the existing code does.  However, given that we are only expecting
the spinlock to be held for a couple dozen instructions, using the
kernel futex mechanism is huge overkill --- the in-kernel overhead
to manage the futex state is almost certainly several orders of
magnitude more than the delay we actually want.

I looked into several other methods of doing the spinlock delay
instead.  I think all of these were suggested at one point or
another in our earlier discussions of spinlocks:

1. Use sched_yield() if available: it does just what we want,
ie, yield the processor without forcing any useless time delay
before we can be rescheduled.  This doesn't exist everywhere
but it exists in recent Linuxen, so I tried it.  It made for a
beautiful improvement in my test case numbers: CPU utilization
went to 100% and the context swap rate to almost nil.  Unfortunately,
I also saw fairly frequent "stuck spinlock" panics when running
more queries than there were processors --- this despite increasing
NUM_DELAYS to 10000 in s_lock.c.  So I don't trust sched_yield
anymore.  Whatever it's doing in Linux 2.6 isn't what you'd expect.
(I speculate that it's set up to only yield the processor to other
processes already affiliated to that processor.  In any case, it
is definitely capable of getting through 10000 yields without
running the guy who's holding the spinlock.)
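
For concreteness, the experiment amounted to replacing the pg_usleep()
call in s_lock()'s delay branch with a yield, roughly like the sketch
below (this is not one of the attached patches); it was this variant
that produced the stuck-spinlock failures:

    #include <sched.h>

        if (++delays > NUM_DELAYS)
            s_lock_stuck(lock, file, line);

        /* yield the CPU rather than sleeping for a fixed interval */
        sched_yield();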

2. Reduce the length of the select() delays in s_lock.  The current code
delays in quanta of 10msec, but on recent Linuxen (and I think other
platforms too) the scheduling quantum is 1msec, so we can get the
processor back faster if we ask for a 1msec delay.  I tried this and it
is definitely a win on Linux; it doesn't seem to hurt anything on older
systems either, they just round the delay up to 10msec like before.
So I think we should do this, even though it's only a partial answer.
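
In code terms the change is just the unit of the sleep inside s_lock()
(this is part of the attached spin-delay patch):

    /* before: cur_delay counted in centiseconds */
    pg_usleep(cur_delay * 10000L);

    /* after: cur_delay counted in milliseconds */
    pg_usleep(cur_delay * 1000L);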

3. Modify the spin loop to do a little more work between TAS() tests.
In the existing code there are only about half a dozen instructions
total in the normal spin loop, most of them very fast, with the result
that the spinning processor does its best to monopolize the system bus
with locked probe instructions.  This is obviously not good, as it'll
interfere with the ability of the spinlock holder to complete its work
and release the lock.  (The bulk of the spinlock uses are for LWLocks,
and with the current data structures the LWLock's own state is usually
going to be in the same cache line as the spinlock proper, so that
cache grabs for the spinlock hurt the LWLock state updating as well
as the eventual spinlock release.)  I modified the code to use an
integer modulo operation as part of the test for SPINS_PER_DELAY;
on most processors this should keep the processor's attention and
keep it away from the system bus for several times longer than an
average instruction.  (It's relevant here that the "rep;nop" used
to slow the spin loop on recent Intel processors appears to be
completely ignored by Opterons.  Basically we need a non-hardware-
specific trick to slow the loop.)
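
Concretely, the delay test in the spin loop becomes something like the
following (the full version is in the attached spin-delay patch):

    while (TAS(lock))
    {
        /* CPU-specific delay each time through the loop */
        SPIN_DELAY();

        /*
         * What we are really testing for is spins being decremented to
         * zero; the modulo changes nothing logically, but it keeps the
         * processor busy with integer work, and off the bus, for a few
         * extra cycles between locked probes.
         */
        if ((--spins % MAX_SPINS_PER_DELAY) == 0)
        {
            /* ... delay with pg_usleep(), then reset spins ... */
        }
    }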

4. Modify SPINS_PER_DELAY.  Looping more times instead of entering the
kernel is a win for SMP boxes: whenever you give up and call the kernel
to delay, it's likely that the spinlock holder will have released the
lock before you even get all the way into the kernel, let alone get
rescheduled again.  I found that pushing SPINS_PER_DELAY above 200
caused the select delay rate to fall to almost nil in this test case.
However, that would be counterproductive for uniprocessors where the
optimal strategy is normally to give up the processor immediately.
We've talked about making the spin count user configurable, but I never
cared for that answer.  Instead, I've worked up some code to
automatically adjust the spin count depending on whether we seem to be
in a uniprocessor or multiprocessor.  The idea is to decrease the spin
count (down to a lower limit) whenever s_lock has to delay in order to
get a lock, and to increase the count (up to a limit) whenever we see
the lock come free before we've had to delay.  The former condition
suggests that we are in a uniprocessor and the latter suggests we are in
a multiprocessor.  I recall having complained that this idea would
probably be unstable, but in testing it seems to do pretty nicely at
converging to the minimum spin count on a 1-CPU machine and the maximum
limit on an SMP box.  It could do with more testing though to see if it
works the same for other people.
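
The adjustment itself, done at the bottom of s_lock() in the attached
spin-delay patch, is essentially:

    if (cur_delay == 0)
    {
        /* we never had to delay: looks like a multiprocessor, spin longer */
        spins_per_delay = Min(spins_per_delay + 100, MAX_SPINS_PER_DELAY);
    }
    else
    {
        /* we had to delay: looks like a uniprocessor, spin less */
        spins_per_delay = Max(spins_per_delay - 1, MIN_SPINS_PER_DELAY);
    }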

My proposal therefore is to do #2, #3, and #4, and to modify the TAS
assembly code at least on x86_64.  Together, these changes bring
my test case on a 4-way Opteron from

N, runtime:    1 36s    2 61s    4 105s    8 198s

to

N, runtime:    1 35s    2 48s    4 58s    8 105s

At 4 queries, CPU utilization is 100% and the context swap rate nearly
nil, which seems to mean we've gained another (temporary no doubt)
victory over the spinlock problem.

I attach two proposed patches: the first removes the cmpb/jne from
the x86 TAS assembly code, and the second one does the s_lock changes
enumerated as points #2, #3, #4.  The first one in particular needs
more testing to see if it hurts performance on any non-Opteron x86
chips.  (If so, we'd just make it conditional to x86_64.)

Comments and testing invited.

            regards, tom lane

*** REL8_0/src/include/storage/s_lock.h    Sun Aug 28 20:41:44 2005
--- new/s_lock.h    Fri Sep  9 14:58:44 2005
***************
*** 120,132 ****
  {
      register slock_t _res = 1;

!     /* Use a non-locking test before asserting the bus lock */
      __asm__ __volatile__(
-         "    cmpb    $0,%1    \n"
-         "    jne        1f        \n"
-         "    lock            \n"
          "    xchgb    %0,%1    \n"
-         "1: \n"
  :        "+q"(_res), "+m"(*lock)
  :
  :        "memory", "cc");
--- 120,128 ----
  {
      register slock_t _res = 1;

!     /* xchg implies a LOCK prefix, so no need to say LOCK explicitly */
      __asm__ __volatile__(
          "    xchgb    %0,%1    \n"
  :        "+q"(_res), "+m"(*lock)
  :
  :        "memory", "cc");
*** /home/postgres/pgsql/src/backend/storage/lmgr/proc.c.orig    Sat Aug 20 19:26:24 2005
--- /home/postgres/pgsql/src/backend/storage/lmgr/proc.c    Sun Sep 11 15:07:20 2005
***************
*** 170,175 ****
--- 170,177 ----

          ProcGlobal->freeProcs = INVALID_OFFSET;

+         ProcGlobal->spins_per_delay = DEFAULT_SPINS_PER_DELAY;
+
          /*
           * Pre-create the PGPROC structures and create a semaphore for
           * each.
***************
*** 224,232 ****
--- 226,239 ----
      /*
       * Try to get a proc struct from the free list.  If this fails, we
       * must be out of PGPROC structures (not to mention semaphores).
+      *
+      * While we are holding the ProcStructLock, also copy the current
+      * shared estimate of spins_per_delay to local storage.
       */
      SpinLockAcquire(ProcStructLock);

+     set_spins_per_delay(procglobal->spins_per_delay);
+
      myOffset = procglobal->freeProcs;

      if (myOffset != INVALID_OFFSET)
***************
*** 318,338 ****

      Assert(proctype >= 0 && proctype < NUM_DUMMY_PROCS);

      dummyproc = &DummyProcs[proctype];

      /*
       * dummyproc should not presently be in use by anyone else
       */
      if (dummyproc->pid != 0)
          elog(FATAL, "DummyProc[%d] is in use by PID %d",
               proctype, dummyproc->pid);
      MyProc = dummyproc;

      /*
       * Initialize all fields of MyProc, except MyProc->sem which was set
       * up by InitProcGlobal.
       */
-     MyProc->pid = MyProcPid;    /* marks dummy proc as in use by me */
      SHMQueueElemInit(&(MyProc->links));
      MyProc->waitStatus = STATUS_OK;
      MyProc->xid = InvalidTransactionId;
--- 325,362 ----

      Assert(proctype >= 0 && proctype < NUM_DUMMY_PROCS);

+     /*
+      * Just for paranoia's sake, we use the ProcStructLock to protect
+      * assignment and releasing of DummyProcs entries.
+      *
+      * While we are holding the ProcStructLock, also copy the current
+      * shared estimate of spins_per_delay to local storage.
+      */
+     SpinLockAcquire(ProcStructLock);
+
+     set_spins_per_delay(ProcGlobal->spins_per_delay);
+
      dummyproc = &DummyProcs[proctype];

      /*
       * dummyproc should not presently be in use by anyone else
       */
      if (dummyproc->pid != 0)
+     {
+         SpinLockRelease(ProcStructLock);
          elog(FATAL, "DummyProc[%d] is in use by PID %d",
               proctype, dummyproc->pid);
+     }
      MyProc = dummyproc;

+     MyProc->pid = MyProcPid;    /* marks dummy proc as in use by me */
+
+     SpinLockRelease(ProcStructLock);
+
      /*
       * Initialize all fields of MyProc, except MyProc->sem which was set
       * up by InitProcGlobal.
       */
      SHMQueueElemInit(&(MyProc->links));
      MyProc->waitStatus = STATUS_OK;
      MyProc->xid = InvalidTransactionId;
***************
*** 509,514 ****
--- 533,541 ----
      /* PGPROC struct isn't mine anymore */
      MyProc = NULL;

+     /* Update shared estimate of spins_per_delay */
+     procglobal->spins_per_delay = update_spins_per_delay(procglobal->spins_per_delay);
+
      SpinLockRelease(ProcStructLock);
  }

***************
*** 532,542 ****
--- 559,576 ----
      /* Release any LW locks I am holding (see notes above) */
      LWLockReleaseAll();

+     SpinLockAcquire(ProcStructLock);
+
      /* Mark dummy proc no longer in use */
      MyProc->pid = 0;

      /* PGPROC struct isn't mine anymore */
      MyProc = NULL;
+
+     /* Update shared estimate of spins_per_delay */
+     ProcGlobal->spins_per_delay = update_spins_per_delay(ProcGlobal->spins_per_delay);
+
+     SpinLockRelease(ProcStructLock);
  }


*** /home/postgres/pgsql/src/backend/storage/lmgr/s_lock.c.orig    Fri Aug 26 10:47:35 2005
--- /home/postgres/pgsql/src/backend/storage/lmgr/s_lock.c    Sun Sep 11 17:43:20 2005
***************
*** 21,26 ****
--- 21,30 ----
  #include "storage/s_lock.h"
  #include "miscadmin.h"

+
+ static int    spins_per_delay = DEFAULT_SPINS_PER_DELAY;
+
+
  /*
   * s_lock_stuck() - complain about a stuck spinlock
   */
***************
*** 49,102 ****
       * We loop tightly for awhile, then delay using pg_usleep() and try
       * again. Preferably, "awhile" should be a small multiple of the
       * maximum time we expect a spinlock to be held.  100 iterations seems
!      * about right.  In most multi-CPU scenarios, the spinlock is probably
!      * held by a process on another CPU and will be released before we
!      * finish 100 iterations.  However, on a uniprocessor, the tight loop
!      * is just a waste of cycles, so don't iterate thousands of times.
       *
       * Once we do decide to block, we use randomly increasing pg_usleep()
!      * delays. The first delay is 10 msec, then the delay randomly
!      * increases to about one second, after which we reset to 10 msec and
       * start again.  The idea here is that in the presence of heavy
       * contention we need to increase the delay, else the spinlock holder
       * may never get to run and release the lock.  (Consider situation
       * where spinlock holder has been nice'd down in priority by the
       * scheduler --- it will not get scheduled until all would-be
!      * acquirers are sleeping, so if we always use a 10-msec sleep, there
       * is a real possibility of starvation.)  But we can't just clamp the
       * delay to an upper bound, else it would take a long time to make a
       * reasonable number of tries.
       *
       * We time out and declare error after NUM_DELAYS delays (thus, exactly
       * that many tries).  With the given settings, this will usually take
!      * 3 or so minutes.  It seems better to fix the total number of tries
       * (and thus the probability of unintended failure) than to fix the
       * total time spent.
       *
!      * The pg_usleep() delays are measured in centiseconds (0.01 sec) because
!      * 10 msec is a common resolution limit at the OS level.
       */
! #define SPINS_PER_DELAY        100
  #define NUM_DELAYS            1000
! #define MIN_DELAY_CSEC        1
! #define MAX_DELAY_CSEC        100

!     int            spins = 0;
      int            delays = 0;
!     int            cur_delay = MIN_DELAY_CSEC;

      while (TAS(lock))
      {
          /* CPU-specific delay each time through the loop */
          SPIN_DELAY();

!         /* Block the process every SPINS_PER_DELAY tries */
!         if (++spins > SPINS_PER_DELAY)
          {
              if (++delays > NUM_DELAYS)
                  s_lock_stuck(lock, file, line);

!             pg_usleep(cur_delay * 10000L);

  #if defined(S_LOCK_TEST)
              fprintf(stdout, "*");
--- 53,121 ----
       * We loop tightly for awhile, then delay using pg_usleep() and try
       * again. Preferably, "awhile" should be a small multiple of the
       * maximum time we expect a spinlock to be held.  100 iterations seems
!      * about right as an initial guess.  However, on a uniprocessor the
!      * loop is a waste of cycles, while in a multi-CPU scenario it's usually
!      * better to spin a bit longer than to call the kernel, so we try to
!      * adapt the spin loop count depending on whether we seem to be in
!      * a uniprocessor or multiprocessor.
       *
       * Once we do decide to block, we use randomly increasing pg_usleep()
!      * delays. The first delay is 1 msec, then the delay randomly
!      * increases to about one second, after which we reset to 1 msec and
       * start again.  The idea here is that in the presence of heavy
       * contention we need to increase the delay, else the spinlock holder
       * may never get to run and release the lock.  (Consider situation
       * where spinlock holder has been nice'd down in priority by the
       * scheduler --- it will not get scheduled until all would-be
!      * acquirers are sleeping, so if we always use a 1-msec sleep, there
       * is a real possibility of starvation.)  But we can't just clamp the
       * delay to an upper bound, else it would take a long time to make a
       * reasonable number of tries.
       *
       * We time out and declare error after NUM_DELAYS delays (thus, exactly
       * that many tries).  With the given settings, this will usually take
!      * 2 or so minutes.  It seems better to fix the total number of tries
       * (and thus the probability of unintended failure) than to fix the
       * total time spent.
       *
!      * The pg_usleep() delays are measured in milliseconds because 1 msec
!      * is a common resolution limit at the OS level for newer platforms.
!      * On older platforms the resolution limit is usually 10 msec, in
!      * which case the total delay before timeout will be a bit more.
       */
! #define MIN_SPINS_PER_DELAY    10
! #define MAX_SPINS_PER_DELAY    1000
  #define NUM_DELAYS            1000
! #define MIN_DELAY_MSEC        1
! #define MAX_DELAY_MSEC        1000

!     int            spins = spins_per_delay;
      int            delays = 0;
!     int            cur_delay = 0;

      while (TAS(lock))
      {
          /* CPU-specific delay each time through the loop */
          SPIN_DELAY();

!         /*
!          * Block the process every spins_per_delay tries.
!          *
!          * What we are really testing for here is spins being decremented
!          * to zero.  We insert an unnecessary integer modulo operation
!          * into the test because we'd like this loop to run longer than
!          * just two or three instructions: ideally, the processor should
!          * not be contending for the system bus for a little while here.
!          */
!         if ((--spins % MAX_SPINS_PER_DELAY) == 0)
          {
              if (++delays > NUM_DELAYS)
                  s_lock_stuck(lock, file, line);

!             if (cur_delay == 0)    /* first time to delay? */
!                 cur_delay = MIN_DELAY_MSEC;
!
!             pg_usleep(cur_delay * 1000L);

  #if defined(S_LOCK_TEST)
              fprintf(stdout, "*");
***************
*** 107,119 ****
              cur_delay += (int) (cur_delay *
                (((double) random()) / ((double) MAX_RANDOM_VALUE)) + 0.5);
              /* wrap back to minimum delay when max is exceeded */
!             if (cur_delay > MAX_DELAY_CSEC)
!                 cur_delay = MIN_DELAY_CSEC;

!             spins = 0;
          }
      }
  }

  /*
   * Various TAS implementations that cannot live in s_lock.h as no inline
--- 126,200 ----
              cur_delay += (int) (cur_delay *
                (((double) random()) / ((double) MAX_RANDOM_VALUE)) + 0.5);
              /* wrap back to minimum delay when max is exceeded */
!             if (cur_delay > MAX_DELAY_MSEC)
!                 cur_delay = MIN_DELAY_MSEC;

!             spins = spins_per_delay;
          }
      }
+
+     /*
+      * If we were able to acquire the lock without delaying, it's a good
+      * indication we are in a multiprocessor.  If we had to delay, it's
+      * a sign (but not a sure thing) that we are in a uniprocessor.
+      * Hence, we decrement spins_per_delay slowly when we had to delay,
+      * and increase it rapidly when we didn't.  It's expected that
+      * spins_per_delay will converge to the minimum value on a uniprocessor
+      * and to the maximum value on a multiprocessor.
+      *
+      * Note: spins_per_delay is local within our current process.
+      * We want to average these observations across multiple backends,
+      * since it's relatively rare for this function to even get entered,
+      * and so a single backend might not live long enough to converge on
+      * a good value.  That is handled by the two routines below.
+      */
+     if (cur_delay == 0)
+     {
+         /* we never had to delay */
+         spins_per_delay += 100;
+         spins_per_delay = Min(spins_per_delay, MAX_SPINS_PER_DELAY);
+     }
+     else
+     {
+         spins_per_delay -= 1;
+         spins_per_delay = Max(spins_per_delay, MIN_SPINS_PER_DELAY);
+     }
+ }
+
+
+ /*
+  * Set local copy of spins_per_delay during backend startup.
+  *
+  * NB: this has to be pretty fast as it is called while holding a spinlock
+  */
+ void
+ set_spins_per_delay(int shared_spins_per_delay)
+ {
+     spins_per_delay = shared_spins_per_delay;
+ }
+
+ /*
+  * Update shared estimate of spins_per_delay during backend exit.
+  *
+  * NB: this has to be pretty fast as it is called while holding a spinlock
+  */
+ int
+ update_spins_per_delay(int shared_spins_per_delay)
+ {
+     /*
+      * We use an exponential moving average with a relatively slow
+      * adaption rate, so that noise in any one backend's result won't
+      * affect the shared value too much.  As long as both inputs are
+      * within the allowed range, the result must be too, so we need not
+      * worry about clamping the result.
+      *
+      * We deliberately truncate rather than rounding; this is so that
+      * single adjustments inside a backend can affect the shared estimate
+      * (see the asymmetric adjustment rules above).
+      */
+     return (shared_spins_per_delay * 15 + spins_per_delay) / 16;
  }
+

  /*
   * Various TAS implementations that cannot live in s_lock.h as no inline
*** /home/postgres/pgsql/src/include/storage/proc.h.orig    Sat Aug 20 19:26:34 2005
--- /home/postgres/pgsql/src/include/storage/proc.h    Sun Sep 11 15:07:09 2005
***************
*** 105,110 ****
--- 105,112 ----
  {
      /* Head of list of free PGPROC structures */
      SHMEM_OFFSET freeProcs;
+     /* Current shared estimate of appropriate spins_per_delay value */
+     int            spins_per_delay;
  } PROC_HDR;


*** /home/postgres/pgsql/src/include/storage/s_lock.h.orig    Sun Aug 28 20:41:34 2005
--- /home/postgres/pgsql/src/include/storage/s_lock.h    Sun Sep 11 15:07:09 2005
***************
*** 826,829 ****
--- 826,835 ----
   */
  extern void s_lock(volatile slock_t *lock, const char *file, int line);

+ /* Support for dynamic adjustment of spins_per_delay */
+ #define DEFAULT_SPINS_PER_DELAY  100
+
+ extern void set_spins_per_delay(int shared_spins_per_delay);
+ extern int    update_spins_per_delay(int shared_spins_per_delay);
+
  #endif     /* S_LOCK_H */

Re: Spinlocks, yet again: analysis and proposed patches

From
Kurt Roeckx
Date:
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
> 
> I kinda suspect that the cmpb test is a no-op or loss on all
> Intelish processors: it can only be a win if you expect a lot
> of contention for the spin lock, but in percentage terms we still
> have a very low conflict rate, so in most executions of the TAS
> macro, the cmpb is just wasted cycles.  Bottom line: we definitely
> don't want it for x86_64, and maybe not at all, but we need more
> research to decide the latter.

I think an important question is whether this is for x86_64 in
general, or Opteron-specific.  It could be that it's not the same
on Intel's EM64Ts.

One reason this might behave differently for Opterons is that
they use a cc-NUMA architecture instead of "normal" SMP.

Something else to consider is the OS you're using.  I've been
told that Linux isn't that good in NUMA and FreeBSD might be
better.


Kurt



Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
Kurt Roeckx <kurt@roeckx.be> writes:
> On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
>> I kinda suspect that the cmpb test is a no-op or loss on all
>> Intelish processors:

> I think an important question is whether this is for x86_64 in
> general, or Opteron-specific.  It could be that it's not the same
> on Intel's EM64Ts.

Good point --- anyone have one to try?

> Something else to consider is the OS you're using.  I've been
> told that Linux isn't that good in NUMA and FreeBSD might be
> better.

It's hard to see how the OS could affect behavior at the level of
processor cache operations --- unless they did something truly
spectacularly stupid, like mark main memory non-cacheable.
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Kurt Roeckx <kurt@roeckx.be> writes:
> > On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
> >> I kinda suspect that the cmpb test is a no-op or loss on all
> >> Intelish processors:
>
> > I think an important question is whether this is for x86_64 in
> > general, or Opteron-specific.  It could be that it's not the same
> > on Intel's EM64Ts.
>
> Good point --- anyone have one to try?

I've got one I can test on.  I need to upgrade the kernel and some other
things on it though (it's running 2.6.8 atm, and an older
Debian/unstable which I should probably bring up to current).

I'll work on it starting now and post results once I get some.
Stephen

Re: Spinlocks, yet again: analysis and proposed patches

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> My proposal therefore is to do #2, #3, and #4, and to modify the TAS
> assembly code at least on x86_64.  Together, these changes bring
> my test case on a 4-way Opteron from
>
> N, runtime:    1 36s    2 61s    4 105s    8 198s

em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12:

N, runtime:    1 31s    2 47s    4 86s    8 159s

> to
>
> N, runtime:    1 35s    2 48s    4 58s    8 105s

N, runtime:    1 32s    2 53s    4 90s    8 169s

CPU utilization is definitely higher when running with the patch though.
Hope this helps, happy to do additional testing if necessary.
Thanks,
    Stephen

Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12:

> N, runtime:    1 31s    2 47s    4 86s    8 159s

> N, runtime:    1 32s    2 53s    4 90s    8 169s

Er, which (or both) of the two patches did you apply here?
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12:
>
> > N, runtime:    1 31s    2 47s    4 86s    8 159s
>
> > N, runtime:    1 32s    2 53s    4 90s    8 169s
>
> Er, which (or both) of the two patches did you apply here?

Applied both, sorry that wasn't clear.
Thanks,
    Stephen

Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Er, which (or both) of the two patches did you apply here?

> Applied both, sorry that wasn't clear.

Thanks.  If you've got the time, could you try the two patches
separately and see what you get?
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
>
> My proposal therefore is to do #2, #3, and #4, and to modify the TAS
> assembly code at least on x86_64.  Together, these changes bring
> my test case on a 4-way Opteron from
>

Some changes are based on tests and heuristics, so can we make them into the 
configure script so the choice could be made automatically?

Regards,
Qingqing 




Re: Spinlocks, yet again: analysis and proposed patches

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> >> Er, which (or both) of the two patches did you apply here?
>
> > Applied both, sorry that wasn't clear.
>
> Thanks.  If you've got the time, could you try the two patches
> separately and see what you get?

Sure.

CVS Head:

N, runtime:    1 31s    2 47s    4 86s    8 159s

With just slock-no-cmpb.patch:

N, runtime:    1 32s    2 39s    4 82s    8 167s

With just spin-delay.patch

N, runtime:    1 32s    2 52s    4 94s    8 164s

With both:

N, runtime:    1 32s    2 53s    4 90s    8 169s
Hope that helps,
    Stephen

Re: Spinlocks, yet again: analysis and proposed patches

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> > Something else to consider is the OS you're using.  I've been
> > told that Linux isn't that good in NUMA and FreeBSD might be
> > better.
> 
> It's hard to see how the OS could affect behavior at the level of
> processor cache operations --- unless they did something truly
> spectacularly stupid, like mark main memory non-cacheable.

Well it could schedule processes on processors in ways that force less than
optimal memory usage patterns.

But maybe you should tell the Altix folk with their 32-processor 384Gb NUMA
machines what you've "been told" about Linux not being that good in NUMA.
Really, these kinds of cargo-cult anecdotes are pretty pointless.

-- 
greg



Re: Spinlocks, yet again: analysis and proposed patches

From
Stephen Frost
Date:
* Stephen Frost (sfrost@snowman.net) wrote:
> > Thanks.  If you've got the time, could you try the two patches
> > separately and see what you get?
>
> Sure.
[...]

Just to be clear: this was from a completely default 'make install'
using the Debian configure options from 8.0.3 (which aren't
particularly interesting really: nls, integer-datetimes, debug,
disable-rpath, tcl, perl, python, pam, krb5, openssl, gnu-ld,
enable-thread-safety).  If it'd be useful for the test I can adjust
whichever parameters are appropriate (work_mem, cache_size, etc).

There's absolutely nothing else going on except for these tests, a few
shells, top, etc., on the box.
Enjoy,
    Stephen

Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
> Some changes are based on tests and heuristics, so can we make them into the 
> configure script so the choice could be made automatically?

It's a bit premature to propose that, when we don't yet know if the
suggested changes are a win or loss under any conditions, let alone
how to automatically tell the difference.  Right now we are in the
find-out-the-facts mode ...
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Mark Kirkwood
Date:
Tom Lane wrote:

> I attach two proposed patches: the first removes the cmpb/jne from
> the x86 TAS assembly code, and the second one does the s_lock changes
> enumerated as points #2, #3, #4.  The first one in particular needs
> more testing to see if it hurts performance on any non-Opteron x86
> chips.  (If so, we'd just make it conditional to x86_64.)
> 

2x PIII 1G 2G Freebsd 6.0Beta4

8.1beta1 (2005-08-28):

N runtime: 1 85s   2 139s  4 220s


8.1beta1 (2005-08-28) + patch 1 (s_lock.h only)

N runtime: 1 89s   2 137s  4 229s


8.1beta1 (2005-08-28) + patch 2

N runtime: 1 84s   2 108s  4 214s



Observe the interesting little speed improvement for patch 2 with 2 
processes (seems to be repeatable).

Let me know if you want to see vmstat output for any of these.

regards

Mark




Re: Spinlocks, yet again: analysis and proposed patches

From
"Michael Paesold"
Date:
Tom Lane wrote:
> Comments and testing invited.

I have tested the patches on a Dual Xeon 2.4 GHz w/ HT (no EM64T). 
(Configured with 
"CFLAGS='-O2 -mcpu=pentium4 -march=pentium4' --enable-casserts").
The results were pretty stable (varying by around .2 seconds). I would not trust the 
numbers for N=2: Linux, at least 2.4, is not good at avoiding scheduling two 
running processes on the two HTs of the same core. Those values also 
had the most variance (> 1s). All other measurements were quite stable over 
several runs.

CVS tip from 2005-09-12 ~16:00
1: 57s   2: 82s   4: 124s   8: 237s

with only slock-no-cmpb.patch applied
1: 55s   2: 79s   4: 119s   8: 229s

with only spin-delay.patch applied
1: 56s   2: 79s   4: 124s   8: 235s

with both patches applied
1: 55s   2: 78s   4: 124s   8: 235s


compare to 7.4.8 on the same machine ;-)
1: 92s   2: 235s  4: 474s   8: did not try ...


It seems to me the slock-no-cmpb is a win in any case. The spin-delay patch 
does not really help much on this machine. That seems to match Stephen 
Frost's results with EM64T, if I read them correctly.

The cs rate is about 150 on CVS tip without patches and below 100 with the 
patches (all three cases).
With 7.4.8 it's 230000-280000 with N>1. 8.1 is clearly the winner here. Great 
work, Tom.

I hope some more data helps.

Best Regards,
Michael Paesold



Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
"Michael Paesold" <mpaesold@gmx.at> writes:
> It seems to me the slock-no-cmpb is a win in any case. The spin-delay patch 
> does not really help much on this machine. That seems to match Stephen 
> Frost's results with EM64T, if I read them correctly.

Yeah, it's interesting that you both see slock-no-cmpb as saving some
cycles and the second patch as giving them back.  I wonder whether the
integer-modulo-to-slow-the-loop trick is counterproductive on your
machines.  You both seem to be using hardware that will recognize
rep;nop and maybe that's all that's needed.

I probably should have broken down the spindelay patch into multiple
components.  But it's only a small change --- could you try simplifying
the patched line
    if ((--spins % MAX_SPINS_PER_DELAY) == 0)

to
    if (--spins == 0)

and see how the patch does that way?
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
"Michael Paesold"
Date:
Tom Lane wrote:

> I probably should have broken down the spindelay patch into multiple
> components.  But it's only a small change --- could you try simplifying
> the patched line
> 
> if ((--spins % MAX_SPINS_PER_DELAY) == 0)
> 
> to
> 
> if (--spins == 0)
> 
> and see how the patch does that way?

I'll do tomorrow morning (CEST, i.e. in about 11 hours).
Best Regards,
Michael Paesold


Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
I wrote:
> ... and see how the patch does that way?

BTW, please do look at "vmstat 1" while running the test case
corresponding to your number of processors.  It's hard to tell from the
runtime alone whether the patch is fully accomplishing its goal of
reducing wasted cycles.  If you see user CPU percent go to 100 and
context swaps drop to a low value, though, you know it's working.
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Marko Kreen
Date:
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
> The second reason that the futex patch is helping is that when
> a spinlock delay does occur, it allows the delaying process to be
> awoken almost immediately, rather than delaying 10 msec or more
> as the existing code does.  However, given that we are only expecting
> the spinlock to be held for a couple dozen instructions, using the
> kernel futex mechanism is huge overkill --- the in-kernel overhead
> to manage the futex state is almost certainly several orders of
> magnitude more than the delay we actually want.

Why do you think so?  AFAIK in the uncontended case there will be no
kernel access, only an atomic inc/dec.  In the contended case you'll
want a task switch anyway, so the futex managing should not
matter.  Also, this mechanism is specifically optimized for
inter-process locking; I don't think you can get a more efficient
mechanism out of the side-effects of generic syscalls.

If you don't want Linux-specific locking in core code, then
it's another matter...


> 1. Use sched_yield() if available: it does just what we want,
> ie, yield the processor without forcing any useless time delay
> before we can be rescheduled.  This doesn't exist everywhere
> but it exists in recent Linuxen, so I tried it.  It made for a
> beautiful improvement in my test case numbers: CPU utilization
> went to 100% and the context swap rate to almost nil.  Unfortunately,
> I also saw fairly frequent "stuck spinlock" panics when running
> more queries than there were processors --- this despite increasing
> NUM_DELAYS to 10000 in s_lock.c.  So I don't trust sched_yield
> anymore.  Whatever it's doing in Linux 2.6 isn't what you'd expect.
> (I speculate that it's set up to only yield the processor to other
> processes already affiliated to that processor.  In any case, it
> is definitely capable of getting through 10000 yields without
> running the guy who's holding the spinlock.)

This is intended behaviour of sched_yield.

http://lwn.net/Articles/31462/
http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2


-- 
marko



Re: Spinlocks, yet again: analysis and proposed patches

From
"Michael Paesold"
Date:
I wrote:
> I'll do tomorrow morning (CEST, i.e. in about 11 hours).

These are the tests with the change:
> if ((--spins % MAX_SPINS_PER_DELAY) == 0)
>
> to
>
> if (--spins == 0)

I have called the resulting patch (spin-delay + this change) spin-delay-2.

again with only slock-no-cmpb applied
1: 55s   4: 119s (cpu ~97%)

with only spin-delay-2 applied
1: 56s   4: 125s (cpu ~99.5%)

with slock-no-cmpb and spin-delay-2 applied
1: 56s   4: 125s (cpu ~100%)

(Note that the cpu averages are not calculated but estimated by looking at 
vmstat.)

Yesterdays results:
> CVS tip from 2005-09-12 ~16:00
> 1: 57s   2: 82s   4: 124s   8: 237s
>
> with only slock-no-cmpb.patch applied
> 1: 55s   2: 79s   4: 119s   8: 229s
>
> with only spin-delay.patch applied
> 1: 56s   2: 79s   4: 124s   8: 235s
>
> with both patches applied
> 1: 55s   2: 78s   4: 124s   8: 235s


To have other data, I have retested the patches on a single-cpu Intel P4 
3GHz w/ HT (i.e. 2 virtual cpus), no EM64T. Comparing to the 2.4 GHz dual-Xeon 
results it's clear that this is in reality only one cpu. While the runtime 
for N=1 is better than on the other system, for N=4 it's already worse. The 
situation with the patches is quite different, though. Unfortunately.

CVS tip from 2005-09-12:
1: 36s   2: 77s (cpu ~85%)    4: 159s (cpu ~98%)

only slock-no-cmpb:
1: 36s   2: 81s (cpu ~79%)    4: 177s (cpu ~94%)
(doesn't help this time)

only spin-delay:
1: 36s   2: 86s (cpu =100%)   4: 157s (cpu =100%)
(bad runtime for N=2 (repeatable), cpu not doing real work here?)

slock-no-cmpb and spin-delay:
1: 36s   2: 106s (cpu =100%)  4: 192s (cpu =100%)
(it gets worse)

only spin-delay-2:
1: 36s   2: 85s (cpu =100%)   4: 160s (cpu =100%)
(quite the same as spin-delay)

slock-no-cmpb and spin-delay-2:
1: 36s   2: 109s (cpu =100%)  4: 198s (cpu =100%)
(worse again)

CS rate was low (20-50) for all tests, increasing for N>2, which is to be 
expected. For this system I see no case for applying these patches.

Is there a portable way to detect the CPU we are running on? Do you have any 
other idea how to implement the delays?

Best Regards,
Michael Paesold 



Re: Spinlocks, yet again: analysis and proposed patches

From
Greg Stark
Date:
Marko Kreen <marko@l-t.ee> writes:

> > (I speculate that it's set up to only yield the processor to other
> > processes already affiliated to that processor.  In any case, it
> > is definitely capable of getting through 10000 yields without
> > running the guy who's holding the spinlock.)

Maybe it should try sched_yield once and then use select after that?

-- 
greg



Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
Marko Kreen <marko@l-t.ee> writes:
> On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
>> However, given that we are only expecting
>> the spinlock to be held for a couple dozen instructions, using the
>> kernel futex mechanism is huge overkill --- the in-kernel overhead
>> to manage the futex state is almost certainly several orders of
>> magnitude more than the delay we actually want.

> Why do you think so?  AFAIK in the uncontended case there will be no
> kernel access, only an atomic inc/dec.

In the uncontended case, we never even enter s_lock() and so the entire
mechanism of yielding is irrelevant.  The problem that's being exposed
by these test cases is that on multiprocessors, you can see a
significant rate of spinlock contention (order of 100 events/second,
which is still a tiny fraction of the number of TAS calls) and our
existing mechanism for dealing with contention is just not efficient
enough.

> In the contended case you'll want a task switch anyway, so the futex
> managing should not matter.

No, we DON'T want a task switch.  That's the entire point: in a
multiprocessor, it's a good bet that the spinlock is held by a task
running on another processor, and doing a task switch will take orders
of magnitude longer than just spinning until the lock is released.
You should yield only after spinning long enough to make it a strong
probability that the spinlock is held by a process that's lost the
CPU and needs to be rescheduled.

> If you don't want Linux-specific locking in core code, then
> it's another matter...

Well, it's true, we don't particularly want a one-platform solution,
but if it did what we wanted we might hold our noses and use it anyway.

(I think, BTW, that using futexes at the spinlock level is misguided;
what would be interesting would be to see if we could substitute for
both LWLock and spinlock logic with one futex-based module.)

>> I also saw fairly frequent "stuck spinlock" panics when running
>> more queries than there were processors --- this despite increasing
>> NUM_DELAYS to 10000 in s_lock.c.  So I don't trust sched_yield
>> anymore.  Whatever it's doing in Linux 2.6 isn't what you'd expect.
>> (I speculate that it's set up to only yield the processor to other
>> processes already affiliated to that processor.  In any case, it
>> is definitely capable of getting through 10000 yields without
>> running the guy who's holding the spinlock.)

> This is intended behaviour of sched_yield.

> http://lwn.net/Articles/31462/
> http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2

No; that page still says specifically "So a process calling
sched_yield() now must wait until all other runnable processes in the
system have used up their time slices before it will get the processor
again."  I can prove that that is NOT what happens, at least not on
a multi-CPU Opteron with current FC4 kernel.  However, if the newer
kernels penalize a process calling sched_yield as heavily as this page
claims, then it's not what we want anyway ...
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Greg Stark
Date:
Marko Kreen <marko@l-t.ee> writes:

> On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
> > The second reason that the futex patch is helping is that when
> > a spinlock delay does occur, it allows the delaying process to be
> > awoken almost immediately, rather than delaying 10 msec or more
> > as the existing code does.  However, given that we are only expecting
> > the spinlock to be held for a couple dozen instructions, using the
> > kernel futex mechanism is huge overkill --- the in-kernel overhead
> > to manage the futex state is almost certainly several orders of
> > magnitude more than the delay we actually want.
> 
> Why do you think so?  AFAIK in the uncontended case there will be no
> kernel access, only an atomic inc/dec.  In the contended case you'll
> want a task switch anyway, so the futex managing should not
> matter.  Also, this mechanism is specifically optimized for
> inter-process locking; I don't think you can get a more efficient
> mechanism out of the side-effects of generic syscalls.
> 
> If you don't want Linux-specific locking in core code, then
> it's another matter...

How does the futex using code compare with the new patches on this new
benchmark test?

-- 
greg



Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> Marko Kreen <marko@l-t.ee> writes:
> (I speculate that it's set up to only yield the processor to other
> processes already affiliated to that processor.  In any case, it
> is definitely capable of getting through 10000 yields without
> running the guy who's holding the spinlock.)

> Maybe it should try sched_yield once and then use select after that?

I tried that, actually, but it didn't seem to offer any particular
benefit.  (Maybe it would have helped more on older Linux kernels before
they changed sched_yield?)

I'm feeling even more disenchanted with sched_yield now that Marko
pointed out that the behavior was changed recently.  Here we have a
construct that is not portable cross-platform, does not act as
documented in its man page, and the kernel guys feel free to whack its
behavior around in major ways without documenting that either.  It seems
to be a crap-shoot whether it will be useful on any particular machine
or not.  At least with the select() code we can be reasonably confident
we know what will happen.
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> > In the contended case you'll want a task switch anyway, so the futex
> > managing should not matter.
> 
> No, we DON'T want a task switch.  That's the entire point: in a
> multiprocessor, it's a good bet that the spinlock is held by a task
> running on another processor, and doing a task switch will take orders
> of magnitude longer than just spinning until the lock is released.
> You should yield only after spinning long enough to make it a strong
> probability that the spinlock is held by a process that's lost the
> CPU and needs to be rescheduled.

Does the futex code make any attempt to record the CPU of the process grabbing
the lock? Clearly it wouldn't be a guarantee of anything, but if it's only used
for short-lived spinlocks while acquiring longer-lived locks, then maybe?

> No; that page still says specifically "So a process calling
> sched_yield() now must wait until all other runnable processes in the
> system have used up their time slices before it will get the processor
> again."  I can prove that that is NOT what happens, at least not on
> a multi-CPU Opteron with current FC4 kernel.  However, if the newer
> kernels penalize a process calling sched_yield as heavily as this page
> claims, then it's not what we want anyway ...

Well it would be no worse than select or any other random i/o syscall.

It seems to me what you've found is an outright bug in the linux scheduler.
Perhaps posting it to linux-kernel would be worthwhile.

-- 
greg



Re: Spinlocks, yet again: analysis and proposed patches

From
Josh Berkus
Date:
Tom, All:

> It seems to me what you've found is an outright bug in the linux scheduler.
> Perhaps posting it to linux-kernel would be worthwhile.

For people using this on Linux 2.6, which scheduler are you using?   Deadline 
is the recommended one for databases, and does offer significant (+5-8%) 
benefits in some heavy-contention environments, at least in OSDL tests.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: Spinlocks, yet again: analysis and proposed patches

From
Douglas McNaught
Date:
Josh Berkus <josh@agliodbs.com> writes:

> Tom, All:
>
>> It seems to me what you've found is an outright bug in the linux scheduler.
>> Perhaps posting it to linux-kernel would be worthwhile.
>
> For people using this on Linux 2.6, which scheduler are you using?   Deadline 
> is the recommended one for databases, and does offer significant (+5-8%) 
> benefits in some heavy-contention environments, at least in OSDL tests.

I thought 'deadline' was an I/O scheduler, not a CPU scheduler?

-Doug


Re: Spinlocks, yet again: analysis and proposed patches

From
Douglas McNaught
Date:
Greg Stark <gsstark@mit.edu> writes:

> Tom Lane <tgl@sss.pgh.pa.us> writes:
>
>> No; that page still says specifically "So a process calling
>> sched_yield() now must wait until all other runnable processes in the
>> system have used up their time slices before it will get the processor
>> again."  I can prove that that is NOT what happens, at least not on
>> a multi-CPU Opteron with current FC4 kernel.  However, if the newer
>> kernels penalize a process calling sched_yield as heavily as this page
>> claims, then it's not what we want anyway ...
>
> Well it would be no worse than select or any other random i/o syscall.
>
> It seems to me what you've found is an outright bug in the linux scheduler.
> Perhaps posting it to linux-kernel would be worthwhile.

People have complained on l-k several times about the 2.6
sched_yield() behavior; the response has basically been "if you rely
on any particular sched_yield() behavior for synchronization, your app
is broken--it's not a synchronization primitive."

-Doug


Re: Spinlocks, yet again: analysis and proposed patches

From
Greg Stark
Date:
Douglas McNaught <doug@mcnaught.org> writes:

> > It seems to me what you've found is an outright bug in the linux scheduler.
> > Perhaps posting it to linux-kernel would be worthwhile.
> 
> People have complained on l-k several times about the 2.6
> sched_yield() behavior; the response has basically been "if you rely
> on any particular sched_yield() behavior for synchronization, your app
> is broken--it's not a synchronization primitive."

They're not talking about this. They're talking about applications that spin
on sched_yield() and expect it to reduce cpu load as if the process were
calling sleep().

What Tom found was that some processes are never scheduled when sched_yield is
called. There's no reason that should be happening.

-- 
greg



Re: Spinlocks, yet again: analysis and proposed patches

From
Douglas McNaught
Date:
Greg Stark <gsstark@mit.edu> writes:

> What Tom found was that some processes are never scheduled when sched_yield is
> called. There's no reason that should be happening.

Yeah, that would probably be a bug...

-Doug


Re: Spinlocks, yet again: analysis and proposed patches

From
Mark Wong
Date:
On Tue, 13 Sep 2005 12:21:45 -0400
Douglas McNaught <doug@mcnaught.org> wrote:

> Josh Berkus <josh@agliodbs.com> writes:
> 
> > Tom, All:
> >
> >> It seems to me what you've found is an outright bug in the linux scheduler.
> >> Perhaps posting it to linux-kernel would be worthwhile.
> >
> > For people using this on Linux 2.6, which scheduler are you using?   Deadline 
> > is the recommended one for databases, and does offer significant (+5-8%) 
> > benefits in some heavy-contention environments, at least in OSDL tests.
> 
> I thought 'deadline' was an I/O scheduler, not a CPU scheduler?

Yes, that's correct.  That's an i/o elevator algorithm.

Mark


Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
Douglas McNaught <doug@mcnaught.org> writes:
> Greg Stark <gsstark@mit.edu> writes:
>> What Tom found was that some processes are never scheduled when sched_yield is
>> called. There's no reason that should be happening.

> Yeah, that would probably be a bug...

I suspect the kernel hackers might consider it deliberate in a
NUMA-aware kernel.
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I'm feeling even more disenchanted with sched_yield now that Marko
> pointed out that the behavior was changed recently.  Here we have a

To be fair, I'm not entirely sure 'recently' is quite the right word.
It sounds like it changed during the 2.5 development cycle, which was
between major kernel releases, and it looks like it was around Dec 2003.
Thanks,
    Stephen

Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
"Michael Paesold" <mpaesold@gmx.at> writes:
> To have other data, I have retested the patches on a single-cpu Intel P4 
> 3GHz w/ HT (i.e. 2 virtual cpus), no EM64T. Comparing to the 2.4 GHz dual-Xeon 
> results it's clear that this is in reality only one cpu. While the runtime 
> for N=1 is better than on the other system, for N=4 it's already worse. The 
> situation with the patches is quite different, though. Unfortunately.

> CVS tip from 2005-09-12:
> 1: 36s   2: 77s (cpu ~85%)    4: 159s (cpu ~98%)

> only slock-no-cmpb:
> 1: 36s   2: 81s (cpu ~79%)    4: 177s (cpu ~94%)
> (doesn't help this time)

Hm.  This is the first configuration we've seen in which slock-no-cmpb
was a loss.  Could you double-check that result?

> Is there a portable way to detect the CPU we are running on? Do you have any 
> other idea how to implement the delays?

I can't see any reasonable way to do runtime switching of the cmpb test
--- whatever logic we put in to control it would cost as much or more
than the cmpb anyway :-(.  I think that has to be a compile-time choice.
From my perspective it'd be acceptable to remove the cmpb only for
x86_64, since only there does it seem to be a really significant win.
On the other hand it seems that removing the cmpb is a net win on most
x86 setups too, so maybe we should just do it and accept that there are
some cases where it's not perfect.  Looking back at the original
discussion, I see that the cmpb was added with little discussion and no
specific performance testing, so it shouldn't be presumed to have a lot
of weight behind it:
http://archives.postgresql.org/pgsql-patches/2003-12/msg00352.php
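
If we did keep the cmpb for 32-bit x86 only, the compile-time
conditional could look roughly like this (just a sketch, assuming gcc's
__x86_64__ define as the switch; the attached patch simply removes the
test for both):

    __asm__ __volatile__(
    #ifndef __x86_64__
        /* retain the non-locking test only on 32-bit x86 */
        "    cmpb    $0,%1    \n"
        "    jne        1f    \n"
    #endif
        "    lock             \n"
        "    xchgb    %0,%1   \n"
    #ifndef __x86_64__
        "1: \n"
    #endif
    :        "+q"(_res), "+m"(*lock)
    :
    :        "memory", "cc");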

The other thing that's going on in the patch is dynamic adjustment of
spins_per_delay, and that could certainly be made much smarter than it
is now.  For instance, it'd cost almost nothing to make the lower and
upper bounds for spins_per_delay be variables instead of constants.
We could set the bounds during postmaster startup based on some criteria
yet to be determined.
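
For instance (purely a hypothetical sketch, not part of the attached
patches; the names are illustrative only):

    /* replace the MIN_/MAX_SPINS_PER_DELAY #defines with variables */
    static int  min_spins_per_delay = 10;
    static int  max_spins_per_delay = 1000;

    /* called once during postmaster startup; criteria to be determined */
    void
    set_spins_per_delay_bounds(int min_spins, int max_spins)
    {
        min_spins_per_delay = min_spins;
        max_spins_per_delay = max_spins;
    }
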
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Marko Kreen
Date:
On Tue, Sep 13, 2005 at 10:10:13AM -0400, Tom Lane wrote:
> Marko Kreen <marko@l-t.ee> writes:
> > On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote:
> >> However, given that we are only expecting
> >> the spinlock to be held for a couple dozen instructions, using the
> >> kernel futex mechanism is huge overkill --- the in-kernel overhead
> >> to manage the futex state is almost certainly several orders of
> >> magnitude more than the delay we actually want.
> 
> > Why do you think so?  AFAIK in the uncontended case there will be no
> > kernel access, only an atomic inc/dec.
> 
> In the uncontended case, we never even enter s_lock() and so the entire
> mechanism of yielding is irrelevant.  The problem that's being exposed
> by these test cases is that on multiprocessors, you can see a
> significant rate of spinlock contention (order of 100 events/second,
> which is still a tiny fraction of the number of TAS calls) and our
> existing mechanism for dealing with contention is just not efficient
> enough.
> 
> > In the contended case you'll want a task switch anyway, so the futex
> > managing should not matter.
> 
> No, we DON'T want a task switch.  That's the entire point: in a
> multiprocessor, it's a good bet that the spinlock is held by a task
> running on another processor, and doing a task switch will take orders
> of magnitude longer than just spinning until the lock is released.
> You should yield only after spinning long enough to make it a strong
> probability that the spinlock is held by a process that's lost the
> CPU and needs to be rescheduled.
         
Hmm.  I guess this could be separated into 2 cases:

1. Light load - both lock owner and lock requester won't get
   scheduled while busy (owner in critical section, waiter
   spinning).
2. Big load - either or both of them gets scheduled while busy
   (waiter is scheduled by the OS or voluntarily, e.g. by calling
   select()).

So my impression is that currently you optimize for #1 at the
expense of #2, while with futexes you'd optimize for #2 at
the expense of #1.  Additionally, I'm pretty convinced that
futexes give you the most efficient implementation for #2, as the
kernel knows which processes are waiting on a particular lock and
so can make the best scheduling decisions.

> > If you don't want Linux-specific locking in core code, then
> > it's another matter...
> 
> Well, it's true, we don't particularly want a one-platform solution,
> but if it did what we wanted we might hold our noses and use it anyway.
> 
> (I think, BTW, that using futexes at the spinlock level is misguided;
> what would be interesting would be to see if we could substitute for
> both LWLock and spinlock logic with one futex-based module.)

Use pthreads ;)

> >> I also saw fairly frequent "stuck spinlock" panics when running
> >> more queries than there were processors --- this despite increasing
> >> NUM_DELAYS to 10000 in s_lock.c.  So I don't trust sched_yield
> >> anymore.  Whatever it's doing in Linux 2.6 isn't what you'd expect.
> >> (I speculate that it's set up to only yield the processor to other
> >> processes already affiliated to that processor.  In any case, it
> >> is definitely capable of getting through 10000 yields without
> >> running the guy who's holding the spinlock.)
> 
> > This is intended behaviour of sched_yield.
> 
> > http://lwn.net/Articles/31462/
> > http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2
> 
> No; that page still says specifically "So a process calling
> sched_yield() now must wait until all other runnable processes in the
> system have used up their time slices before it will get the processor
> again."  I can prove that that is NOT what happens, at least not on
> a multi-CPU Opteron with current FC4 kernel.  However, if the newer
> kernels penalize a process calling sched_yield as heavily as this page
> claims, then it's not what we want anyway ...

My fault.  As I saw that there was a problem with sched_yield, I
said "I bet this is because of the behaviour change" and only
skimmed the exact details.  But the point that sched_yield is
not meant for such usage still stands.

About fast yielding, comment on sys_sched_yield() says:
 * sys_sched_yield - yield the current processor to other threads.
 *
 * this function yields the current CPU by moving the calling thread
 * to the expired array. If there are no other threads running on this
 * CPU then this function will return.

So there just is nothing else to schedule on that CPU.

-- 
marko



Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
Marko Kreen <marko@l-t.ee> writes:
> Hmm.  I guess this could be separated into 2 cases:

> 1. Light load - both lock owner and lock requester wont get
>    scheduled while busy (owner in critical section, waiter
>    spinning.)
> 2. Big load - either or both of them gets scheduled while busy.
>    (waiter is scheduled by OS or voluntarily by eg. calling select())

Don't forget that the coding rules for our spinlocks say that you
mustn't hold any such lock for more than a couple dozen instructions,
and certainly any kernel call while holding the lock is Right Out.
There is *no* case where the holder of a spinlock is going to
voluntarily give up the CPU.  The design intention was that the
odds of losing the CPU while holding a spinlock would be negligibly
small, simply because we don't hold it very long.

> About fast yielding, comment on sys_sched_yield() says:
>  * sys_sched_yield - yield the current processor to other threads.
>  *
>  * this function yields the current CPU by moving the calling thread
>  * to the expired array. If there are no other threads running on this
>  * CPU then this function will return.

Mph.  So that's pretty much exactly what I suspected...


I just had a thought: it seems that the reason we are seeing a
significant issue here is that on SMP machines, the cost of trading
exclusively-owned cache lines back and forth between processors is
so high that the TAS instructions (specifically the xchgb, in the x86
cases) represent a significant fraction of backend execution time all
by themselves.  (We know this is the case due to oprofile results,
see discussions from last April.)  What that means is that there's a
fair chance of a process losing its timeslice immediately after the
xchgb.  Which is precisely the scenario we do not want, if the process
successfully acquired the spinlock by means of the xchgb.

We could ameliorate this if there were a way to acquire ownership of the
cache line without necessarily winning the spinlock.  I'm imagining
that we insert a "dummy" locked instruction just ahead of the xchgb,
which touches the spinlock in such a way as to not change its state.
(xchgb won't do for this, but maybe one of the other lockable
instructions will.)  We do the xchgb just after this one.  The idea is
that if we don't own the cache line, the first instruction causes it to
be faulted into the processor's cache, and if our timeslice expires
while that is happening, we lose the processor without having acquired
the spinlock.  This assumes that once we've got the cache line, the
xchgb that actually does the work can get executed with not much
extra time spent and only low probability of someone else stealing the
cache line back first.

The fact that cmpb isn't helping proves that getting the cache line in a
read-only fashion does *not* do enough to protect the xchgb in this way.
But maybe another locking instruction would.  Comments?
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
I wrote:
> We could ameliorate this if there were a way to acquire ownership of the
> cache line without necessarily winning the spinlock.

Another thought came to mind: maybe the current data layout for LWLocks
is bad.  Right now, the spinlock that protects each LWLock data struct
is itself part of the struct, and since the structs aren't large (circa
20 bytes), the whole thing is usually all in the same cache line.  This
had seemed like a good idea at the time, on the theory that once you'd
obtained the spinlock you'd also have pulled in the LWLock contents.
But that's only optimal under an assumption of low contention.  If
there's high contention for the spinlock, then another processor
spinning on the lock will be continuously taking away ownership of the
cache line and thus slowing down the guy who's got the lock and is
trying to examine/update the LWLock contents.

Maybe it'd be better to allocate the spinlocks off by themselves.
Then, spinning processors would not affect the processor that's updating
the LWLock; only when it finishes doing that and needs to clear the
spinlock will it have to contend with the spinners for the cache line
containing the spinlock.

This would add an instruction or so to LWLockAcquire and LWLockRelease,
and would be of no benefit on uniprocessors, but it might be worth doing
for multiprocessors.  Another patch to test ...

I'm starting to think that we might have to succumb to having a compile
option "optimize for multiprocessor" or "optimize for single processor".
It's pretty hard to see how we'd alter a data structure decision like
this on the fly.
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
"Min Xu (Hsu)"
Date:
On Tue, 13 Sep 2005 Tom Lane wrote :
> I wrote:
> > We could ameliorate this if there were a way to acquire ownership of the
> > cache line without necessarily winning the spinlock.
> 
> Another thought came to mind: maybe the current data layout for LWLocks
> is bad.  Right now, the spinlock that protects each LWLock data struct
> is itself part of the struct, and since the structs aren't large (circa
> 20 bytes), the whole thing is usually all in the same cache line.  This
> had seemed like a good idea at the time, on the theory that once you'd
> obtained the spinlock you'd also have pulled in the LWLock contents.
> But that's only optimal under an assumption of low contention.  If
> there's high contention for the spinlock, then another processor
> spinning on the lock will be continuously taking away ownership of the
> cache line and thus slowing down the guy who's got the lock and is
> trying to examine/update the LWLock contents.

If this were the case, perhaps first fetching the spin lock with read-only
permission should have helped. Modern processors have store buffers so that
a store miss does not slow down the processor. Therefore, if processor
A has the spin lock and it is examining and updating the LWLock on the
same cache line, as long as processor B doesn't try to write the
same cache line, processor A won't be slowed down. What would happen
is that multiple writes coalesce in the write buffer and get flushed
to memory when processor A regains write permission.

I still think your first scenario is possible. That's when, just after
processor B finally gets the lock, its time slice runs out. This sounds to
me likely to cause a chain of context switches.

> Maybe it'd be better to allocate the spinlocks off by themselves.
> Then, spinning processors would not affect the processor that's updating
> the LWLock; only when it finishes doing that and needs to clear the
> spinlock will it have to contend with the spinners for the cache line
> containing the spinlock.
> 
> This would add an instruction or so to LWLockAcquire and LWLockRelease,
> and would be of no benefit on uniprocessors, but it might be worth doing
> for multiprocessors.  Another patch to test ...
> 
> I'm starting to think that we might have to succumb to having a compile
> option "optimize for multiprocessor" or "optimize for single processor".
> It's pretty hard to see how we'd alter a data structure decision like
> this on the fly.
> 
>             regards, tom lane
> 


Re: Spinlocks, yet again: analysis and proposed patches

From
Stephen Frost
Date:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I'm starting to think that we might have to succumb to having a compile
> option "optimize for multiprocessor" or "optimize for single processor".
> It's pretty hard to see how we'd alter a data structure decision like
> this on the fly.

I'd really hate to see this happen.  In this case I don't think the
change you're proposing would have much of an impact on a uniprocessor
machine.  Having separate compile-time options for uniprocessor and
multiprocessor would open the gates for potentially other changes which
*would* have a more serious impact on one or the other when compiled for
the opposite.  I think this would be a serious problem for binary
distributions and correspondingly their users.

Happy to test these changes, btw.

Also, I'm redoing my tests from the other patches with all the various
combinations; I'm a bit concerned that the recompiles I did (which were
just 'make', not 'make clean', 'make') for them previously didn't
actually recompile everything necessary.  Sorry if the results differ.
Hope to have that all done this evening.
Thanks,
    Stephen

Re: Spinlocks, yet again: analysis and proposed patches

From
Gavin Sherry
Date:
On Tue, 13 Sep 2005, Stephen Frost wrote:

> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > I'm starting to think that we might have to succumb to having a compile
> > option "optimize for multiprocessor" or "optimize for single processor".
> > It's pretty hard to see how we'd alter a data structure decision like
> > this on the fly.
>
> I'd really hate to see this happen.  In this case I don't think the
> change you're proposing would have much of an impact on a uniprocessor
> machine.  Having separate compile-time options for uniprocessor and
> multiprocessor would open the gates for potentially other changes which
> *would* have a more serious impact on one or the other when compiled for
> the opposite.  I think this would be a serious problem for binary
> distributions and correspondingly their users.

It does make it painful for distribution/package maintainers but I think
the potential benefits for single/multi-CPU architectures are high. It
means that our lock intrinsic on uniprocessors can just be a lock/delay
loop without any spinning -- which is a complete waste of time if you only
have one CPU.

Doing this at runtime involves some pretty significant uglification of
low level code I think.

Gavin


Re: Spinlocks, yet again: analysis and proposed patches

From
Stephen Frost
Date:
* Gavin Sherry (swm@linuxworld.com.au) wrote:
> It does make it painful for distribution/package maintainers but I think
> the potential benefits for single/multi-CPU architectures are high. It
> means that our lock intrinsic on uniprocessors can just be a lock/delay
> loop without any spinning -- which is a complete waste of time if you only
> have one CPU.
>
> Doing this at runtime involves some pretty significant uglification of
> low level code I think.

I suspect distributors would go for the multi-cpu setup (especially if
a uniprocessor build is *broken* for multiprocessor) and then in a
lot of cases you end up not actually getting any benefit.  I'm afraid
you'd also end up having to tell a lot of people who complain to
recompile, who will then complain back to their distributors, etc.

I suppose another option would be to provide separate packages...  Could
this be done as a shared library so it's more 'plug-and-play' to switch
between the two?  I dunno, just trying to think about how to deal with
this without making it suck terribly bad for me (as a Debian user and
maintainer).

Certainly makes me wish there was a way to do it which made the kernel
handle the uniprocessor/multiprocessor question and did locking as
appropriate for those.  Ah, well.
Thanks,
    Stephen

Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
"Min Xu (Hsu)" <xu@cs.wisc.edu> writes:
> ...If this were the case, perhaps first fetch the spin lock with read-only
> permission should have helped.

But the cmpb instruction in the 8.0 version of TAS would have done that,
and I think we've already established that the cmpb is a loss on most
machines (except maybe single-physical-CPU Xeons).  I suggested in my
other message that it might help to grab write permission on the cache
line before actually trying to acquire the spin lock --- but I don't see
how getting read-only permission first will help.
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> I suspect distributors would go for the multi-cpu setup (especially if
> a uniprocessor build is *broken* for multiprocessor) and then in a
> lot of cases you end up not actually getting any benefit.  I'm afraid
> you'd also end up having to tell alot of people who complain to
> recompile, who will then complain back to their distributors, etc.

Yeah.  Being in charge of Red Hat's packaging of PG, I feel that pain as
keenly as anybody ... and I *know* RH will not be interested in shipping
two different packages.  If we go this way, the RH distributions will
use the --optimize-multi switch, because that's where the money is.

The bottom line here is that we will have to make some compromises:
if we want one-size-fits-all code, it will not be optimal for every
single architecture.  If we don't do one-size-fits-all, then we will
pay for it in various other ways.
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
"Min Xu (Hsu)"
Date:
On Tue, 13 Sep 2005 Tom Lane wrote :
> "Min Xu (Hsu)" <xu@cs.wisc.edu> writes:
> > ...If this were the case, perhaps first fetch the spin lock with read-only
> > permission should have helped.
> 
> But the cmpb instruction in the 8.0 version of TAS would have done that,
> and I think we've already established that the cmpb is a loss on most
> machines (except maybe single-physical-CPU Xeons).  I suggested in my
> other message that it might help to grab write permission on the cache
> line before actually trying to acquire the spin lock --- but I don't see
> how getting read-only permission first will help.

Yes, I agree. What I was trying to say was that if the second scenario
you hypothesize were true, i.e. fetching a write-able line before actually
trying to acquire the spin lock would cause another processor to slow
down its execution inside the critical section, then fetching read-only
lines would have helped. As you said, however, the experimental results
show that fetching read-only lines didn't help, which led me to wonder whether
the second scenario you described was really happening.



Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
"Min Xu (Hsu)" <xu@cs.wisc.edu> writes:
> ... As you said, however, the experimental results
> show that fetching read-only lines didn't help, which led me to wonder whether
> the second scenario you described was really happening.

I don't know --- we haven't tried it.  I do intend to work up some
patches for these ideas, and then we can test them.

Or, if you happen to be wide awake at the moment, feel free to take a
shot at it ... I'm out of steam and won't get to it before tomorrow...
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
"Min Xu (Hsu)"
Date:
On Tue, 13 Sep 2005 Tom Lane wrote :
> "Min Xu (Hsu)" <xu@cs.wisc.edu> writes:
> > ... As you said, however, the experimental results
> > show that fetching read-only lines didn't help, which led me to wonder whether
> > the second scenario you described was really happening.
> 
> I don't know --- we haven't tried it.  I do intend to work up some
> patches for these ideas, and then we can test them.

I see. I look forward to it.

> Or, if you happen to be wide awake at the moment, feel free to take a
> shot at it ... I'm out of steam and won't get to it before tomorrow...

I am a newbie here. Please don't let my comments discourage you from
doing anything. I enjoyed very much following your discussion and analysis
on this problem.


Re: Spinlocks, yet again: analysis and proposed patches

From
"Michael Paesold"
Date:
Tom Lane wrote:

> But the cmpb instruction in the 8.0 version of TAS would have done that,
> and I think we've already established that the cmpb is a loss on most
> machines (except maybe single-physical-CPU Xeons).

Note that this was a regular Pentium 4 system, not a Xeon.

Best Regards,
Michael Paesold


Re: Spinlocks, yet again: analysis and proposed patches

From
"Dave Page"
Date:

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
> Sent: 13 September 2005 23:03
> To: Marko Kreen
> Cc: pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Spinlocks, yet again: analysis and
> proposed patches
>
> I'm starting to think that we might have to succumb to having
> a compile
> option "optimize for multiprocessor" or "optimize for single
> processor".
> It's pretty hard to see how we'd alter a data structure decision like
> this on the fly.

That would be a major pita for packagers of binary distributions.

Regards, Dave.


Re: Spinlocks, yet again: analysis and proposed patches

From
Greg Stark
Date:
Stephen Frost <sfrost@snowman.net> writes:

> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > I'm starting to think that we might have to succumb to having a compile
> > option "optimize for multiprocessor" or "optimize for single processor".
> > It's pretty hard to see how we'd alter a data structure decision like
> > this on the fly.
> 
> I'd really hate to see this happen.  

I suspect you're thinking too hard about this. It's not like some LWLocks
would need SMP spinlocks and others uniprocessor locks. And it's not like it's
going to change dynamically.

Wouldn't it just be a couple:

if (multiprocessor) {
    ... complex version ...
} else {
    ... simple version ...
}


Ugly to be sure but it's not going to spread to other areas of the source base
and you know the if statement isn't going to be growing any more arms. 

It's no uglier than doing the same thing with an #ifdef anyways.
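
A rough sketch of the shape of that branch (the flag and wrapper are
invented here for illustration; TAS() is the existing s_lock.h macro,
and the flag would be set once at postmaster startup and inherited by
backends via fork()):

    static bool running_on_multiprocessor = true;

    static void
    acquire_spinlock(volatile slock_t *lock)
    {
        if (running_on_multiprocessor)
        {
            /* complex version: spin a while first, since the holder is
             * probably running on another CPU and will release soon */
            int     spins = 0;

            while (TAS(lock))
            {
                if (++spins > 100)
                {
                    pg_usleep(1000);    /* back off after spinning */
                    spins = 0;
                }
            }
        }
        else
        {
            /* simple version: spinning is pointless with one CPU,
             * so yield immediately on contention */
            while (TAS(lock))
                pg_usleep(1000);
        }
    }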



If you really really want to have two versions and you think using function
pointers would impose too big a performance penalty you could play linker
games. But that way lies madness.


-- 
greg



Re: Spinlocks, yet again: analysis and proposed patches

From
"Michael Paesold"
Date:
Tom Lane wrote:
> "Michael Paesold" <mpaesold@gmx.at> writes:
>> To have other data, I have retested the patches on a single-cpu Intel P4
>> 3GHz w/ HT (i.e. 2 virtual cpus), no EM64T. Comparing to the 2,4
>> dual-Xeon results it's clear that this is in reality only one cpu.
>> While the runtime for N=1 is better than the other system, for N=4
>> it's already worse. The situation with the patches is quite different,
>> though. Unfortunately.
>
>> CVS tip from 2005-09-12:
>> 1: 36s   2: 77s (cpu ~85%)    4: 159s (cpu ~98%)
>
>> only slock-no-cmpb:
>> 1: 36s   2: 81s (cpu ~79%)    4: 177s (cpu ~94%)
>> (doesn't help this time)
>
> Hm.  This is the first configuration we've seen in which slock-no-cmpb
> was a loss.  Could you double-check that result?

The first tests were compiled with 
CFLAGS='-O2 -mcpu=pentium4 -march=pentium4'. I had redone the tests with 
just CFLAGS='-O2' yesterday. The difference was just about a second, but the 
result from the patch was the same. The results for N=4 and N=8 show the 
positive effect more clearly.

configure: CFLAGS='-O2' --enable-casserts
On RHEL 4.1, gcc (GCC) 3.4.3 20050227 (Red Hat 3.4.3-22.1)

CVS tip from 2005-09-12:
1: 37s   2: 78s      4: 159s       8: 324

only slock-no-cmpb:
1: 37s   2: 82s (5%) 4: 178s (12%) 8: 362 (12%)

configure:  --enable-casserts

(Btw. I have always done "make clean ; make ; make install" between tests)

Best Regards,
Michael Paesold

> I can't see any reasonable way to do runtime switching of the cmpb test
> --- whatever logic we put in to control it would cost as much or more
> than the cmpb anyway :-(.  I think that has to be a compile-time choice.
> From my perspective it'd be acceptable to remove the cmpb only for
> x86_64, since only there does it seem to be a really significant win.
> On the other hand it seems that removing the cmpb is a net win on most
> x86 setups too, so maybe we should just do it and accept that there are
> some cases where it's not perfect.
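
A sketch of how such a compile-time choice might look in the x86 tas()
(illustration only; the conditional uses GCC's predefined __x86_64__
symbol and is not the committed code):

    static __inline__ int
    tas(volatile slock_t *lock)
    {
        register slock_t _res = 1;

        __asm__ __volatile__(
    #ifndef __x86_64__
            /* non-locking test: possibly still a win on 32-bit x86 */
            "    cmpb    $0,%1    \n"
            "    jne     1f       \n"
    #endif
            "    lock             \n"
            "    xchgb   %0,%1    \n"
            "1: \n"
    :        "+q"(_res), "+m"(*lock)
    :
    :        "memory", "cc");
        return (int) _res;
    }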

How many test cases do we have so far?
A summary of the effects without the cmpb instruction:

8-way Opteron:           better
Dual/HT Xeon w/o EM64T:  better
Dual/HT EM64T:           better for N<=cpus, worse for N>cpus (Stephen's)
HT P4 w/o EM64T:         worse (more so for N>cpus)

Have I missed other reports that did test the slock-no-cmpb.patch alone?
Two of the systems with positive effects are x86_64, one is an older
high-end Intel x86 chip. The negative effect is on a low-cost Pentium 4 with
only hyperthreading. According to the mentioned thread's title, the cmpb was
an optimization for hyperthreading, not regular multi-cpus.

We could have more data, especially on newer and high-end systems. Could 
some of you test the slock-no-cmpb.patch? You'll need an otherwise idle 
system to get repeatable results.

http://archives.postgresql.org/pgsql-hackers/2005-09/msg00565.php
http://archives.postgresql.org/pgsql-hackers/2005-09/msg00566.php

I have re-attached the relevant files from Tom's posts because in the 
archive it's not clear anymore what should go into which file. See 
instructions in the first messages above.

The patch applies to CVS tip with
patch -p1 < slock-no-cmpb.patch

Best Regards,
Michael Paesold


Re: Spinlocks, yet again: analysis and proposed patches

From
Martijn van Oosterhout
Date:
On Tue, Sep 13, 2005 at 08:14:47PM -0400, Stephen Frost wrote:
> I suppose another option would be to provide separate packages...  Could
> this be done as a shared library so it's more 'plug-and-play' to switch
> between the two?  I dunno, just trying to think about how to deal with
> this without making it suck terribly bad for me (as a Debian user and
> maintainer).

Note that the Linux kernel has played with moving spinlock code out of
line. Because there are so many call sites, it turned out that the memory
saved by moving the code out of line was a net benefit overall. An
unconditional call to a function can't be that expensive, surely.

However, it would *have* to be in the same compilation unit, no shared
libraries. ELF imposes some overhead on calls between compilation
units, and using function pointers won't save you (see Ulrich Drepper's
paper on it).

However, to make it flexible you would need a pointer to a function
pointer.  Fortunately these variables won't change often, so the function
pointer and the function itself should be in the cache if used often
enough.
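
As a sketch of that indirection (names invented for illustration; the
selector would run once at startup and the bodies are elided):

    typedef void (*spinlock_wait_fn) (volatile slock_t *lock);

    static void wait_smp(volatile slock_t *lock) { /* spin, then sleep */ }
    static void wait_up(volatile slock_t *lock)  { /* sleep immediately */ }

    static spinlock_wait_fn spinlock_wait = wait_smp;

    static void
    select_spinlock_wait(int ncpus)
    {
        spinlock_wait = (ncpus > 1) ? wait_smp : wait_up;
    }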

Finally, the kernel solves the problem by saying, if you compile for
uniprocessor, optimize the code away, since a lot of the issues don't
apply. My main concern is how you even detect the number of processors
in a portable way...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/

Re: Spinlocks, yet again: analysis and proposed patches

From
Stephen Frost
Date:
Tom, et al.,

Updated, with full recompiles between everything and the new
modification:

N, runtime:
Tip:                    1 31s   2 37s   4 86s   8 159s
no-cmpb:                1 32s   2 43s   4 83s   8 168s
spin:                   1 32s   2 51s   4 84s   8 160s
spin+mod:               1 32s   2 51s   4 89s   8 158s
spin+no-cmpb:           1 32s   2 51s   4 87s   8 163s
spin+mod+no-cmpb:       1 32s   2 50s   4 86s   8 161s

Unfortunately, the results don't seem to be terribly consistent between
runs anyway:

Run 2:
Tip:                    1 32s   2 43s   4 87s   8 160s
no-cmpb:                1 31s   2 47s   4 83s   8 167s
spin:                   1 32s   2 52s   4 88s   8 154s
spin+no-cmpb:           1 32s   2 51s   4 102s  8 166s
spin+mod:               1 32s   2 53s   4 85s   8 154s
spin+mod+no-cmpb:       1 32s   2 51s   4 91s   8 161s

Hope it helps,
Stephen

Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
I wrote:
> We could ameliorate this if there were a way to acquire ownership of the
> cache line without necessarily winning the spinlock.  I'm imagining
> that we insert a "dummy" locked instruction just ahead of the xchgb,
> which touches the spinlock in such a way as to not change its state.

I tried this, using this tas code:

static __inline__ int
tas(volatile slock_t *lock)
{
    register slock_t _res = 1;
    register slock_t _dummy = 0;

    /* Use a locking test before trying to take the spinlock */
    /* xchg implies a LOCK prefix, so no need to say LOCK for it */
    __asm__ __volatile__(
        "    lock             \n"
        "    xaddb    %2,%1   \n"
        "    xchgb    %0,%1   \n"
:        "+q"(_res), "+m"(*lock), "+q"(_dummy)
:
:        "memory", "cc");
    return (int) _res;
}

At least on Opteron, it's a loser.  The previous best results (with
slock-no-cmpb and spin-delay patches) were
    1 31s    2 42s    4 51s    8 100s
and with this instead of slock-no-cmpb,
    1 33s    2 45s    4 55s    8 104s

The xadd may indeed be helping in terms of protecting the xchg from
end-of-timeslice --- the rate of select() delays is really tiny, one
every few seconds, which is better than I saw before.  But the extra
cost of the extra locked operation isn't getting repaid overall.
Feel free to try it on other hardware, but it doesn't look promising.

BTW, I also determined that on that 4-way Opteron box, the integer
modulo idea doesn't make any difference --- that is, spin-delay and
what Michael called spin-delay-2 are the same speed.  I think I had
tried the modulo before adding the variable spin delay, and it did
help in that configuration; but most likely, it was just helping by
stretching out the amount of time spent looping before entering the
kernel.  So we can drop that idea too.
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
I wrote:
> Another thought came to mind: maybe the current data layout for LWLocks
> is bad.  Right now, the spinlock that protects each LWLock data struct
> is itself part of the struct, and since the structs aren't large (circa
> 20 bytes), the whole thing is usually all in the same cache line. ...
> Maybe it'd be better to allocate the spinlocks off by themselves.

Well, this is odd.  I made up a patch to do this (attached) and found
that it pretty much sucks.  Still the 4-way Opteron, previous best
(slock-no-cmpb and spin-delay-2):
    1 31s    2 42s    4 51s    8 100s
with lwlock-separate added:
    1 31s    2 52s    4 106s    8 213s

What I had expected to see was a penalty in the single-thread case (due
to instructions added to the LWLock functions), but there isn't any.
I did not expect to see a factor-of-2 penalty for four threads.

I guess what this means is that there's no real problem with losing the
cache line while manipulating the LWLock, which is what the patch was
intended to prevent.  Instead, we're paying for swapping two cache lines
(the spinlock and the LWLock) across processors instead of just one line.
But that should at worst be a 2x inflation of the time previously spent
in LWLockAcquire/Release, which is surely not yet all of the application
;-).  Why the heck is this so bad?  Should we expect that apparently
minor changes in shared data structures might be costing equivalently
huge penalties in SMP performance elsewhere?

Unfortunately I don't have root on the Opteron and can't run oprofile.
But I'd really like to see some oprofile stats from these two cases
so we can figure out what in the world is going on here.  Can anyone
help?

            regards, tom lane

*** src/backend/storage/lmgr/lwlock.c.orig    Sat Aug 20 19:26:24 2005
--- src/backend/storage/lmgr/lwlock.c    Wed Sep 14 11:50:09 2005
***************
*** 27,35 ****
  #include "storage/spin.h"


  typedef struct LWLock
  {
!     slock_t        mutex;            /* Protects LWLock and queue of PGPROCs */
      bool        releaseOK;        /* T if ok to release waiters */
      char        exclusive;        /* # of exclusive holders (0 or 1) */
      int            shared;            /* # of shared holders (0..MaxBackends) */
--- 27,44 ----
  #include "storage/spin.h"


+ /*
+  * Each LWLock has an associated spinlock, which we consider as locking the
+  * content of the LWLock struct as well as the associated queue of waiting
+  * PGPROCs.  To minimize cache-line contention on multiprocessors, the
+  * spinlocks are kept physically separate from their LWLocks --- we don't
+  * want a spinning processor to interfere with access to the LWLock for
+  * the processor holding the spinlock.
+  */
+
  typedef struct LWLock
  {
!     slock_t       *mutex;            /* Protects LWLock and queue of PGPROCs */
      bool        releaseOK;        /* T if ok to release waiters */
      char        exclusive;        /* # of exclusive holders (0 or 1) */
      int            shared;            /* # of shared holders (0..MaxBackends) */
***************
*** 39,44 ****
--- 48,65 ----
  } LWLock;

  /*
+  * We allocate CACHE_LINE_SIZE bytes for the fixed LWLocks' spinlocks.
+  * We assume that the fixed locks are few enough and heavily used enough
+  * that it's worth trying to put each one's spinlock in its own cache line.
+  * The dynamic locks are assumed to be many and less heavily hit, so they
+  * get just MAXALIGN space apiece.  (Packing spinlocks tightly sounds like a
+  * bad idea on machines where slock_t is just a byte ... even for lesser-used
+  * spinlocks, there would be enough locks per cache line to create a
+  * contention issue.)
+  */
+ #define CACHE_LINE_SIZE        64
+
+ /*
   * This points to the array of LWLocks in shared memory.  Backends inherit
   * the pointer by fork from the postmaster.  LWLockIds are indexes into
   * the array.
***************
*** 135,144 ****
      Size        size;
      int            numLocks = NumLWLocks();

!     /* Allocate the LWLocks plus space for shared allocation counter. */
      size = mul_size(numLocks, sizeof(LWLock));

!     size = add_size(size, 2 * sizeof(int));

      return size;
  }
--- 156,174 ----
      Size        size;
      int            numLocks = NumLWLocks();

!     /* Space for the array of LWLock structs. */
      size = mul_size(numLocks, sizeof(LWLock));

!     /* Space for dynamic allocation counter and MAXALIGN padding. */
!     size = add_size(size, 2 * sizeof(int) + MAXIMUM_ALIGNOF);
!
!     /* Space for the spinlocks --- see notes for CACHE_LINE_SIZE. */
!     size = add_size(size,
!                     mul_size((int) NumFixedLWLocks,
!                              Max(sizeof(slock_t), CACHE_LINE_SIZE)));
!     size = add_size(size,
!                     mul_size(numLocks - (int) NumFixedLWLocks,
!                              Max(sizeof(slock_t), MAXIMUM_ALIGNOF)));

      return size;
  }
***************
*** 153,182 ****
      int            numLocks = NumLWLocks();
      Size        spaceLocks = LWLockShmemSize();
      LWLock       *lock;
      int            id;

      /* Allocate space */
      LWLockArray = (LWLock *) ShmemAlloc(spaceLocks);

      /*
       * Initialize all LWLocks to "unlocked" state
       */
      for (id = 0, lock = LWLockArray; id < numLocks; id++, lock++)
      {
!         SpinLockInit(&lock->mutex);
          lock->releaseOK = true;
          lock->exclusive = 0;
          lock->shared = 0;
          lock->head = NULL;
          lock->tail = NULL;
-     }

!     /*
!      * Initialize the dynamic-allocation counter at the end of the array
!      */
!     LWLockCounter = (int *) lock;
!     LWLockCounter[0] = (int) NumFixedLWLocks;
!     LWLockCounter[1] = numLocks;
  }


--- 183,229 ----
      int            numLocks = NumLWLocks();
      Size        spaceLocks = LWLockShmemSize();
      LWLock       *lock;
+     char       *mutexptr;
      int            id;

      /* Allocate space */
      LWLockArray = (LWLock *) ShmemAlloc(spaceLocks);

      /*
+      * Set up dynamic-allocation counter at the end of the array of structs.
+      * (Since struct LWLock contains ints, we should not need any extra
+      * alignment padding here.)
+      */
+     LWLockCounter = (int *) (LWLockArray + numLocks);
+     LWLockCounter[0] = (int) NumFixedLWLocks;
+     LWLockCounter[1] = numLocks;
+
+     /*
+      * We place the spinlocks after the counter, maxaligning for safety.
+      */
+     mutexptr = (char *) (LWLockCounter + 2);
+     mutexptr = (char *) MAXALIGN(mutexptr);
+
+     /*
       * Initialize all LWLocks to "unlocked" state
       */
      for (id = 0, lock = LWLockArray; id < numLocks; id++, lock++)
      {
!         slock_t       *mutex = (slock_t *) mutexptr;
!
!         SpinLockInit(mutex);
!         lock->mutex = mutex;
          lock->releaseOK = true;
          lock->exclusive = 0;
          lock->shared = 0;
          lock->head = NULL;
          lock->tail = NULL;

!         if (id < (int) NumFixedLWLocks)
!             mutexptr += Max(sizeof(slock_t), CACHE_LINE_SIZE);
!         else
!             mutexptr += Max(sizeof(slock_t), MAXIMUM_ALIGNOF);
!     }
  }


***************
*** 207,212 ****
--- 254,260 ----
  LWLockAcquire(LWLockId lockid, LWLockMode mode)
  {
      volatile LWLock *lock = LWLockArray + lockid;
+     volatile slock_t *mutex = lock->mutex;
      PGPROC       *proc = MyProc;
      bool        retry = false;
      int            extraWaits = 0;
***************
*** 253,259 ****
          bool        mustwait;

          /* Acquire mutex.  Time spent holding mutex should be short! */
!         SpinLockAcquire_NoHoldoff(&lock->mutex);

          /* If retrying, allow LWLockRelease to release waiters again */
          if (retry)
--- 301,307 ----
          bool        mustwait;

          /* Acquire mutex.  Time spent holding mutex should be short! */
!         SpinLockAcquire_NoHoldoff(mutex);

          /* If retrying, allow LWLockRelease to release waiters again */
          if (retry)
***************
*** 304,310 ****
          lock->tail = proc;

          /* Can release the mutex now */
!         SpinLockRelease_NoHoldoff(&lock->mutex);

          /*
           * Wait until awakened.
--- 352,358 ----
          lock->tail = proc;

          /* Can release the mutex now */
!         SpinLockRelease_NoHoldoff(mutex);

          /*
           * Wait until awakened.
***************
*** 336,342 ****
      }

      /* We are done updating shared state of the lock itself. */
!     SpinLockRelease_NoHoldoff(&lock->mutex);

      /* Add lock to list of locks held by this backend */
      held_lwlocks[num_held_lwlocks++] = lockid;
--- 384,390 ----
      }

      /* We are done updating shared state of the lock itself. */
!     SpinLockRelease_NoHoldoff(mutex);

      /* Add lock to list of locks held by this backend */
      held_lwlocks[num_held_lwlocks++] = lockid;
***************
*** 359,364 ****
--- 407,413 ----
  LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
  {
      volatile LWLock *lock = LWLockArray + lockid;
+     volatile slock_t *mutex = lock->mutex;
      bool        mustwait;

      PRINT_LWDEBUG("LWLockConditionalAcquire", lockid, lock);
***************
*** 375,381 ****
      HOLD_INTERRUPTS();

      /* Acquire mutex.  Time spent holding mutex should be short! */
!     SpinLockAcquire_NoHoldoff(&lock->mutex);

      /* If I can get the lock, do so quickly. */
      if (mode == LW_EXCLUSIVE)
--- 424,430 ----
      HOLD_INTERRUPTS();

      /* Acquire mutex.  Time spent holding mutex should be short! */
!     SpinLockAcquire_NoHoldoff(mutex);

      /* If I can get the lock, do so quickly. */
      if (mode == LW_EXCLUSIVE)
***************
*** 400,406 ****
      }

      /* We are done updating shared state of the lock itself. */
!     SpinLockRelease_NoHoldoff(&lock->mutex);

      if (mustwait)
      {
--- 449,455 ----
      }

      /* We are done updating shared state of the lock itself. */
!     SpinLockRelease_NoHoldoff(mutex);

      if (mustwait)
      {
***************
*** 424,429 ****
--- 473,479 ----
  LWLockRelease(LWLockId lockid)
  {
      volatile LWLock *lock = LWLockArray + lockid;
+     volatile slock_t *mutex = lock->mutex;
      PGPROC       *head;
      PGPROC       *proc;
      int            i;
***************
*** 446,452 ****
          held_lwlocks[i] = held_lwlocks[i + 1];

      /* Acquire mutex.  Time spent holding mutex should be short! */
!     SpinLockAcquire_NoHoldoff(&lock->mutex);

      /* Release my hold on lock */
      if (lock->exclusive > 0)
--- 496,502 ----
          held_lwlocks[i] = held_lwlocks[i + 1];

      /* Acquire mutex.  Time spent holding mutex should be short! */
!     SpinLockAcquire_NoHoldoff(mutex);

      /* Release my hold on lock */
      if (lock->exclusive > 0)
***************
*** 494,500 ****
      }

      /* We are done updating shared state of the lock itself. */
!     SpinLockRelease_NoHoldoff(&lock->mutex);

      /*
       * Awaken any waiters I removed from the queue.
--- 544,550 ----
      }

      /* We are done updating shared state of the lock itself. */
!     SpinLockRelease_NoHoldoff(mutex);

      /*
       * Awaken any waiters I removed from the queue.

Re: Spinlocks, yet again: analysis and proposed patches

From
"Michael Paesold"
Date:
Tom Lane wrote:


>I wrote:
>> We could ameliorate this if there were a way to acquire ownership of the
>> cache line without necessarily winning the spinlock.  I'm imagining
>> that we insert a "dummy" locked instruction just ahead of the xchgb,
>> which touches the spinlock in such a way as to not change its state.
> 
> I tried this, using this tas code:
...

I have tried it on the P4 w/ HT.

CVS tip             1: 37s  2: 78s  4: 159s  8: 324
only slock-no-cmpb  1: 37s  2: 82s  4: 178s  8: 362
only dummy-locking  1: 44s  2: 96s  4: 197s ...

I guess there is no reason to try any more.

Best Regards,
Michael Paesold


Re: Spinlocks, yet again: analysis and proposed patches

From
"Michael Paesold"
Date:
Tom Lane wrote:
> I guess what this means is that there's no real problem with losing the
> cache line while manipulating the LWLock, which is what the patch was
> intended to prevent.  Instead, we're paying for swapping two cache lines
> (the spinlock and the LWLock) across processors instead of just one line.
> But that should at worst be a 2x inflation of the time previously spent
> in LWLockAcquire/Release, which is surely not yet all of the application
> ;-).  Why the heck is this so bad?  Should we expect that apparently
> minor changes in shared data structures might be costing equivalently
> huge penalties in SMP performance elsewhere?
>
> Unfortunately I don't have root on the Opteron and can't run oprofile.
> But I'd really like to see some oprofile stats from these two cases
> so we can figure out what in the world is going on here.  Can anyone
> help?

I will try the patch here and see if it gives the same result. If so I could 
try to run with oprofile if you can give me a quick start.

Best Regards,
Michael Paesold 



Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
I wrote:
> I guess what this means is that there's no real problem with losing the
> cache line while manipulating the LWLock, which is what the patch was
> intended to prevent.  Instead, we're paying for swapping two cache lines
> (the spinlock and the LWLock) across processors instead of just one line.
> But that should at worst be a 2x inflation of the time previously spent
> in LWLockAcquire/Release, which is surely not yet all of the application
> ;-).  Why the heck is this so bad?  Should we expect that apparently
> minor changes in shared data structures might be costing equivalently
> huge penalties in SMP performance elsewhere?

I did some oprofile work and found that the cost seems to be because
(1) there's an increase in spinlock contention (more time spent in
s_lock), and (2) there's more time spent in LWLockAcquire/Release.
I'm not entirely clear about the reason for (1), but (2) apparently
is because of the extra cache line swapping, as posited above.  In
the oprofile trace it's clear that the extra cost comes exactly in the
statements that touch fields of the shared LWLock, which are the places
where you might have to wait to acquire a cache line.

I thought for a bit that the problem might come from having chosen to
put a pointer to the spinlock into each LWLock; fetching that pointer
implies an additional access to the contended cache line.  However,
changing the data structure to a simple linear array of spinlocks
didn't help.  So that whole idea seems to have died a painful death.

One other interesting result is that the data structure change neither
helped nor hurt on an EM64T machine with 2 physical (4 logical)
processors.  This is also x86_64, but evidently the caching behavior
is totally different from Opterons.

One thing that did seem to help a little bit was padding the LWLocks
to 32 bytes (by default they are 24 bytes each on x86_64) and ensuring
the array starts on a 32-byte boundary.  This ensures that we won't have
any LWLocks crossing cache lines --- contended access to such an LWLock
would probably incur the sort of large penalty seen above, because you'd
be trading two cache lines back and forth not one.  It seems that the
important locks are not split that way in CVS tip, because the gain
wasn't much, but I wonder whether some effect like this might explain
some of the unexplainable performance changes we've noticed in the past
(eg, in the dbt2 results).  A seemingly unrelated small change in the
size of other data structures in shared memory might move things around
enough to make a performance-critical lock cross a cache line boundary.
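
A sketch of how that padding can be expressed (sizes shown are for
x86_64; treat this as an illustration rather than the committed code):

    #define LWLOCK_PADDED_SIZE  32      /* must be >= sizeof(LWLock) */

    /* pad each array element so no LWLock straddles a cache line,
     * provided the array itself starts on a 32-byte boundary */
    typedef union LWLockPadded
    {
        LWLock      lock;
        char        pad[LWLOCK_PADDED_SIZE];
    } LWLockPadded;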

On regular x86, the LWLock struct is by chance exactly 16 bytes, so
there's no alignment issue.  But 64-bit platforms and platforms where
slock_t is bigger than char are exposed to this risk.

I'm going to go ahead and make that change, since it doesn't seem likely
to have any downside.  It might be interesting to look into forcing
proper alignment of the shared buffer headers as well.
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Gavin Sherry
Date:
On Thu, 15 Sep 2005, Tom Lane wrote:

> One thing that did seem to help a little bit was padding the LWLocks
> to 32 bytes (by default they are 24 bytes each on x86_64) and ensuring
> the array starts on a 32-byte boundary.  This ensures that we won't have
> any LWLocks crossing cache lines --- contended access to such an LWLock
> would probably incur the sort of large penalty seen above, because you'd
> be trading two cache lines back and forth not one.  It seems that the
> important locks are not split that way in CVS tip, because the gain
> wasn't much, but I wonder whether some effect like this might explain
> some of the unexplainable performance changes we've noticed in the past
> (eg, in the dbt2 results).  A seemingly unrelated small change in the
> size of other data structures in shared memory might move things around
> enough to make a performance-critical lock cross a cache line boundary.

What about padding the LWLock to 64 bytes on these architectures. Both P4
and Opteron have 64 byte cache lines, IIRC. This would ensure that a
cacheline doesn't hold two LWLocks.

Gavin


Re: Spinlocks, yet again: analysis and proposed patches

From
"Simon Riggs"
Date:
> Tom Lane wrote
> I'm going to go ahead and make that change, since it doesn't
> seem likely
> to have any downside.  It might be interesting to look into forcing
> proper alignment of the shared buffer headers as well.

Just catching up on your mails - all of that sounds good so far.

Everything mentioned so far talks about spinlocks in the general sense,
rather than with respect to particular locks.

The different lock types we have are held for different amounts of time
and are also subject to differing amounts of contention. I think it
would be useful to put in a capability to tune each class of lock
according to the possibility for delay and contention on it.

Long delay, low contention e.g. CheckpointLock => waiter should sleep
immediately
Medium delay, high contention e.g. WALWriteLock => waiter spins if at
head of queue, else sleeps immediately
Short delay, high contention e.g. BufMappingLock => waiter spins and
retries forever, cos the lock is coming soon
Short delay, low contention, sometimes long waits at end of VACUUM
e.g. FreeSpaceLock => waiter spins, then eventually sleeps

I'm not sure whether you'll agree with my characterisation of these
locks, but the main thing I'm trying to get across is that
one-size-fits-all isn't the optimal approach. I'm not saying either
that we end up with individual characterisations for each lock type, but
a few key ones could be catered for differently.
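
One way such per-class tuning might be expressed (purely a sketch; the
classes and numbers are illustrative, not proposed values):

    typedef enum LockWaitClass
    {
        WAIT_SLEEP_IMMEDIATELY,     /* long delay, low contention */
        WAIT_SPIN_IF_QUEUE_HEAD,    /* medium delay, high contention */
        WAIT_SPIN_FOREVER,          /* short delay, high contention */
        WAIT_SPIN_THEN_SLEEP        /* short delay, occasional long waits */
    } LockWaitClass;

    typedef struct LockClassPolicy
    {
        LockWaitClass   wait_class;
        int             spins_before_sleep;     /* -1 = never sleep */
    } LockClassPolicy;

    /* hypothetical table, indexed by lock id */
    static const LockClassPolicy lock_policy[] = {
        /* CheckpointLock  */ { WAIT_SLEEP_IMMEDIATELY,    0 },
        /* WALWriteLock    */ { WAIT_SPIN_IF_QUEUE_HEAD, 100 },
        /* BufMappingLock  */ { WAIT_SPIN_FOREVER,        -1 },
        /* FreeSpaceLock   */ { WAIT_SPIN_THEN_SLEEP,   1000 },
    };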

Best Regards, Simon Riggs



Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
Gavin Sherry <swm@linuxworld.com.au> writes:
> What about padding the LWLock to 64 bytes on these architectures. Both P4
> and Opteron have 64 byte cache lines, IIRC. This would ensure that a
> cacheline doesn't hold two LWLocks.

I tried that first, actually, but it was a net loss.  I guess enlarging
the array that much wastes too much cache space.
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Gavin Sherry
Date:
On Thu, 15 Sep 2005, Tom Lane wrote:

> Gavin Sherry <swm@linuxworld.com.au> writes:
> > What about padding the LWLock to 64 bytes on these architectures. Both P4
> > and Opteron have 64 byte cache lines, IIRC. This would ensure that a
> > cacheline doesn't hold two LWLocks.
>
> I tried that first, actually, but it was a net loss.  I guess enlarging
> the array that much wastes too much cache space.


Interesting. On Xeon (2 phys, 4 log), with LWLock padded to 64 bytes and
the cmpb/jump removed I get:

[swm@backup pgsqlpad]$ for i in 1 2 4; do time ./nrun.sh $i; done

real    0m54.362s
user    0m0.003s
sys     0m0.009s

real    1m9.788s
user    0m0.011s
sys     0m0.013s

real    2m8.870s
user    0m0.016s
sys     0m0.028s
[swm@backup pgsqlpad]$ for i in 1 2 4; do time ./nrun.sh $i; done

real    0m55.544s
user    0m0.006s
sys     0m0.007s

real    1m9.313s
user    0m0.007s
sys     0m0.018s

real    2m1.769s
user    0m0.017s
sys     0m0.027s

This compares to the following, which is unpadded, has the cmpb/jump
removed, but is otherwise vanilla:

1: 55  2: 111  4: 207

The decrease is small, but it's there.

Gavin


Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
Gregory Maxwell <gmaxwell@gmail.com> writes:
> might be useful to align the structure so it always crosses two lines
> and measure the performance difference.. the delta could be basically
> attributed to the cache line bouncing since even one additional bounce
> would overwhelm the other performance effects from the changed
> alignment.

Good idea.  I goosed the struct declaration and setup code to arrange
that the BufMappingLock's spinlock and the rest of its data were in
different cache lines instead of the same one.  The results (still
on Red Hat's 4-way Opteron):

previous best code (slock-no-cmpb and spin-delay-2):    1 31s    2 42s    4 51s    8 100s
with LWLock padded to 32 bytes and correctly aligned:    1 31s    2 41s    4 51s    8 97s
with LWLocks 32 bytes, but deliberately misaligned:    1 30s    2 50s    4 102s    8 200s

There is no other reason than having to touch multiple cache lines for
the second and third cases to be different: the array indexing code
should be exactly the same.

These last numbers are pretty close to what I got from the
separated-spinlock patch:    1 31s    2 52s    4 106s    8 213s
So it seems there's no doubt that it's the doubled cache traffic that
was causing most of the problem there.
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
Gavin Sherry <swm@linuxworld.com.au> writes:
> Interesting. On Xeon (2 phys, 4 log), with LWLock padded to 64 bytes and
> the cmpb/jump removed I get:
> [    1 55s    2 69s    4 128s ]
> This compares to the following, which is unpadded, has the cmpb/jump
> removed, but is otherwise vanilla:
> 1: 55  2: 111  4: 207

Hmm, that's pretty significant.  I tried it on a Xeon EM64T (thanks to
Stephen Frost for providing access), also 2 phys 4 log, and get:

Yesterday's CVS tip:
    1 32s    2 46s    4 88s    8 168s
plus no-cmpb and spindelay2:
    1 32s    2 48s    4 100s    8 177s
plus just-committed code to pad LWLock to 32:
    1 33s    2 50s    4 98s    8 179s
alter to pad to 64:
    1 33s    2 38s    4 108s    8 180s

I don't know what to make of the 2-process time going down while
4-process goes up; that seems just weird.  But both numbers are
repeatable.
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Gregory Maxwell
Date:
On 9/15/05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yesterday's CVS tip:
>         1 32s   2 46s   4 88s   8 168s
> plus no-cmpb and spindelay2:
>         1 32s   2 48s   4 100s  8 177s
> plus just-committed code to pad LWLock to 32:
>         1 33s   2 50s   4 98s   8 179s
> alter to pad to 64:
>         1 33s   2 38s   4 108s  8 180s
>
> I don't know what to make of the 2-process time going down while
> 4-process goes up; that seems just weird.  But both numbers are
> repeatable.

It is odd.

In the two process case there is, assuming random behavior, a 1/2
chance that you've already got the right line, but in the 4 process
case only a 1/4 chance (since we're on a 4 way box). This would
explain why we don't see as much cost in the intentionally misaligned
case. You'd expect a similar pattern of improvement with the
64-byte alignment (some in the two process case, but more in the
4-process case), but here we see more improvement in the two way case.

If I had to guess I might say that the 64-byte alignment is removing
much of the unneeded line bouncing in the two process case but is
at the same time creating more risk of bouncing caused by aliasing.
Since two processes have a 1/2 chance the aliasing isn't a problem, so
the change is a win; but in the four process case it's no longer a win,
because with aliasing there is still a lot of fighting over the cache
lines even if you pack well, and the decrease in packing makes odd
aliasing somewhat more likely. This might also explain why the
misaligned case performed so poorly in the 4-process case, since the
misalignment didn't just increase the cost 2x, it also increased the
likelihood of a bogus bounce due to aliasing.

If this is the case, then it may be possible through very careful
memory alignment to make sure that no two high contention locks that
are likely to be contended at once share the same line (through either
aliasing or through being directly within the same line).

Then again I could be completely wrong, my understanding of
multiprocessor cache coherency is very limited, and I have no clue how
cache aliasing fits into it... So the above is just uninformed
conjecture.


Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
Gregory Maxwell <gmaxwell@gmail.com> writes:
> If I had to guess I might say that the 64-byte alignment is removing
> much of the unneeded line bouncing in the two process case but is
> at the same time creating more risk of bouncing caused by aliasing.

It's an idea ... not sure if it's right or not.

One thing I have noticed both on Xeon HT (2-physical-4-logical) and
on Opteron (4 real processors) is that the 2-process times are
significantly more variable than the 1, 4, or 8-process times: repeating
the measurements shows variance around the 10% level for 2 processes
but only around 1% for the others.

What I think this means is that the kernel is scheduling the 2 processes
onto 2 processors chosen-at-random, without awareness of whether those
two processors are on the same chip (in the Xeon case) or have closer
NUMA affinity (in the Opteron case).  The other test cases are all
symmetric and there's no way for the kernel to blow it too badly, but
in the 2-process case it will be very visible whether the kernel
understands the hardware or not.  And the impression I have (which
might be out of date) is that current Linux kernel releases are not
very bright about these things.
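
One way to take the kernel's placement decision out of the picture for
the 2-process runs would be to pin each test backend to an explicit
CPU.  A minimal Linux-specific sketch (illustration only; assumes a
glibc new enough to provide the three-argument sched_setaffinity):

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    static int
    pin_to_cpu(int cpu)
    {
        cpu_set_t   mask;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }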

How does that tie into the point at hand about different results for
64-byte vs 32-byte LWLocks?  Not sure, but I feel it's related somehow.

Anyway, it'd be interesting to get some test results for these things
on SMP machines running non-Linux kernels.  Also, we ought to be more
rigorous about reporting exactly which kernel we are using for any
particular test.
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Gavin Sherry
Date:
I thought I'd throw SPARC into the equation (SPARC IIIi, in a dual SunFire
v250):

vanilla HEAD from ~1 week ago:

bash-3.00$ for i in 1 2 4; do time ./nrun.sh $i; done

real    1m49.037s
user    0m0.008s
sys     0m0.016s

real    2m3.482s
user    0m0.014s
sys     0m0.026s

real    3m54.215s
user    0m0.028s
sys     0m0.046s

With the spin-delay patch:


real    1m50.791s
user    0m0.008s
sys     0m0.016s

real    2m0.553s
user    0m0.015s
sys     0m0.026s

real    3m58.185s
user    0m0.027s
sys     0m0.048s

Padding the LWLock to 64 bytes:

bash-3.00$ for i in 1 2 4; do time ./nrun.sh $i; done

real    1m49.594s
user    0m0.008s
sys     0m0.016s

real    1m59.141s
user    0m0.015s
sys     0m0.026s

real    3m56.236s
user    0m0.027s
sys     0m0.047s

Padded to 128 bytes (I'm not sure about the cache line size):

real    1m50.498s
user    0m0.008s
sys     0m0.016s

real    2m0.169s
user    0m0.015s
sys     0m0.026s

real    3m56.891s
user    0m0.027s
sys     0m0.048s

----

These details seem suspicious to me when compared to those we're seeing on
x86 platforms. The reason is that padding the LWLock out to 64 or 128
bytes makes no difference (maybe the cache line is bigger?). The good
thing is, the difference between 1 and 2 instances of the test is small --
precisely what we want.

Thanks,

Gavin


Re: Spinlocks, yet again: analysis and proposed patches

From
Josh Berkus
Date:
Tom,

> What I think this means is that the kernel is scheduling the 2 processes
> onto 2 processors chosen-at-random, without awareness of whether those
> two processors are on the same chip (in the Xeon case) or have closer
> NUMA affinity (in the Opteron case).

That would be consistent with my experience with HT, and the reason why many 
software vendors recommend disabling it.  Not sure about NUMA.

-- 
Josh Berkus
Aglio Database Solutions
San Francisco


Re: Spinlocks, yet again: analysis and proposed patches

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> Tom,
> 
> > What I think this means is that the kernel is scheduling the 2 processes
> > onto 2 processors chosen-at-random, without awareness of whether those
> > two processors are on the same chip (in the Xeon case) or have closer
> > NUMA affinity (in the Opteron case).
> 
> That would be consistent with my experience with HT, and the reason why many 
> software vendors recommend disabling it.  Not sure about NUMA.

What version of Linux was this test with? What you describe was certainly well
known for Linux 2.4 and earlier. It was well on its way to becoming
established dogma, cargo-cult style.

However I was under the impression that 2.6 had moved beyond that problem.
It would be very interesting to know if 2.6 still suffers from this.

Also, those who recommend disabling HT on this basis should realize it didn't
penalize the >2 process case the same way. You may be getting a benefit only
in the cases where your system isn't heavily loaded. Ie, only when it doesn't
really matter.


-- 
greg



Re: Spinlocks, yet again: analysis and proposed patches

From
Stephen Frost
Date:
* Greg Stark (gsstark@mit.edu) wrote:
> However I was under the impression that 2.6 had moved beyond that problem.
> It would be very interesting to know if 2.6 still suffers from this.

The tests on the em64t at my place were using 2.6.12.  I had thought 2.6
was better about this too, but I don't have another explanation for it.
CONFIG_SMT was enabled, etc.
Thanks,
    Stephen

Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
Stephen Frost <sfrost@snowman.net> writes:
> * Greg Stark (gsstark@mit.edu) wrote:
>> However I was under the impression that 2.6 had moved beyond that problem.
>> It would be very interesting to know if 2.6 still suffers from this.

> The tests on the em64t at my place were using 2.6.12.  I had thought 2.6
> was better about this too, but I don't have another explanation for it.

The 4-way Opteron I've been using at Red Hat is running
2.6.12-1.1398_FC4smp (Fedora Core 4 obviously).  Red Hat in particular
has been working hard in this area, and I thought that their recent
kernels included NUMA fixes that weren't yet accepted upstream (at least
not in the stable kernel branches).  But it seems there's still a ways
to go yet.

It'd be real interesting to see comparable numbers from some non-Linux
kernels, particularly commercial systems like Solaris.
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
Gavin Sherry
Date:
On Sat, 17 Sep 2005, Tom Lane wrote:

> Stephen Frost <sfrost@snowman.net> writes:
> > * Greg Stark (gsstark@mit.edu) wrote:
> >> However I was under the impression that 2.6 had moved beyond that problem.
> >> It would be very interesting to know if 2.6 still suffers from this.
>
> > The tests on the em64t at my place were using 2.6.12.  I had thought 2.6
> > was better about this too, but I don't have another explanation for it.
>
> The 4-way Opteron I've been using at Red Hat is running
> 2.6.12-1.1398_FC4smp (Fedora Core 4 obviously).  Red Hat in particular
> has been working hard in this area, and I thought that their recent
> kernels included NUMA fixes that weren't yet accepted upstream (at least
> not in the stable kernel branches).  But it seems there's still a ways
> to go yet.
>
> It'd be real interesting to see comparable numbers from some non-Linux
> kernels, particularly commercial systems like Solaris.

Did you see the Solaris results I posted?

Gavin


Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
Gavin Sherry <swm@linuxworld.com.au> writes:
> On Sat, 17 Sep 2005, Tom Lane wrote:
>> It'd be real interesting to see comparable numbers from some non-Linux
>> kernels, particularly commercial systems like Solaris.

> Did you see the Solaris results I posted?

Are you speaking of
http://archives.postgresql.org/pgsql-hackers/2005-09/msg00715.php
?

That doesn't seem directly relevant to the point, because it's for a
2-CPU machine; so there's no way to run a test case that uses more than
one but less than all the processors.  In either the "one" or "all"
cases, performance ought to be pretty stable regardless of whether the
kernel understands about any processor asymmetries that may exist in
the hardware.  Not to mention that I don't know of any asymmetries in
a dual SPARC anyway.  We really need to test this on comparable
hardware, which I guess means we need Solaris/x86 on something with
hyperthreading or known NUMA asymmetry.
        regards, tom lane


Re: Spinlocks, yet again: analysis and proposed patches

From
"Jim C. Nasby"
Date:
On Sat, Sep 17, 2005 at 01:40:28AM -0400, Tom Lane wrote:
> Gavin Sherry <swm@linuxworld.com.au> writes:
> > On Sat, 17 Sep 2005, Tom Lane wrote:
> >> It'd be real interesting to see comparable numbers from some non-Linux
> >> kernels, particularly commercial systems like Solaris.
> 
> > Did you see the Solaris results I posted?
> 
> Are you speaking of
> http://archives.postgresql.org/pgsql-hackers/2005-09/msg00715.php
> ?
> 
> That doesn't seem directly relevant to the point, because it's for a
> 2-CPU machine; so there's no way to run a test case that uses more than
> one but less than all the processors.  In either the "one" or "all"
> cases, performance ought to be pretty stable regardless of whether the
> kernel understands about any processor asymmetries that may exist in
> the hardware.  Not to mention that I don't know of any asymmetries in
> a dual SPARC anyway.  We really need to test this on comparable
> hardware, which I guess means we need Solaris/x86 on something with
> hyperthreading or known NUMA asymmetry.

I have access to a 4-way Opteron 852 running Solaris 10. What patches
would you like me to test?
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Spinlocks, yet again: analysis and proposed patches

From
Gavin Sherry
Date:
On Sun, 18 Sep 2005, Jim C. Nasby wrote:

> On Sat, Sep 17, 2005 at 01:40:28AM -0400, Tom Lane wrote:
> > Gavin Sherry <swm@linuxworld.com.au> writes:
> > > On Sat, 17 Sep 2005, Tom Lane wrote:
> > >> It'd be real interesting to see comparable numbers from some non-Linux
> > >> kernels, particularly commercial systems like Solaris.
> >
> > > Did you see the Solaris results I posted?
> >
> > Are you speaking of
> > http://archives.postgresql.org/pgsql-hackers/2005-09/msg00715.php
> > ?
> >
> > That doesn't seem directly relevant to the point, because it's for a
> > 2-CPU machine; so there's no way to run a test case that uses more than
> > one but less than all the processors.  In either the "one" or "all"
> > cases, performance ought to be pretty stable regardless of whether the
> > kernel understands about any processor asymmetries that may exist in
> > the hardware.  Not to mention that I don't know of any asymmetries in
> > a dual SPARC anyway.  We really need to test this on comparable
> > hardware, which I guess means we need Solaris/x86 on something with
> > hyperthreading or known NUMA asymmetry.
>
> I have access to a 4-way Opteron 852 running Solaris 10. What patches
> would you like me to test?

These ones here:

http://archives.postgresql.org/pgsql-hackers/2005-09/msg00566.php

Gavin


Re: Spinlocks, yet again: analysis and proposed patches

From
Bruce Momjian
Date:
I have removed this TODO item:
* Research use of sched_yield() for spinlock acquisition failure


---------------------------------------------------------------------------

Tom Lane wrote:
> Greg Stark <gsstark@mit.edu> writes:
> > Marko Kreen <marko@l-t.ee> writes:
> > (I speculate that it's set up to only yield the processor to other
> > processes already affiliated to that processor.  In any case, it
> > is definitely capable of getting through 10000 yields without
> > running the guy who's holding the spinlock.)
> 
> > Maybe it should try sched_yield once and then use select after that?
> 
> I tried that, actually, but it didn't seem to offer any particular
> benefit.  (Maybe it would have helped more on older Linux kernels before
> they changed sched_yield?)
> 
> I'm feeling even more disenchanted with sched_yield now that Marko
> pointed out that the behavior was changed recently.  Here we have a
> construct that is not portable cross-platform, does not act as
> documented in its man page, and the kernel guys feel free to whack its
> behavior around in major ways without documenting that either.  It seems
> to be a crap-shoot whether it will be useful on any particular machine
> or not.  At least with the select() code we can be reasonably confident
> we know what will happen.
> 
>             regards, tom lane
> 

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Spinlocks, yet again: analysis and proposed patches

From
"Alexander Kirpa"
Date:
> The fact that cmpb isn't helping proves that getting the cache line
> in a read-only fashion does *not* do enough to protect the xchgb in
> this way.  But maybe another locking instruction would.  Comments?
>
>         regards, tom lane

Hi, Tom!

Perhaps this link will help:
'Replay: Unknown Features of the NetBurst Core'
http://www.xbitlabs.com/articles/cpu/display/replay.html

Best regards,
Alexander Kirpa



Re: Spinlocks, yet again: analysis and proposed patches

From
Bruce Momjian
Date:
Is there a TODO here?

---------------------------------------------------------------------------

Tom Lane wrote:
> The test case I just posted shows that our spinlock code, which we had
> thought largely done, is once again becoming a performance bottleneck.
> It's time to resurrect some of the ideas we kicked around in early
> 2002, and didn't pursue because we decided spinlocks weren't our major
> performance problem.
> 
> I started investigating this by trying to understand why the "futex
> spinlock" patch had helped the Red Hat customer I mentioned, when it
> hadn't done anything much in last year's experiments.  It turns out
> that this customer was using an 8-way Opteron, and the futex patch
> helps a lot more on that architecture than others.  Some random
> experimentation eventually turned up part of the reason: the x86 TAS
> assembly code we are using is pessimal for x86_64.  Specifically,
> s_lock.h uses this for both architectures:
> 
>     /* Use a non-locking test before asserting the bus lock */
>     __asm__ __volatile__(
>         "    cmpb    $0,%1    \n"
>         "    jne        1f    \n"
>         "    lock             \n"
>         "    xchgb    %0,%1   \n"
>         "1: \n"
> 
> and it turns out that deleting the cmpb and jne makes things go
> significantly faster on Opterons.  So the "non-locking test" is
> not a win at all on x86_64.  As best I can tell from testing,
> removing it wins because it causes the rate of spinlock delays
> to drop significantly.  Why this should be is not completely
> obvious.  I speculate that the Opterons are capable of pulling
> in a copy of the cache line that contains the spinlock and then
> running 100 iterations of the cmpb test without ever noticing
> that the processor holding the lock has now released it.  Without
> the cmpb, they are forced by the locking xchgb test to actually
> look at the real state of the spinlock each time.
> 
> I kinda suspect that the cmpb test is a no-op or loss on all
> Intelish processors: it can only be a win if you expect a lot
> of contention for the spin lock, but in percentage terms we still
> have a very low conflict rate, so in most executions of the TAS
> macro, the cmpb is just wasted cycles.  Bottom line: we definitely
> don't want it for x86_64, and maybe not at all, but we need more
> research to decide the latter.
> 
> The second reason that the futex patch is helping is that when
> a spinlock delay does occur, it allows the delaying process to be
> awoken almost immediately, rather than delaying 10 msec or more
> as the existing code does.  However, given that we are only expecting
> the spinlock to be held for a couple dozen instructions, using the
> kernel futex mechanism is huge overkill --- the in-kernel overhead
> to manage the futex state is almost certainly several orders of
> magnitude more than the delay we actually want.
> 
> I looked into several other methods of doing the spinlock delay
> instead.  I think all of these were suggested at one point or
> another in our earlier discussions of spinlocks:
> 
> 1. Use sched_yield() if available: it does just what we want,
> ie, yield the processor without forcing any useless time delay
> before we can be rescheduled.  This doesn't exist everywhere
> but it exists in recent Linuxen, so I tried it.  It made for a
> beautiful improvement in my test case numbers: CPU utilization
> went to 100% and the context swap rate to almost nil.  Unfortunately,
> I also saw fairly frequent "stuck spinlock" panics when running
> more queries than there were processors --- this despite increasing
> NUM_DELAYS to 10000 in s_lock.c.  So I don't trust sched_yield
> anymore.  Whatever it's doing in Linux 2.6 isn't what you'd expect.
> (I speculate that it's set up to only yield the processor to other
> processes already affiliated to that processor.  In any case, it
> is definitely capable of getting through 10000 yields without
> running the guy who's holding the spinlock.)
> 
> 2. Reduce the length of the select() delays in s_lock.  The current code
> delays in quanta of 10msec, but on recent Linuxen (and I think other
> platforms too) the scheduling quantum is 1msec, so we can get the
> processor back faster if we ask for a 1msec delay.  I tried this and it
> is definitely a win on Linux; it doesn't seem to hurt anything on older
> systems either, they just round the delay up to 10msec like before.
> So I think we should do this, even though it's only a partial answer.
> 
> 3. Modify the spin loop to do a little more work between TAS() tests.
> In the existing code there are only about half a dozen instructions
> total in the normal spin loop, most of them very fast, with the result
> that the spinning processor does its best to monopolize the system bus
> with locked probe instructions.  This is obviously not good, as it'll
> interfere with the ability of the spinlock holder to complete its work
> and release the lock.  (The bulk of the spinlock uses are for LWLocks,
> and with the current data structures the LWLock's own state is usually
> going to be in the same cache line as the spinlock proper, so that
> cache grabs for the spinlock hurt the LWLock state updating as well
> as the eventual spinlock release.)  I modified the code to use an
> integer modulo operation as part of the test for SPINS_PER_DELAY;
> on most processors this should keep the processor's attention and
> keep it away from the system bus for several times longer than an
> average instruction.  (It's relevant here that the "rep;nop" used
> to slow the spin loop on recent Intel processors appears to be
> completely ignored by Opterons.  Basically we need a non-hardware-
> specific trick to slow the loop.)
> 
> 4. Modify SPINS_PER_DELAY.  Looping more times instead of entering the
> kernel is a win for SMP boxes: whenever you give up and call the kernel
> to delay, it's likely that the spinlock holder will have released the
> lock before you even get all the way into the kernel, let alone get
> rescheduled again.  I found that pushing SPINS_PER_DELAY above 200
> caused the select delay rate to fall to almost nil in this test case.
> However, that would be counterproductive for uniprocessors where the
> optimal strategy is normally to give up the processor immediately.
> We've talked about making the spin count user configurable, but I never
> cared for that answer.  Instead, I've worked up some code to
> automatically adjust the spin count depending on whether we seem to be
> in a uniprocessor or multiprocessor.  The idea is to decrease the spin
> count (down to a lower limit) whenever s_lock has to delay in order to
> get a lock, and to increase the count (up to a limit) whenever we see
> the lock come free before we've had to delay.  The former condition
> suggests that we are in a uniprocessor and the latter suggests we are in
> a multiprocessor.  I recall having complained that this idea would
> probably be unstable, but in testing it seems to do pretty nicely at
> converging to the minimum spin count on a 1-CPU machine and the maximum
> limit on an SMP box.  It could do with more testing though to see if it
> works the same for other people.
> 
> My proposal therefore is to do #2, #3, and #4, and to modify the TAS
> assembly code at least on x86_64.  Together, these changes bring
> my test case on a 4-way Opteron from
> 
> N, runtime:    1 36s    2 61s    4 105s    8 198s
> 
> to
> 
> N, runtime:    1 35s    2 48s    4 58s    8 105s
> 
> At 4 queries, CPU utilization is 100% and the context swap rate nearly
> nil, which seems to mean we've gained another (temporary no doubt)
> victory over the spinlock problem.
> 
> I attach two proposed patches: the first removes the cmpb/jne from
> the x86 TAS assembly code, and the second one does the s_lock changes
> enumerated as points #2, #3, #4.  The first one in particular needs
> more testing to see if it hurts performance on any non-Opteron x86
> chips.  (If so, we'd just make it conditional to x86_64.)
> 
> Comments and testing invited.
> 
>             regards, tom lane
> 

Content-Description: slock-no-cmpb.patch

> *** REL8_0/src/include/storage/s_lock.h    Sun Aug 28 20:41:44 2005
> --- new/s_lock.h    Fri Sep  9 14:58:44 2005
> ***************
> *** 120,132 ****
>   {
>       register slock_t _res = 1;
>   
> !     /* Use a non-locking test before asserting the bus lock */
>       __asm__ __volatile__(
> -         "    cmpb    $0,%1    \n"
> -         "    jne        1f        \n"
> -         "    lock            \n"
>           "    xchgb    %0,%1    \n"
> -         "1: \n"
>   :        "+q"(_res), "+m"(*lock)
>   :
>   :        "memory", "cc");
> --- 120,128 ----
>   {
>       register slock_t _res = 1;
>   
> !     /* xchg implies a LOCK prefix, so no need to say LOCK explicitly */
>       __asm__ __volatile__(
>           "    xchgb    %0,%1    \n"
>   :        "+q"(_res), "+m"(*lock)
>   :
>   :        "memory", "cc");

Content-Description: spin-delay.patch

> *** /home/postgres/pgsql/src/backend/storage/lmgr/proc.c.orig    Sat Aug 20 19:26:24 2005
> --- /home/postgres/pgsql/src/backend/storage/lmgr/proc.c    Sun Sep 11 15:07:20 2005
> ***************
> *** 170,175 ****
> --- 170,177 ----
>   
>           ProcGlobal->freeProcs = INVALID_OFFSET;
>   
> +         ProcGlobal->spins_per_delay = DEFAULT_SPINS_PER_DELAY;
> + 
>           /*
>            * Pre-create the PGPROC structures and create a semaphore for
>            * each.
> ***************
> *** 224,232 ****
> --- 226,239 ----
>       /*
>        * Try to get a proc struct from the free list.  If this fails, we
>        * must be out of PGPROC structures (not to mention semaphores).
> +      *
> +      * While we are holding the ProcStructLock, also copy the current
> +      * shared estimate of spins_per_delay to local storage.
>        */
>       SpinLockAcquire(ProcStructLock);
>   
> +     set_spins_per_delay(procglobal->spins_per_delay);
> + 
>       myOffset = procglobal->freeProcs;
>   
>       if (myOffset != INVALID_OFFSET)
> ***************
> *** 318,338 ****
>   
>       Assert(proctype >= 0 && proctype < NUM_DUMMY_PROCS);
>   
>       dummyproc = &DummyProcs[proctype];
>   
>       /*
>        * dummyproc should not presently be in use by anyone else
>        */
>       if (dummyproc->pid != 0)
>           elog(FATAL, "DummyProc[%d] is in use by PID %d",
>                proctype, dummyproc->pid);
>       MyProc = dummyproc;
>   
>       /*
>        * Initialize all fields of MyProc, except MyProc->sem which was set
>        * up by InitProcGlobal.
>        */
> -     MyProc->pid = MyProcPid;    /* marks dummy proc as in use by me */
>       SHMQueueElemInit(&(MyProc->links));
>       MyProc->waitStatus = STATUS_OK;
>       MyProc->xid = InvalidTransactionId;
> --- 325,362 ----
>   
>       Assert(proctype >= 0 && proctype < NUM_DUMMY_PROCS);
>   
> +     /*
> +      * Just for paranoia's sake, we use the ProcStructLock to protect
> +      * assignment and releasing of DummyProcs entries.
> +      *
> +      * While we are holding the ProcStructLock, also copy the current
> +      * shared estimate of spins_per_delay to local storage.
> +      */
> +     SpinLockAcquire(ProcStructLock);
> + 
> +     set_spins_per_delay(ProcGlobal->spins_per_delay);
> + 
>       dummyproc = &DummyProcs[proctype];
>   
>       /*
>        * dummyproc should not presently be in use by anyone else
>        */
>       if (dummyproc->pid != 0)
> +     {
> +         SpinLockRelease(ProcStructLock);
>           elog(FATAL, "DummyProc[%d] is in use by PID %d",
>                proctype, dummyproc->pid);
> +     }
>       MyProc = dummyproc;
>   
> +     MyProc->pid = MyProcPid;    /* marks dummy proc as in use by me */
> + 
> +     SpinLockRelease(ProcStructLock);
> + 
>       /*
>        * Initialize all fields of MyProc, except MyProc->sem which was set
>        * up by InitProcGlobal.
>        */
>       SHMQueueElemInit(&(MyProc->links));
>       MyProc->waitStatus = STATUS_OK;
>       MyProc->xid = InvalidTransactionId;
> ***************
> *** 509,514 ****
> --- 533,541 ----
>       /* PGPROC struct isn't mine anymore */
>       MyProc = NULL;
>   
> +     /* Update shared estimate of spins_per_delay */
> +     procglobal->spins_per_delay = update_spins_per_delay(procglobal->spins_per_delay);
> + 
>       SpinLockRelease(ProcStructLock);
>   }
>   
> ***************
> *** 532,542 ****
> --- 559,576 ----
>       /* Release any LW locks I am holding (see notes above) */
>       LWLockReleaseAll();
>   
> +     SpinLockAcquire(ProcStructLock);
> + 
>       /* Mark dummy proc no longer in use */
>       MyProc->pid = 0;
>   
>       /* PGPROC struct isn't mine anymore */
>       MyProc = NULL;
> + 
> +     /* Update shared estimate of spins_per_delay */
> +     ProcGlobal->spins_per_delay = update_spins_per_delay(ProcGlobal->spins_per_delay);
> + 
> +     SpinLockRelease(ProcStructLock);
>   }
>   
>   
> *** /home/postgres/pgsql/src/backend/storage/lmgr/s_lock.c.orig    Fri Aug 26 10:47:35 2005
> --- /home/postgres/pgsql/src/backend/storage/lmgr/s_lock.c    Sun Sep 11 17:43:20 2005
> ***************
> *** 21,26 ****
> --- 21,30 ----
>   #include "storage/s_lock.h"
>   #include "miscadmin.h"
>   
> + 
> + static int    spins_per_delay = DEFAULT_SPINS_PER_DELAY;
> + 
> + 
>   /*
>    * s_lock_stuck() - complain about a stuck spinlock
>    */
> ***************
> *** 49,102 ****
>        * We loop tightly for awhile, then delay using pg_usleep() and try
>        * again. Preferably, "awhile" should be a small multiple of the
>        * maximum time we expect a spinlock to be held.  100 iterations seems
> !      * about right.  In most multi-CPU scenarios, the spinlock is probably
> !      * held by a process on another CPU and will be released before we
> !      * finish 100 iterations.  However, on a uniprocessor, the tight loop
> !      * is just a waste of cycles, so don't iterate thousands of times.
>        *
>        * Once we do decide to block, we use randomly increasing pg_usleep()
> !      * delays. The first delay is 10 msec, then the delay randomly
> !      * increases to about one second, after which we reset to 10 msec and
>        * start again.  The idea here is that in the presence of heavy
>        * contention we need to increase the delay, else the spinlock holder
>        * may never get to run and release the lock.  (Consider situation
>        * where spinlock holder has been nice'd down in priority by the
>        * scheduler --- it will not get scheduled until all would-be
> !      * acquirers are sleeping, so if we always use a 10-msec sleep, there
>        * is a real possibility of starvation.)  But we can't just clamp the
>        * delay to an upper bound, else it would take a long time to make a
>        * reasonable number of tries.
>        *
>        * We time out and declare error after NUM_DELAYS delays (thus, exactly
>        * that many tries).  With the given settings, this will usually take
> !      * 3 or so minutes.  It seems better to fix the total number of tries
>        * (and thus the probability of unintended failure) than to fix the
>        * total time spent.
>        *
> !      * The pg_usleep() delays are measured in centiseconds (0.01 sec) because
> !      * 10 msec is a common resolution limit at the OS level.
>        */
> ! #define SPINS_PER_DELAY        100
>   #define NUM_DELAYS            1000
> ! #define MIN_DELAY_CSEC        1
> ! #define MAX_DELAY_CSEC        100
>   
> !     int            spins = 0;
>       int            delays = 0;
> !     int            cur_delay = MIN_DELAY_CSEC;
>   
>       while (TAS(lock))
>       {
>           /* CPU-specific delay each time through the loop */
>           SPIN_DELAY();
>   
> !         /* Block the process every SPINS_PER_DELAY tries */
> !         if (++spins > SPINS_PER_DELAY)
>           {
>               if (++delays > NUM_DELAYS)
>                   s_lock_stuck(lock, file, line);
>   
> !             pg_usleep(cur_delay * 10000L);
>   
>   #if defined(S_LOCK_TEST)
>               fprintf(stdout, "*");
> --- 53,121 ----
>        * We loop tightly for awhile, then delay using pg_usleep() and try
>        * again. Preferably, "awhile" should be a small multiple of the
>        * maximum time we expect a spinlock to be held.  100 iterations seems
> !      * about right as an initial guess.  However, on a uniprocessor the
> !      * loop is a waste of cycles, while in a multi-CPU scenario it's usually
> !      * better to spin a bit longer than to call the kernel, so we try to
> !      * adapt the spin loop count depending on whether we seem to be in
> !      * a uniprocessor or multiprocessor.
>        *
>        * Once we do decide to block, we use randomly increasing pg_usleep()
> !      * delays. The first delay is 1 msec, then the delay randomly
> !      * increases to about one second, after which we reset to 1 msec and
>        * start again.  The idea here is that in the presence of heavy
>        * contention we need to increase the delay, else the spinlock holder
>        * may never get to run and release the lock.  (Consider situation
>        * where spinlock holder has been nice'd down in priority by the
>        * scheduler --- it will not get scheduled until all would-be
> !      * acquirers are sleeping, so if we always use a 1-msec sleep, there
>        * is a real possibility of starvation.)  But we can't just clamp the
>        * delay to an upper bound, else it would take a long time to make a
>        * reasonable number of tries.
>        *
>        * We time out and declare error after NUM_DELAYS delays (thus, exactly
>        * that many tries).  With the given settings, this will usually take
> !      * 2 or so minutes.  It seems better to fix the total number of tries
>        * (and thus the probability of unintended failure) than to fix the
>        * total time spent.
>        *
> !      * The pg_usleep() delays are measured in milliseconds because 1 msec
> !      * is a common resolution limit at the OS level for newer platforms.
> !      * On older platforms the resolution limit is usually 10 msec, in
> !      * which case the total delay before timeout will be a bit more.
>        */
> ! #define MIN_SPINS_PER_DELAY    10
> ! #define MAX_SPINS_PER_DELAY    1000
>   #define NUM_DELAYS            1000
> ! #define MIN_DELAY_MSEC        1
> ! #define MAX_DELAY_MSEC        1000
>   
> !     int            spins = spins_per_delay;
>       int            delays = 0;
> !     int            cur_delay = 0;
>   
>       while (TAS(lock))
>       {
>           /* CPU-specific delay each time through the loop */
>           SPIN_DELAY();
>   
> !         /*
> !          * Block the process every spins_per_delay tries.
> !          *
> !          * What we are really testing for here is spins being decremented
> !          * to zero.  We insert an unnecessary integer modulo operation
> !          * into the test because we'd like this loop to run longer than
> !          * just two or three instructions: ideally, the processor should
> !          * not be contending for the system bus for a little while here.
> !          */
> !         if ((--spins % MAX_SPINS_PER_DELAY) == 0)
>           {
>               if (++delays > NUM_DELAYS)
>                   s_lock_stuck(lock, file, line);
>   
> !             if (cur_delay == 0)    /* first time to delay? */
> !                 cur_delay = MIN_DELAY_MSEC;
> ! 
> !             pg_usleep(cur_delay * 1000L);
>   
>   #if defined(S_LOCK_TEST)
>               fprintf(stdout, "*");
> ***************
> *** 107,119 ****
>               cur_delay += (int) (cur_delay *
>                 (((double) random()) / ((double) MAX_RANDOM_VALUE)) + 0.5);
>               /* wrap back to minimum delay when max is exceeded */
> !             if (cur_delay > MAX_DELAY_CSEC)
> !                 cur_delay = MIN_DELAY_CSEC;
>   
> !             spins = 0;
>           }
>       }
>   }
>   
>   /*
>    * Various TAS implementations that cannot live in s_lock.h as no inline
> --- 126,200 ----
>               cur_delay += (int) (cur_delay *
>                 (((double) random()) / ((double) MAX_RANDOM_VALUE)) + 0.5);
>               /* wrap back to minimum delay when max is exceeded */
> !             if (cur_delay > MAX_DELAY_MSEC)
> !                 cur_delay = MIN_DELAY_MSEC;
>   
> !             spins = spins_per_delay;
>           }
>       }
> + 
> +     /*
> +      * If we were able to acquire the lock without delaying, it's a good
> +      * indication we are in a multiprocessor.  If we had to delay, it's
> +      * a sign (but not a sure thing) that we are in a uniprocessor.
> +      * Hence, we decrement spins_per_delay slowly when we had to delay,
> +      * and increase it rapidly when we didn't.  It's expected that
> +      * spins_per_delay will converge to the minimum value on a uniprocessor
> +      * and to the maximum value on a multiprocessor.
> +      *
> +      * Note: spins_per_delay is local within our current process.
> +      * We want to average these observations across multiple backends,
> +      * since it's relatively rare for this function to even get entered,
> +      * and so a single backend might not live long enough to converge on
> +      * a good value.  That is handled by the two routines below.
> +      */
> +     if (cur_delay == 0)
> +     {
> +         /* we never had to delay */
> +         spins_per_delay += 100;
> +         spins_per_delay = Min(spins_per_delay, MAX_SPINS_PER_DELAY);
> +     }
> +     else
> +     {
> +         spins_per_delay -= 1;
> +         spins_per_delay = Max(spins_per_delay, MIN_SPINS_PER_DELAY);
> +     }
> + }
> + 
> + 
> + /*
> +  * Set local copy of spins_per_delay during backend startup.
> +  *
> +  * NB: this has to be pretty fast as it is called while holding a spinlock
> +  */
> + void
> + set_spins_per_delay(int shared_spins_per_delay)
> + {
> +     spins_per_delay = shared_spins_per_delay;
> + }
> + 
> + /*
> +  * Update shared estimate of spins_per_delay during backend exit.
> +  *
> +  * NB: this has to be pretty fast as it is called while holding a spinlock
> +  */
> + int
> + update_spins_per_delay(int shared_spins_per_delay)
> + {
> +     /*
> +      * We use an exponential moving average with a relatively slow
> +      * adaption rate, so that noise in any one backend's result won't
> +      * affect the shared value too much.  As long as both inputs are
> +      * within the allowed range, the result must be too, so we need not
> +      * worry about clamping the result.
> +      *
> +      * We deliberately truncate rather than rounding; this is so that
> +      * single adjustments inside a backend can affect the shared estimate
> +      * (see the asymmetric adjustment rules above).
> +      */
> +     return (shared_spins_per_delay * 15 + spins_per_delay) / 16;
>   }
> + 
>   
>   /*
>    * Various TAS implementations that cannot live in s_lock.h as no inline
> *** /home/postgres/pgsql/src/include/storage/proc.h.orig    Sat Aug 20 19:26:34 2005
> --- /home/postgres/pgsql/src/include/storage/proc.h    Sun Sep 11 15:07:09 2005
> ***************
> *** 105,110 ****
> --- 105,112 ----
>   {
>       /* Head of list of free PGPROC structures */
>       SHMEM_OFFSET freeProcs;
> +     /* Current shared estimate of appropriate spins_per_delay value */
> +     int            spins_per_delay;
>   } PROC_HDR;
>   
>   
> *** /home/postgres/pgsql/src/include/storage/s_lock.h.orig    Sun Aug 28 20:41:34 2005
> --- /home/postgres/pgsql/src/include/storage/s_lock.h    Sun Sep 11 15:07:09 2005
> ***************
> *** 826,829 ****
> --- 826,835 ----
>    */
>   extern void s_lock(volatile slock_t *lock, const char *file, int line);
>   
> + /* Support for dynamic adjustment of spins_per_delay */
> + #define DEFAULT_SPINS_PER_DELAY  100
> + 
> + extern void set_spins_per_delay(int shared_spins_per_delay);
> + extern int    update_spins_per_delay(int shared_spins_per_delay);
> + 
>   #endif     /* S_LOCK_H */

> 

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073


Re: Spinlocks, yet again: analysis and proposed patches

From
Simon Riggs
Date:
On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote:
> I wrote:
> > Another thought came to mind: maybe the current data layout for LWLocks
> > is bad.  Right now, the spinlock that protects each LWLock data struct
> > is itself part of the struct, and since the structs aren't large (circa
> > 20 bytes), the whole thing is usually all in the same cache line. ...
> > Maybe it'd be better to allocate the spinlocks off by themselves.
> 
> Well, this is odd.  I made up a patch to do this (attached) and found
> that it pretty much sucks.  Still the 4-way Opteron, previous best
> (slock-no-cmpb and spin-delay-2):
>     1 31s    2 42s    4 51s    8 100s
> with lwlock-separate added:
>     1 31s    2 52s    4 106s    8 213s
> 
> What I had expected to see was a penalty in the single-thread case (due
> to instructions added to the LWLock functions), but there isn't any.
> I did not expect to see a factor-of-2 penalty for four threads.
> 
> I guess what this means is that there's no real problem with losing the
> cache line while manipulating the LWLock, which is what the patch was
> intended to prevent.  Instead, we're paying for swapping two cache lines
> (the spinlock and the LWLock) across processors instead of just one line.
> But that should at worst be a 2x inflation of the time previously spent
> in LWLockAcquire/Release, which is surely not yet all of the application
> ;-).  Why the heck is this so bad?  

(Just going back through the whole thread for completeness.)

> Should we expect that apparently
> minor changes in shared data structures might be costing equivalently
> huge penalties in SMP performance elsewhere?

Yes. That's the advice from Intel and AMD; but we should add that there
is potential for improving performance also.

The possible problem we were trying to avoid here was false sharing of
the cache line by lock requestors of locks whose spin locks were
adjacent in memory.
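
To make that concrete, the failure mode is nothing more exotic than the
following (a contrived sketch with made-up names, not anything from the
real lock tables):

/*
 * Contrived sketch of false sharing.  lock_a and lock_b are logically
 * unrelated, but they sit in the same cache line, so every locked
 * xchg/ldstub that CPU A issues against lock_a forces CPU B to pull
 * the line back before it can so much as read lock_b, and vice versa,
 * even though the two locks are never contended against each other.
 */
typedef unsigned char slock_t;

static struct
{
    slock_t     lock_a;         /* hammered by backends on CPU A */
    char        payload_a[10];  /* data protected by lock_a */
    slock_t     lock_b;         /* hammered by backends on CPU B */
    char        payload_b[10];  /* data protected by lock_b */
} shared_area;                  /* ~22 bytes: comfortably one line */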

Splitting the data structure was just one of the possible ways of
avoiding that. The above shows that the possible solution described
above didn't work, but there could be others.

One thing we tried in February was padding out the statically defined
locks with dummy lock definitions in the enum. This has the effect of
ensuring that the most contested locks are very definitely in their own
cache line and not shared with others.
That showed a noticeable improvement in performance, and it costs very
little to implement, even if the code would require some explanatory
documentation.

The lwlock structure in include/storage/lwlock.h looks like

typedef enum LWLockId
{
    BufMappingLock,
    BufFreelistLock,
    LockMgrLock,
    OidGenLock,
    XidGenLock,
    ProcArrayLock,
    SInvalLock,
    FreeSpaceLock,
    WALInsertLock,
    WALWriteLock,
    ...

Notice that the heavily contested locks (i.e. first 3 and the WAL locks)
are adjacent to at least one other heavily contested lock. So they are
certain to be in the same cache line and therefore to cause false
sharing on all CPU types, not just Intel and AMD (ref: the
manufacturers' optimization handbooks).

This could be replaced by...

typedef enum LWLockId
{
    BufMappingLock,
    BufMappingLock_Padding1,
    BufMappingLock_Padding2,
    BufMappingLock_Padding3,
    BufMappingLock_Padding4,
    BufMappingLock_Padding5,
    BufMappingLock_Padding6,
    BufMappingLock_Padding7,
    BufFreelistLock,
    BufFreelistLock_Padding1,
    BufFreelistLock_Padding2,
    BufFreelistLock_Padding3,
    BufFreelistLock_Padding4,
    BufFreelistLock_Padding5,
    BufFreelistLock_Padding6,
    BufFreelistLock_Padding7,
    LockMgrLock,
    LockMgrLock_Padding1,
    LockMgrLock_Padding2,
    LockMgrLock_Padding3,
    LockMgrLock_Padding4,
    LockMgrLock_Padding5,
    LockMgrLock_Padding6,
    LockMgrLock_Padding7,
    OidGenLock,
    XidGenLock,
    ProcArrayLock,
    SInvalLock,
    FreeSpaceLock,
    WALInsertLock,
    WALInsertLock_Padding1,
    WALInsertLock_Padding2,
    WALInsertLock_Padding3,
    WALInsertLock_Padding4,
    WALInsertLock_Padding5,
    WALInsertLock_Padding6,
    WALInsertLock_Padding7,
    WALWriteLock,
    WALWriteLock_Padding1,
    WALWriteLock_Padding2,
    WALWriteLock_Padding3,
    WALWriteLock_Padding4,
    WALWriteLock_Padding5,
    WALWriteLock_Padding6,
    WALWriteLock_Padding7,
    ...

where the number of padding locks is determined by how many lock
structures fit within a 128 byte cache line.
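
Roughly, the count can be worked out as below (a sketch only; the macro
names are made up, the 20-byte figure is just the "circa 20 bytes"
estimate quoted above, and the enum entries themselves still have to be
written out by hand):

/*
 * How many dummy LWLockIds to insert after each heavily contested
 * lock.  This relies on the LWLock array being allocated densely, one
 * struct per LWLockId, which is what makes dummy entries act as
 * padding at all.
 */
#define CACHE_LINE_SIZE   128   /* bytes, per the vendor optimization manuals */
#define LWLOCK_SIZE       20    /* rough sizeof(LWLock) */

/* locks per cache line, rounded up: ceil(128 / 20) = 7, matching the
 * seven _PaddingN entries used above */
#define LWLOCK_PADDING    ((CACHE_LINE_SIZE + LWLOCK_SIZE - 1) / LWLOCK_SIZE)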

This isn't exactly elegant coding, but it provides a useful improvement
on an 8-way SMP box when run on 8.0 base. OK, let's be brutal: this looks
pretty darn stupid. But it does follow the CPU optimization handbook
advice and I did see a noticeable improvement in performance and a
reduction in context switching.

I'm not in a position to try this again now on 8.1beta, but I'd welcome
a performance test result from anybody that is. I'll supply a patch
against 8.1beta for anyone wanting to test this.

Best Regards, Simon Riggs



Re: Spinlocks, yet again: analysis and proposed patches

From
Emil Briggs
Date:
> where the number of padding locks is determined by how many lock
> structures fit within a 128 byte cache line.
>
> This isn't exactly elegant coding, but it provides a useful improvement
> on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks
> pretty darn stupid. But it does follow the CPU optimization handbook
> advice and I did see a noticeable improvement in performance and a
> reduction in context switching.
>
> I'm not in a position to try this again now on 8.1beta, but I'd welcome
> a performance test result from anybody that is. I'll supply a patch
> against 8.1beta for anyone wanting to test this.
>

I don't have an 8 way available right now but I can run tests on a 4 way 
Opteron if that would be helpful. 

Emil



Re: Spinlocks, yet again: analysis and proposed patches

From
Mark Wong
Date:
On Wed, 12 Oct 2005 18:44:50 +0100
Simon Riggs <simon@2ndquadrant.com> wrote:

> On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote:
> > I wrote:
> > > Another thought came to mind: maybe the current data layout for LWLocks
> > > is bad.  Right now, the spinlock that protects each LWLock data struct
> > > is itself part of the struct, and since the structs aren't large (circa
> > > 20 bytes), the whole thing is usually all in the same cache line. ...
> > > Maybe it'd be better to allocate the spinlocks off by themselves.
> > 
> > Well, this is odd.  I made up a patch to do this (attached) and found
> > that it pretty much sucks.  Still the 4-way Opteron, previous best
> > (slock-no-cmpb and spin-delay-2):
> >     1 31s    2 42s    4 51s    8 100s
> > with lwlock-separate added:
> >     1 31s    2 52s    4 106s    8 213s
> > 
> > What I had expected to see was a penalty in the single-thread case (due
> > to instructions added to the LWLock functions), but there isn't any.
> > I did not expect to see a factor-of-2 penalty for four threads.
> > 
> > I guess what this means is that there's no real problem with losing the
> > cache line while manipulating the LWLock, which is what the patch was
> > intended to prevent.  Instead, we're paying for swapping two cache lines
> > (the spinlock and the LWLock) across processors instead of just one line.
> > But that should at worst be a 2x inflation of the time previously spent
> > in LWLockAcquire/Release, which is surely not yet all of the application
> > ;-).  Why the heck is this so bad?  
> 
> (Just going back through the whole thread for completeness.)
> 
> > Should we expect that apparently
> > minor changes in shared data structures might be costing equivalently
> > huge penalties in SMP performance elsewhere?
> 
> Yes. That's the advice from Intel and AMD; but we should add that there
> is potential for improving performance also.
> 
> The possible problem we were trying to avoid here was false sharing of
> the cache line by lock requestors of locks whose spin locks were
> adjacent in memory.
> 
> Splitting the data structure was just one of the possible ways of
> avoiding that. The above shows that the possible solution described
> above didn't work, but there could be others.
> 
> One thing we tried in February was padding out the statically defined
> locks with dummy lock definitions in the enum. This has the effect of
> ensuring that the most contested locks are very definitely in their own
> cache line and not shared with others.
> That showed a noticeable improvement in performance, probably because it
> costs very little to implement, even if the code would require some
> explanatory documentation. 
> 
> The lwlock structure in include/storage/lwlock.h looks like
> 
> typedef enum LWLockId
> {
>     BufMappingLock,
>     BufFreelistLock,
>     LockMgrLock,
>     OidGenLock,
>     XidGenLock,
>     ProcArrayLock,
>     SInvalLock,
>     FreeSpaceLock,
>     WALInsertLock,
>     WALWriteLock,
>     ...
> 
> Notice that the heavily contested locks (i.e. first 3 and the WAL locks)
> are adjacent to at least one other heavily contested lock. So they are
> certain to be in the same cache line and therefore to cause false
> sharing (on all CPU types, not just Intel and AMD (ref: Manufacturer's
> optimization handbooks).
> 
> This could be replaced by...
> 
> typedef enum LWLockId
> {
>     BufMappingLock,
>     BufMappingLock_Padding1,
>     BufMappingLock_Padding2,
>     BufMappingLock_Padding3,
>     BufMappingLock_Padding4,
>     BufMappingLock_Padding5,
>     BufMappingLock_Padding6,
>     BufMappingLock_Padding7,
>     BufFreelistLock,
>     BufFreelistLock_Padding1,
>     BufFreelistLock_Padding2,
>     BufFreelistLock_Padding3,
>     BufFreelistLock_Padding4,
>     BufFreelistLock_Padding5,
>     BufFreelistLock_Padding6,
>     BufFreelistLock_Padding7,
>     LockMgrLock,
>     LockMgrLock_Padding1,
>     LockMgrLock_Padding2,
>     LockMgrLock_Padding3,
>     LockMgrLock_Padding4,
>     LockMgrLock_Padding5,
>     LockMgrLock_Padding6,
>     LockMgrLock_Padding7,
>     OidGenLock,
>     XidGenLock,
>     ProcArrayLock,
>     SInvalLock,
>     FreeSpaceLock,
>     WALInsertLock,
>     WALInsertLock_Padding1,
>     WALInsertLock_Padding2,
>     WALInsertLock_Padding3,
>     WALInsertLock_Padding4,
>     WALInsertLock_Padding5,
>     WALInsertLock_Padding6,
>     WALInsertLock_Padding7,
>     WALWriteLock,
>     WALWriteLock_Padding1,
>     WALWriteLock_Padding2,
>     WALWriteLock_Padding3,
>     WALWriteLock_Padding4,
>     WALWriteLock_Padding5,
>     WALWriteLock_Padding6,
>     WALWriteLock_Padding7,
>     ...
> 
> where the number of padding locks is determined by how many lock
> structures fit within a 128 byte cache line.
> 
> This isn't exactly elegant coding, but it provides a useful improvement
> on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks
> pretty darn stupid. But it does follow the CPU optimization handbook
> advice and I did see a noticeable improvement in performance and a
> reduction in context switching.
> 
> I'm not in a position to try this again now on 8.1beta, but I'd welcome
> a performance test result from anybody that is. I'll supply a patch
> against 8.1beta for anyone wanting to test this.

I'll take a patch.  I have a 4 processor (8core) POWER5 system just
about ready.

Mark


Re: Spinlocks, yet again: analysis and proposed patches

From
"Jim C. Nasby"
Date:
On Wed, Oct 12, 2005 at 03:07:23PM -0400, Emil Briggs wrote:
> > where the number of padding locks is determined by how many lock
> > structures fit within a 128 byte cache line.
> >
> > This isn't exactly elegant coding, but it provides a useful improvement
> > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks
> > pretty darn stupid. But it does follow the CPU optimization handbook
> > advice and I did see a noticeable improvement in performance and a
> > reduction in context switching.
> >
> > I'm not in a position to try this again now on 8.1beta, but I'd welcome
> > a performance test result from anybody that is. I'll supply a patch
> > against 8.1beta for anyone wanting to test this.
> >
> 
> I don't have an 8 way available right now but I can run tests on a 4 way 
> Opteron if that would be helpful. 

Likewise I can test on a 4 way opteron running open solaris.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Spinlocks, yet again: analysis and proposed patches

From
Simon Riggs
Date:
On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote:
> > 
> > This isn't exactly elegant coding, but it provides a useful improvement
> > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks
> > pretty darn stupid. But it does follow the CPU optimization handbook
> > advice and I did see a noticeable improvement in performance and a
> > reduction in context switching.

> > I'm not in a position to try this again now on 8.1beta, but I'd welcome
> > a performance test result from anybody that is. I'll supply a patch
> > against 8.1beta for anyone wanting to test this.
> 
> Ok, I've produce a few results on a 4 way (8 core) POWER 5 system, which
> I've just set up and probably needs a bit of tuning.  I don't see much
> difference but I'm wondering if the cacheline sizes are dramatically
> different from Intel/AMD processors.  I still need to take a closer look
> to make sure I haven't grossly mistuned anything, but I'll let everyone
> take a look:

Well, the Power 5 architecture probably has the lowest overall memory
delay you can get currently so in some ways that would negate the
effects of the patch. (Cacheline is still 128 bytes, AFAICS). But it's
clear the patch isn't significantly better (like it was with 8.0 when we
tried this on the 8-way Itanium in Feb).

> cvs 20051013
> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/
> 2501 notpm
> 
> cvs 20051013 w/ lw.patch
> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/
> 2519 notpm

Could you re-run with wal_buffers = 32 ? (Without patch) Thanks

Best Regards, Simon Riggs



Re: Spinlocks, yet again: analysis and proposed patches

From
Mark Wong
Date:
On Thu, 20 Oct 2005 23:03:47 +0100
Simon Riggs <simon@2ndquadrant.com> wrote:

> On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote:
> > > 
> > > This isn't exactly elegant coding, but it provides a useful improvement
> > > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks
> > > pretty darn stupid. But it does follow the CPU optimization handbook
> > > advice and I did see a noticeable improvement in performance and a
> > > reduction in context switching.
> 
> > > I'm not in a position to try this again now on 8.1beta, but I'd welcome
> > > a performance test result from anybody that is. I'll supply a patch
> > > against 8.1beta for anyone wanting to test this.
> > 
> > Ok, I've produce a few results on a 4 way (8 core) POWER 5 system, which
> > I've just set up and probably needs a bit of tuning.  I don't see much
> > difference but I'm wondering if the cacheline sizes are dramatically
> > different from Intel/AMD processors.  I still need to take a closer look
> > to make sure I haven't grossly mistuned anything, but I'll let everyone
> > take a look:
> 
> Well, the Power 5 architecture probably has the lowest overall memory
> delay you can get currently so in some ways that would negate the
> effects of the patch. (Cacheline is still 128 bytes, AFAICS). But it's
> clear the patch isn't significantly better (like it was with 8.0 when we
> tried this on the 8-way Itanium in Feb).
> 
> > cvs 20051013
> > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/
> > 2501 notpm
> > 
> > cvs 20051013 w/ lw.patch
> > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/
> > 2519 notpm
> 
> Could you re-run with wal_buffers = 32 ? (Without patch) Thanks

Ok, sorry for the delay.  I've bumped up the wal_buffers to 2048 and
redid the disk layout.  Here's where I'm at now:

cvs 20051013
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/40/
3257 notpm

cvs 20051013 w/ lw.patch
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/42/
3285 notpm

Still not much of a difference with the patch.  A quick glance over the
iostat data suggests I'm still not i/o bound, but the i/o wait is rather
high according to vmstat.  Will try to see if there's anything else
obvious to get the load up higher.

Mark


Re: Spinlocks, yet again: analysis and proposed patches

From
Simon Riggs
Date:
On Mon, 2005-10-31 at 16:10 -0800, Mark Wong wrote:
> On Thu, 20 Oct 2005 23:03:47 +0100
> Simon Riggs <simon@2ndquadrant.com> wrote:
> 
> > On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote:
> > > > 
> > > > This isn't exactly elegant coding, but it provides a useful improvement
> > > > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks
> > > > pretty darn stupid. But it does follow the CPU optimization handbook
> > > > advice and I did see a noticeable improvement in performance and a
> > > > reduction in context switching.
> > 
> > > > I'm not in a position to try this again now on 8.1beta, but I'd welcome
> > > > a performance test result from anybody that is. I'll supply a patch
> > > > against 8.1beta for anyone wanting to test this.
> > > 
> > > Ok, I've produce a few results on a 4 way (8 core) POWER 5 system, which
> > > I've just set up and probably needs a bit of tuning.  I don't see much
> > > difference but I'm wondering if the cacheline sizes are dramatically
> > > different from Intel/AMD processors.  I still need to take a closer look
> > > to make sure I haven't grossly mistuned anything, but I'll let everyone
> > > take a look:
> > 
> > Well, the Power 5 architecture probably has the lowest overall memory
> > delay you can get currently so in some ways that would negate the
> > effects of the patch. (Cacheline is still 128 bytes, AFAICS). But it's
> > clear the patch isn't significantly better (like it was with 8.0 when we
> > tried this on the 8-way Itanium in Feb).
> > 
> > > cvs 20051013
> > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/
> > > 2501 notpm
> > > 
> > > cvs 20051013 w/ lw.patch
> > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/
> > > 2519 notpm
> > 
> > Could you re-run with wal_buffers = 32 ? (Without patch) Thanks
> 
> Ok, sorry for the delay.  I've bumped up the wal_buffers to 2048 and
> redid the disk layout.  Here's where I'm at now:
> 
> cvs 20051013
> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/40/
> 3257 notpm
> 
> cvs 20051013 w/ lw.patch
> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/42/
> 3285 notpm
> 
> Still not much of a difference with the patch.  A quick glance over the
> iostat data suggests I'm still not i/o bound, but the i/o wait is rather
> high according to vmstat.  Will try to see if there's anything else
> obvious to get the load up higher.

OK, that's fine. I'm glad there's some gain, but not much yet. I think we
should leave out doing any more tests on lw.patch for now.

Concerned about the awful checkpointing. Can you bump wal_buffers to
8192 just to make sure? That's way too high, but just to prove it.

We need to reduce the number of blocks to be written at checkpoint.

 bgwriter_all_maxpages   5      ->  15
 bgwriter_all_percent    0.333
 bgwriter_delay          200
 bgwriter_lru_maxpages   5      ->  7
 bgwriter_lru_percent    1

 shared_buffers         set lower to 100000
 (which should cause some amusement on-list)

Best Regards, Simon Riggs



Re: Spinlocks, yet again: analysis and proposed patches

From
Mark Wong
Date:
On Tue, 01 Nov 2005 07:32:32 +0000
Simon Riggs <simon@2ndquadrant.com> wrote:

> On Mon, 2005-10-31 at 16:10 -0800, Mark Wong wrote:
> > On Thu, 20 Oct 2005 23:03:47 +0100
> > Simon Riggs <simon@2ndquadrant.com> wrote:
> > 
> > > On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote:
> > > > > 
> > > > > This isn't exactly elegant coding, but it provides a useful improvement
> > > > > on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks
> > > > > pretty darn stupid. But it does follow the CPU optimization handbook
> > > > > advice and I did see a noticeable improvement in performance and a
> > > > > reduction in context switching.
> > > 
> > > > > I'm not in a position to try this again now on 8.1beta, but I'd welcome
> > > > > a performance test result from anybody that is. I'll supply a patch
> > > > > against 8.1beta for anyone wanting to test this.
> > > > 
> > > > Ok, I've produce a few results on a 4 way (8 core) POWER 5 system, which
> > > > I've just set up and probably needs a bit of tuning.  I don't see much
> > > > difference but I'm wondering if the cacheline sizes are dramatically
> > > > different from Intel/AMD processors.  I still need to take a closer look
> > > > to make sure I haven't grossly mistuned anything, but I'll let everyone
> > > > take a look:
> > > 
> > > Well, the Power 5 architecture probably has the lowest overall memory
> > > delay you can get currently so in some ways that would negate the
> > > effects of the patch. (Cacheline is still 128 bytes, AFAICS). But it's
> > > clear the patch isn't significantly better (like it was with 8.0 when we
> > > tried this on the 8-way Itanium in Feb).
> > > 
> > > > cvs 20051013
> > > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/
> > > > 2501 notpm
> > > > 
> > > > cvs 20051013 w/ lw.patch
> > > > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/
> > > > 2519 notpm
> > > 
> > > Could you re-run with wal_buffers = 32 ? (Without patch) Thanks
> > 
> > Ok, sorry for the delay.  I've bumped up the wal_buffers to 2048 and
> > redid the disk layout.  Here's where I'm at now:
> > 
> > cvs 20051013
> > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/40/
> > 3257 notpm
> > 
> > cvs 20051013 w/ lw.patch
> > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/42/
> > 3285 notpm
> > 
> > Still not much of a difference with the patch.  A quick glance over the
> > iostat data suggests I'm still not i/o bound, but the i/o wait is rather
> > high according to vmstat.  Will try to see if there's anything else
> > obvious to get the load up higher.
> 
> OK, thats fine. I'm glad there's some gain, but not much yet. I think we
> should leave out doing any more tests on lw.patch for now.
> 
> Concerned about the awful checkpointing. Can you bump wal_buffers to
> 8192 just to make sure? Thats way too high, but just to prove it.
> 
> We need to rdeuce the number of blocks to be written at checkpoint.
> 
>  bgwriter_all_maxpages   5      ->  15
>  bgwriter_all_percent    0.333
>  bgwriter_delay          200  
>  bgwriter_lru_maxpages   5        ->  7
>  bgwriter_lru_percent    1
> 
>  shared_buffers         set lower to 100000
>  (which should cause some amusement on-list)


Okay, here goes, all with the same source base w/ the lw.patch:

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/44/
only increased wal_buffers to 8192 from 2048
3242 notpm

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/43/
only increased bgwriter_all_maxpages to 15, and bgwriter_lru_maxpages to 7
3019 notpm (but more interesting graph)

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/45/
Same as the previously listed run, with shared_buffers lowered to 10000
2503 notpm

Mark


Re: Spinlocks, yet again: analysis and proposed patches

From
Simon Riggs
Date:
On Thu, 2005-11-03 at 08:03 -0800, Mark Wong wrote:
> On Tue, 01 Nov 2005 07:32:32 +0000
> Simon Riggs <simon@2ndquadrant.com> wrote:
> > Concerned about the awful checkpointing. Can you bump wal_buffers to
> > 8192 just to make sure? That's way too high, but just to prove it.
> > 
> > We need to reduce the number of blocks to be written at checkpoint.
> > 
> >  bgwriter_all_maxpages   5      ->  15
> >  bgwriter_all_percent    0.333
> >  bgwriter_delay          200  
> >  bgwriter_lru_maxpages   5        ->  7
> >  bgwriter_lru_percent    1
> > 
> >  shared_buffers         set lower to 100000
> >  (which should cause some amusement on-list)
> 
> 
> Okay, here goes, all with the same source base w/ the lw.patch:
> 
> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/44/
> only increased wal_buffers to 8192 from 2048
> 3242 notpm

That looks to me like a clear negative effect from increasing
wal_buffers. Try putting it back down to 1024.
Looks like we need to plug that gap.

> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/43/
> only increased bgwriter_all_maxpages to 15, and bgwriter_lru_maxpages to 7
> 3019 notpm (but more interesting graph)

Man, that sucks. What the heck is happening there? Hackers - if you're
watching, you should see this graph - it shows some very poor behaviour.

I'm not happy with that performance at all... any chance you could re-run
that exact same test to see whether it's repeatable?

I see you have 
vm.dirty_writeback_centisecs = 0

which pretty much means the pdflush daemons never write anything to disk,
even when the bgwriter is active.

Could we set the bgwriter stuff back to default and try 
vm.dirty_writeback_centisecs = 500

> http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/45/
> Same as the previously listed run, with shared_buffers lowered to 10000
> 2503 notpm

Sorry, that was 100,000 not 10,000. 

Looks like we need dates on the log_line_prefix so we can check the
logs.

...not sure about the oprofile results. Seems to show CreateLWLocks
being as high as xlog_insert, which is mad. Either that shows startup
time is excessive, or it means the oprofile timing range is too short.
Not sure which.

Best Regards, Simon Riggs





Re: Spinlocks, yet again: analysis and proposed patches

From
Mark Wong
Date:
On Thu, 03 Nov 2005 18:29:09 +0000
Simon Riggs <simon@2ndquadrant.com> wrote:

> On Thu, 2005-11-03 at 08:03 -0800, Mark Wong wrote:
> > On Tue, 01 Nov 2005 07:32:32 +0000
> > Simon Riggs <simon@2ndquadrant.com> wrote:
> > > Concerned about the awful checkpointing. Can you bump wal_buffers to
> > > 8192 just to make sure? That's way too high, but just to prove it.
> > > 
> > > We need to reduce the number of blocks to be written at checkpoint.
> > > 
> > >  bgwriter_all_maxpages   5      ->  15
> > >  bgwriter_all_percent    0.333
> > >  bgwriter_delay          200  
> > >  bgwriter_lru_maxpages   5        ->  7
> > >  bgwriter_lru_percent    1
> > > 
> > >  shared_buffers         set lower to 100000
> > >  (which should cause some amusement on-list)
> > 
> > 
> > Okay, here goes, all with the same source base w/ the lw.patch:
> > 
> > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/44/
> > only increased wal_buffers to 8192 from 2048
> > 3242 notpm
> 
> That looks to me like a clear negative effect from increasing
> wal_buffers. Try putting it back down to 1024.
> Looks like we need to plug that gap.
> 
> > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/43/
> > only increased bgwriter_all_maxpages to 15, and bgwriter_lru_maxpages to 7
> > 3019 notpm (but more interesting graph)
> 
> Man, that sucks. What the heck is happening there? Hackers - if you're
> watching, you should see this graph - it shows some very poor behaviour.
> 
> I'm not happy with that performance at all... any chance you could re-run
> that exact same test to see whether it's repeatable?
> 
> I see you have 
> vm.dirty_writeback_centisecs = 0
> 
> which pretty much means the pdflush daemons never write anything to disk,
> even when the bgwriter is active.
> 
> Could we set the bgwriter stuff back to default and try 
> vm.dirty_writeback_centisecs = 500

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/47/
3309 notpm
> > http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/45/
> > Same as the previously listed run, with shared_buffers lowered to 10000
> > 2503 notpm
> 
> Sorry, that was 100,000 not 10,000. 

Oops!
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/46/
2794 notpm

> Looks like we need dates on the log_line_prefix so we can check the
> logs.

Oops again!  I didn't check to make sure I had set this correctly before
I ran the last two tests; I'll get on it.

> ...not sure about the oprofile results. Seems to show CreateLWLocks
> being as high as xlog_insert, which is mad. Either that shows startup
> time is excessive, or it means the oprofile timing range is too short.
> Not sure which.

Yeah, we've seen this before.  I think I'll have to try pulling the
oprofile cvs code to see if there's any improvement.  I've been working
with oprofile-0.9.1.

Mark


Re: Spinlocks, yet again: analysis and proposed patches

From
Bruce Momjian
Date:
Added to TODO list.

---------------------------------------------------------------------------

Simon Riggs wrote:
> On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote:
> > I wrote:
> > > Another thought came to mind: maybe the current data layout for LWLocks
> > > is bad.  Right now, the spinlock that protects each LWLock data struct
> > > is itself part of the struct, and since the structs aren't large (circa
> > > 20 bytes), the whole thing is usually all in the same cache line. ...
> > > Maybe it'd be better to allocate the spinlocks off by themselves.
> > 
> > Well, this is odd.  I made up a patch to do this (attached) and found
> > that it pretty much sucks.  Still the 4-way Opteron, previous best
> > (slock-no-cmpb and spin-delay-2):
> >     1 31s    2 42s    4 51s    8 100s
> > with lwlock-separate added:
> >     1 31s    2 52s    4 106s    8 213s
> > 
> > What I had expected to see was a penalty in the single-thread case (due
> > to instructions added to the LWLock functions), but there isn't any.
> > I did not expect to see a factor-of-2 penalty for four threads.
> > 
> > I guess what this means is that there's no real problem with losing the
> > cache line while manipulating the LWLock, which is what the patch was
> > intended to prevent.  Instead, we're paying for swapping two cache lines
> > (the spinlock and the LWLock) across processors instead of just one line.
> > But that should at worst be a 2x inflation of the time previously spent
> > in LWLockAcquire/Release, which is surely not yet all of the application
> > ;-).  Why the heck is this so bad?  
> 
> (Just going back through the whole thread for completeness.)
> 
> > Should we expect that apparently
> > minor changes in shared data structures might be costing equivalently
> > huge penalties in SMP performance elsewhere?
> 
> Yes. That's the advice from Intel and AMD; but we should add that there
> is potential for improving performance also.
> 
> The possible problem we were trying to avoid here was false sharing of
> the cache line by lock requestors of locks whose spin locks were
> adjacent in memory.
> 
> Splitting the data structure was just one of the possible ways of
> avoiding that. The above shows that the possible solution described
> above didn't work, but there could be others.
> 
> One thing we tried in February was padding out the statically defined
> locks with dummy lock definitions in the enum. This has the effect of
> ensuring that the most contested locks are very definitely in their own
> cache line and not shared with others.
> That showed a noticeable improvement in performance, probably because it
> costs very little to implement, even if the code would require some
> explanatory documentation. 
> 
> The lwlock structure in include/storage/lwlock.h looks like
> 
> typedef enum LWLockId
> {
>     BufMappingLock,
>     BufFreelistLock,
>     LockMgrLock,
>     OidGenLock,
>     XidGenLock,
>     ProcArrayLock,
>     SInvalLock,
>     FreeSpaceLock,
>     WALInsertLock,
>     WALWriteLock,
>     ...
> 
> Notice that the heavily contested locks (i.e. the first 3 and the WAL locks)
> are adjacent to at least one other heavily contested lock. So they are
> certain to be in the same cache line and therefore to cause false
> sharing on all CPU types, not just Intel and AMD (ref: manufacturers'
> optimization handbooks).
> 
> This could be replaced by...
> 
> typedef enum LWLockId
> {
>     BufMappingLock,
>     BufMappingLock_Padding1,
>     BufMappingLock_Padding2,
>     BufMappingLock_Padding3,
>     BufMappingLock_Padding4,
>     BufMappingLock_Padding5,
>     BufMappingLock_Padding6,
>     BufMappingLock_Padding7,
>     BufFreelistLock,
>     BufFreelistLock_Padding1,
>     BufFreelistLock_Padding2,
>     BufFreelistLock_Padding3,
>     BufFreelistLock_Padding4,
>     BufFreelistLock_Padding5,
>     BufFreelistLock_Padding6,
>     BufFreelistLock_Padding7,
>     LockMgrLock,
>     LockMgrLock_Padding1,
>     LockMgrLock_Padding2,
>     LockMgrLock_Padding3,
>     LockMgrLock_Padding4,
>     LockMgrLock_Padding5,
>     LockMgrLock_Padding6,
>     LockMgrLock_Padding7,
>     OidGenLock,
>     XidGenLock,
>     ProcArrayLock,
>     SInvalLock,
>     FreeSpaceLock,
>     WALInsertLock,
>     WALInsertLock_Padding1,
>     WALInsertLock_Padding2,
>     WALInsertLock_Padding3,
>     WALInsertLock_Padding4,
>     WALInsertLock_Padding5,
>     WALInsertLock_Padding6,
>     WALInsertLock_Padding7,
>     WALWriteLock,
>     WALWriteLock_Padding1,
>     WALWriteLock_Padding2,
>     WALWriteLock_Padding3,
>     WALWriteLock_Padding4,
>     WALWriteLock_Padding5,
>     WALWriteLock_Padding6,
>     WALWriteLock_Padding7,
>     ...
> 
> where the number of padding locks is determined by how many lock
> structures fit within a 128 byte cache line.
> 
> This isn't exactly elegant coding, but it provides a useful improvement
> on an 8-way SMP box when run on 8.0 base. OK, let's be brutal: this looks
> pretty darn stupid. But it does follow the CPU optimization handbook
> advice and I did see a noticeable improvement in performance and a
> reduction in context switching.
> 
> I'm not in a position to try this again now on 8.1beta, but I'd welcome
> a performance test result from anybody that is. I'll supply a patch
> against 8.1beta for anyone wanting to test this.
> 
> Best Regards, Simon Riggs
> 
> 
--
  Bruce Momjian                    http://candle.pha.pa.us
  EnterpriseDB                     http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +
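
As an aside on the enum-padding idea quoted above, here is a minimal sketch of
how the dummy entries could be generated with a helper macro rather than being
spelled out by hand.  It assumes eight LWLockId slots per 128-byte cache line;
PAD_HOT_LOCK and the *_PadN names are hypothetical, not identifiers from the
actual lwlock.h, and the enum is truncated to the locks named above:

    /* Reserve a whole (assumed 128-byte, 8-slot) cache line per hot lock. */
    #define PAD_HOT_LOCK(name) \
        name, \
        name##_Pad1, name##_Pad2, name##_Pad3, \
        name##_Pad4, name##_Pad5, name##_Pad6, name##_Pad7

    typedef enum LWLockId
    {
        PAD_HOT_LOCK(BufMappingLock),
        PAD_HOT_LOCK(BufFreelistLock),
        PAD_HOT_LOCK(LockMgrLock),
        OidGenLock,
        XidGenLock,
        ProcArrayLock,
        SInvalLock,
        FreeSpaceLock,
        PAD_HOT_LOCK(WALInsertLock),
        PAD_HOT_LOCK(WALWriteLock)
    } LWLockId;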


Re: Spinlocks, yet again: analysis and proposed patches

From
"Qingqing Zhou"
Date:
"Bruce Momjian" <pgman@candle.pha.pa.us> wrote
>
> Added to TODO list.
>
> > One thing we tried in February was padding out the statically defined
> > locks with dummy lock definitions in the enum. This has the effect of
> > ensuring that the most contested locks are very definitely in their own
> > cache line and not shared with others.
> > That showed a noticeable improvement in performance, probably because it
> > costs very little to implement, even if the code would require some
> > explanatory documentation.
> >

Has this been done? See the LWLOCK_PADDED_SIZE macro in code.

Regards,
Qingqing




Re: Spinlocks, yet again: analysis and proposed patches

From
Bruce Momjian
Date:
Qingqing Zhou wrote:
> 
> "Bruce Momjian" <pgman@candle.pha.pa.us> wrote
> >
> > Added to TODO list.
> >
> > > One thing we tried in February was padding out the statically defined
> > > locks with dummy lock definitions in the enum. This has the effect of
> > > ensuring that the most contested locks are very definitely in their own
> > > cache line and not shared with others.
> > > That showed a noticeable improvement in performance, probably because it
> > > costs very little to implement, even if the code would require some
> > > explanatory documentation.
> > >
> 
> Has this been done? See the LWLOCK_PADDED_SIZE macro in code.

Oh, yes, thanks.  I thought it had been, but I couldn't find anything in the
area of the code where he proposed the patch.

--
  Bruce Momjian                    http://candle.pha.pa.us
  EnterpriseDB                     http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +


Re: Spinlocks, yet again: analysis and proposed patches

From
Tom Lane
Date:
"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:
>>> One thing we tried in February was padding out the statically defined
>>> locks with dummy lock definitions in the enum.

> Has this been done? See the LWLOCK_PADDED_SIZE macro in code.

Not really --- that patch was intended to ensure that LWLocks don't
unnecessarily cross two cache lines.  It doesn't ensure that two
different LWLocks aren't sharing a cache line.  You could do that
by increasing LWLOCK_PADDED_SIZE to the cache line size for your
hardware, if you know what that is.

I think a more effective answer might be to twiddle the order of
"enum LWLockId" items so that the most heavily used LWLocks aren't
close together.  Haven't looked into it though.
        regards, tom lane
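
For reference, a minimal sketch of the padding arrangement Tom describes,
assuming a 128-byte cache line.  The LWLock fields below are abbreviations
rather than the exact definition in lwlock.c; only the union-with-pad idea
mirrors what LWLOCK_PADDED_SIZE actually does in the tree:

    #define LWLOCK_PADDED_SIZE  128     /* raise to the cache-line size */

    typedef struct LWLock
    {
        unsigned char mutex;        /* spinlock protecting the other fields */
        char          exclusive;    /* # of exclusive holders (0 or 1) */
        int           shared;       /* # of shared holders */
        void         *head;         /* head of wait queue (abridged) */
        void         *tail;         /* tail of wait queue (abridged) */
    } LWLock;

    /* Pad each lock so no two locks share, or straddle, a cache line. */
    typedef union LWLockPadded
    {
        LWLock lock;
        char   pad[LWLOCK_PADDED_SIZE];
    } LWLockPadded;

    static LWLockPadded LWLockArray[64];    /* array size illustrative only */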


Re: Spinlocks, yet again: analysis and proposed patches

From
"Qingqing Zhou"
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> wrote
>
> Not really --- that patch was intended to ensure that LWLocks don't
> unnecessarily cross two cache lines.  It doesn't ensure that two
> different LWLocks aren't sharing a cache line.  You could do that
> by increasing LWLOCK_PADDED_SIZE to the cache line size for your
> hardware, if you know what that is.
>
Exactly, this is one way -- if we make LWLOCK_PADDED_SIZE big enough, we can
ensure that each LWLock gets its own cache line. If so, maybe we should plug a
check like LMBench into ./configure to guess the current cache line size. But
that looks like overkill -- a compromise is to pad only some of the LWLocks
(for example, the buffer content lock array) rather than all of them.

Regards,
Qingqing
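
For what it's worth, a small probe along the lines Qingqing suggests could look
like the sketch below.  _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension that
may be missing, or may return 0, on other platforms, so the 128-byte fallback
here is an assumption rather than anything ./configure currently does:

    #include <stdio.h>
    #include <unistd.h>

    int
    main(void)
    {
        long line_size = -1;

    #ifdef _SC_LEVEL1_DCACHE_LINESIZE
        /* glibc extension; not portable beyond Linux/glibc */
        line_size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    #endif
        if (line_size <= 0)
            line_size = 128;        /* assumed fallback, e.g. POWER5 */

        printf("%ld\n", line_size);
        return 0;
    }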