Thread: futex results with dbt-3
Hi guys, I have some DBT-3 (decision support) results using Gavin's original futex patch fix. It's on our 8-way Pentium III Xeon systems in our STP environment. We definitely see some overall throughput improvement on the tests, about a 15% increase, but no change with respect to the number of context switches. Perhaps it doesn't really address what's causing the incredible number of context switches, but it still helps. I think I'm seeing what Gavin has seen: it seems to solve some concurrency problems on x86 platforms. Here are the results without futexes: http://khack.osdl.org/stp/298114/ Results with futexes: http://khack.osdl.org/stp/298115/ Mark
On Thu, 2004-10-14 at 04:57, Mark Wong wrote: > I have some DBT-3 (decision support) results using Gavin's original > futex patch fix. I sent an initial description of the futex patch to the mailing lists last week, but it never appeared (from talking to Marc I believe it exceeded the size limit on -performance). In any case, the "futex patch" uses the Linux 2.6 futex API to implement PostgreSQL spinlocks. The hope is that using futexes will lead to better performance when there is contention for spinlocks (e.g. on a busy SMP system). The original patch was written by Stephen Hemminger at OSDL; Gavin and I have done a bunch of additional bugfixing and optimization, as well as added IA64 support. I've attached a WIP copy of the patch to this email (it supports x86, x86-64 (untested) and IA64 -- more architectures can be added on request). I'll post a longer writeup when I submit the patch to -patches. > Definitely see some overall throughput performance on the tests, about > 15% increase, but no change with respect to the number of context > switches. I'm glad to see that there is a performance improvement; in my own testing on an 8-way P3 system provided by OSDL, I saw a similar improvement in pgbench performance (50 concurrent clients, 1000 transactions each, scale factor 75; without the patch, TPS was between 180 and 185; with the patch, TPS was between 200 and 215). As for context switching, there was some earlier speculation that the patch might improve or even resolve the "CS storm" issue that some people have experienced with SMP Xeon P4 systems. I don't think we have enough evidence to answer that one way or the other at this point. -Neil
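For readers who haven't seen the futex API before, the basic idea looks roughly like the sketch below. This is not the patch itself: the function names are illustrative, it uses GCC atomic builtins for brevity, and error handling is omitted. The point is only that the uncontended path is pure user-space atomics, and the kernel (via the futex syscall) is entered only when a process actually has to sleep or be woken.

/* Minimal sketch of a futex-backed exclusive lock, NOT the patch itself.
 * Assumes Linux 2.6 and GCC atomic builtins; error handling omitted.
 * Lock word: 0 = free, 1 = locked, 2 = locked with waiters. */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void
futex_lock(volatile int *lock)
{
    int c = __sync_val_compare_and_swap(lock, 0, 1);

    while (c != 0)
    {
        /* Mark the lock contended, then sleep until it might be free. */
        if (c == 2 || __sync_val_compare_and_swap(lock, 1, 2) != 0)
            syscall(SYS_futex, lock, FUTEX_WAIT, 2, NULL, NULL, 0);
        c = __sync_val_compare_and_swap(lock, 0, 2);
    }
}

static void
futex_unlock(volatile int *lock)
{
    if (__sync_fetch_and_sub(lock, 1) != 1)  /* someone may be sleeping */
    {
        *lock = 0;
        syscall(SYS_futex, lock, FUTEX_WAKE, 1, NULL, NULL, 0);
    }
}

The lock word lives in ordinary shared memory, so (unlike a SysV semaphore) no per-lock kernel object has to be created up front.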
Neil wrote:
> In any case, the "futex patch" uses the Linux 2.6 futex API to implement
> PostgreSQL spinlocks.

Has anyone tried to replace the whole lwlock implementation with pthread_rwlock? At least for Linux with recent glibcs, pthread_rwlock is implemented with futexes, i.e. we would get fast lock handling without OS-specific hacks. Perhaps other OSes contain user-space pthread locks, too. Attached is an old patch. I tested it on a uniprocessor system a year ago and it didn't provide much difference, but perhaps the scalability is better. You'll have to add -lpthread to the library list for linking.

Regarding Neil's patch:

>! /*
>!  * XXX: is there a more efficient way to write this? Perhaps using
>!  * decl...?
>!  */
>! static __inline__ slock_t
>! atomic_dec(volatile slock_t *ptr)
>! {
>!     slock_t prev = -1;
>!
>!     __asm__ __volatile__(
>!         " lock \n"
>!         " xadd %0,%1 \n"
>!         :"=q"(prev)
>!         :"m"(*ptr), "0"(prev)
>!         :"memory", "cc");
>!
>!     return prev;
>! }

xadd is not supported by original 80386 CPUs; it was added with the 80486. There is no 80386 instruction that atomically decrements an integer and retrieves the old value. The only options are atomic_dec_test_zero or atomic_dec_test_negative - those can be implemented by looking at the zero/sign flag. Depending on what you want, this may be enough. Or make the futex code conditional on > 80386 CPUs.

-- Manfred

--- p7.3.3.orig/src/backend/storage/lmgr/lwlock.c  2002-09-25 22:31:40.000000000 +0200
+++ postgresql-7.3.3/src/backend/storage/lmgr/lwlock.c  2003-09-06 14:15:01.000000000 +0200
@@ -26,6 +26,28 @@
 #include "storage/proc.h"
 #include "storage/spin.h"
 
+#define USE_PTHREAD_LOCKS
+
+#ifdef USE_PTHREAD_LOCKS
+
+#include <pthread.h>
+#include <errno.h>
+typedef pthread_rwlock_t LWLock;
+
+inline static void
+InitLWLock(LWLock *p)
+{
+	pthread_rwlockattr_t rwattr;
+	int i;
+
+	pthread_rwlockattr_init(&rwattr);
+	pthread_rwlockattr_setpshared(&rwattr, PTHREAD_PROCESS_SHARED);
+	i=pthread_rwlock_init(p, &rwattr);
+	pthread_rwlockattr_destroy(&rwattr);
+	if (i)
+		elog(FATAL, "pthread_rwlock_init failed");
+}
+#else
 
 typedef struct LWLock
 {
@@ -38,6 +60,17 @@
 	/* tail is undefined when head is NULL */
 } LWLock;
 
+inline static void
+InitLWLock(LWLock *lock)
+{
+	SpinLockInit(&lock->mutex);
+	lock->releaseOK = true;
+	lock->exclusive = 0;
+	lock->shared = 0;
+	lock->head = NULL;
+	lock->tail = NULL;
+}
+#endif
 /*
  * This points to the array of LWLocks in shared memory. Backends inherit
  * the pointer by fork from the postmaster. LWLockIds are indexes into
@@ -61,7 +94,7 @@
 static LWLockId held_lwlocks[MAX_SIMUL_LWLOCKS];
 
 
-#ifdef LOCK_DEBUG
+#if defined(LOCK_DEBUG) && !defined(USE_PTHREAD_LOCKS)
 bool Trace_lwlocks = false;
 
 inline static void
@@ -153,12 +186,7 @@
 	 */
 	for (id = 0, lock = LWLockArray; id < numLocks; id++, lock++)
 	{
-		SpinLockInit(&lock->mutex);
-		lock->releaseOK = true;
-		lock->exclusive = 0;
-		lock->shared = 0;
-		lock->head = NULL;
-		lock->tail = NULL;
+		InitLWLock(lock);
 	}
 
 	/*
@@ -185,7 +213,116 @@
 	return (LWLockId) (LWLockCounter[0]++);
 }
 
+#ifdef USE_PTHREAD_LOCKS
+/*
+ * LWLockAcquire - acquire a lightweight lock in the specified mode
+ *
+ * If the lock is not available, sleep until it is.
+ *
+ * Side effect: cancel/die interrupts are held off until lock release.
+ */
+void
+LWLockAcquire(LWLockId lockid, LWLockMode mode)
+{
+	int i;
+	PRINT_LWDEBUG("LWLockAcquire", lockid, &LWLockArray[lockid]);
+
+	/*
+	 * We can't wait if we haven't got a PGPROC. This should only occur
+	 * during bootstrap or shared memory initialization. Put an Assert
+	 * here to catch unsafe coding practices.
+	 */
+	Assert(!(proc == NULL && IsUnderPostmaster));
+
+	/*
+	 * Lock out cancel/die interrupts until we exit the code section
+	 * protected by the LWLock. This ensures that interrupts will not
+	 * interfere with manipulations of data structures in shared memory.
+	 */
+	HOLD_INTERRUPTS();
+
+	if (mode == LW_EXCLUSIVE) {
+		i = pthread_rwlock_wrlock(&LWLockArray[lockid]);
+	} else {
+		i = pthread_rwlock_rdlock(&LWLockArray[lockid]);
+	}
+	if (i)
+		elog(FATAL, "Unexpected error from pthread_rwlock.");
+
+	/* Add lock to list of locks held by this backend */
+	Assert(num_held_lwlocks < MAX_SIMUL_LWLOCKS);
+	held_lwlocks[num_held_lwlocks++] = lockid;
+}
+
+/*
+ * LWLockConditionalAcquire - acquire a lightweight lock in the specified mode
+ *
+ * If the lock is not available, return FALSE with no side-effects.
+ *
+ * If successful, cancel/die interrupts are held off until lock release.
+ */
+bool
+LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
+{
+	int i;
+	PRINT_LWDEBUG("LWLockConditionalAcquire", lockid, &LWLockArray[lockid]);
+
+	HOLD_INTERRUPTS();
+
+	if (mode == LW_EXCLUSIVE) {
+		i = pthread_rwlock_trywrlock(&LWLockArray[lockid]);
+	} else {
+		i = pthread_rwlock_tryrdlock(&LWLockArray[lockid]);
+	}
+	switch(i) {
+	case 0:
+		/* Add lock to list of locks held by this backend */
+		Assert(num_held_lwlocks < MAX_SIMUL_LWLOCKS);
+		held_lwlocks[num_held_lwlocks++] = lockid;
+		return true;
+	case EBUSY:
+		RESUME_INTERRUPTS();
+		return false;
+	default:
+		elog(FATAL, "Unexpected error from pthread_rwlock_try.");
+		return false;
+	}
+}
+
+/*
+ * LWLockRelease - release a previously acquired lock
+ */
+void
+LWLockRelease(LWLockId lockid)
+{
+	int i;
+
+	/*
+	 * Remove lock from list of locks held. Usually, but not always, it
+	 * will be the latest-acquired lock; so search array backwards.
+	 */
+	for (i = num_held_lwlocks; --i >= 0;)
+	{
+		if (lockid == held_lwlocks[i])
+			break;
+	}
+	if (i < 0)
+		elog(ERROR, "LWLockRelease: lock %d is not held", (int) lockid);
+	num_held_lwlocks--;
+	for (; i < num_held_lwlocks; i++)
+		held_lwlocks[i] = held_lwlocks[i + 1];
+
+	i = pthread_rwlock_unlock(&LWLockArray[lockid]);
+	if (i)
+		elog(FATAL, "Unexpected error from pthread_rwlock_unlock.");
+
+	/*
+	 * Now okay to allow cancel/die interrupts.
+	 */
+	RESUME_INTERRUPTS();
+}
+#else
 /*
  * LWLockAcquire - acquire a lightweight lock in the specified mode
  *
  * If the lock is not available, sleep until it is.
  *
  * Side effect: cancel/die interrupts are held off until lock release.
  */
@@ -499,6 +636,7 @@
 	RESUME_INTERRUPTS();
 }
 
+#endif
 
 /*
  * LWLockReleaseAll - release all currently-held locks
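To make the 80386 alternative concrete, here is a hedged sketch of an atomic decrement that reports the sign of the result via the sign flag, using only lock/decl and SETcc (all available on the 386). It is not taken from either patch, and the function name is illustrative:

/* Sketch only, not from either patch: 80386-compatible atomic decrement
 * that tells you whether the result went negative, instead of returning
 * the previous value the way xadd does. */
static __inline__ int
atomic_dec_test_negative(volatile int *ptr)
{
    unsigned char negative;

    __asm__ __volatile__(
        " lock        \n"
        " decl %0     \n"   /* atomically decrement *ptr */
        " sets %1     \n"   /* negative = 1 if the new value is < 0 */
        : "+m"(*ptr), "=q"(negative)
        :
        : "memory", "cc");
    return negative;
}

Whether something like this is enough depends on whether the futex code actually needs the old value, or only needs to know when the count drops below zero.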
Manfred Spraul <manfred@colorfullife.com> writes: > Has anyone tried to replace the whole lwlock implementation with > pthread_rwlock? At least for Linux with recent glibcs, pthread_rwlock is > implemented with futexes, i.e. we would get a fast lock handling without > os specific hacks. "At least for Linux" does not strike me as equivalent to "without OS-specific hacks". The bigger problem here is that the SMP locking bottlenecks we are currently seeing are *hardware* issues (AFAICT anyway). The only way that futexes can offer a performance win is if they have a smarter way of executing the basic atomic-test-and-set sequence than we do; and if so, we could read their code and adopt that method without having to buy into any large reorganization of our code. regards, tom lane
Tom, > The bigger problem here is that the SMP locking bottlenecks we are > currently seeing are *hardware* issues (AFAICT anyway). The only way > that futexes can offer a performance win is if they have a smarter way > of executing the basic atomic-test-and-set sequence than we do; > and if so, we could read their code and adopt that method without having > to buy into any large reorganization of our code. Well, initial results from Gavin/Neil's patch seem to indicate that, while futexes do not cure the CSStorm bug, they do lessen its effects in terms of real performance loss. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: >> The bigger problem here is that the SMP locking bottlenecks we are >> currently seeing are *hardware* issues (AFAICT anyway). > Well, initial results from Gavin/Neil's patch seem to indicate that, while > futexes do not cure the CSStorm bug, they do lessen its effects in terms of > real performance loss. It would be reasonable to expect that futexes would have a somewhat more efficient code path in the case where you have to block (mainly because SysV semaphores have such a heavyweight API, much more complex than we really need). However, the code path that is killing us is the one where you *don't* actually need to block. If we had a proper fix for the problem then the context swap storm itself would go away, and whatever advantage you might be able to measure now for futexes likewise would go away. In other words, I'm not real excited about a wholesale replacement of code in order to speed up a code path that I don't want to be taking in the first place; especially not if that replacement puts a fence between me and working on the code path that I do care about. regards, tom lane
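For what it's worth, the API weight difference Tom mentions is easy to see side by side. The sketch below is illustrative only (not PostgreSQL's actual semaphore code; the names are made up): it shows the SysV setup needed before anyone can block, whereas with a futex the wait is a single syscall on an int that already lives in shared memory, as in the sketch earlier in the thread.

/* Illustrative only -- not PostgreSQL's pg_sema code.  Getting one
 * blocking "wait" out of SysV semaphores requires a kernel object
 * (semget), an initial value (semctl), and a semop() per operation. */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

union semun
{
    int              val;
    struct semid_ds *buf;
    unsigned short  *array;
};

static int
sysv_sema_create(void)
{
    union semun arg;
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

    if (semid >= 0)
    {
        arg.val = 0;                      /* start "locked" */
        semctl(semid, 0, SETVAL, arg);
    }
    return semid;
}

static void
sysv_sema_wait(int semid)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };

    semop(semid, &op, 1);                 /* EINTR handling omitted */
}

As Tom says, though, this only speeds up the path where a backend really has to sleep; it does nothing about the path that is actually hot.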
On Sun, Oct 17, 2004 at 09:39:33AM +0200, Manfred Spraul wrote: > Neil wrote: > > In any case, the "futex patch" > > uses the Linux 2.6 futex API to implement PostgreSQL spinlocks. > > Has anyone tried to replace the whole lwlock implementation with > pthread_rwlock? At least for Linux with recent glibcs, pthread_rwlock is > implemented with futexes, i.e. we would get fast lock handling without > OS-specific hacks. Perhaps other OSes contain user-space pthread locks, too. > Attached is an old patch. I tested it on a uniprocessor system a year > ago and it didn't provide much difference, but perhaps the scalability > is better. You'll have to add -lpthread to the library list for linking. I've heard that simply linking to the pthreads libraries, regardless of whether you're using them or not, creates a significant overhead. Has anyone tried it, just for kicks? Mark
Manfred Spraul <manfred@colorfullife.com> writes: > Tom Lane wrote: >> The bigger problem here is that the SMP locking bottlenecks we are >> currently seeing are *hardware* issues (AFAICT anyway). The only way >> that futexes can offer a performance win is if they have a smarter way >> of executing the basic atomic-test-and-set sequence than we do; >> > lwlocks operations are not a basic atomic-test-and-set sequence. They > are spinlock, several nonatomic operations, spin_unlock. Right, and it is the spinlock that is the problem. See discussions a few months back: at least on Intel SMP machines, most of the problem seems to have to do with trading the spinlock's cache line back and forth between CPUs. It's difficult to see how a futex is going to avoid that. regards, tom lane
On Wed, Oct 20, 2004 at 07:39:13PM +0200, Manfred Spraul wrote: > But: According to the descriptions the problem is a context switch > storm. I don't see that cache line bouncing can cause a context switch > storm. What causes the context switch storm? If it's the pg_usleep in > s_lock, then my patch should help a lot: with pthread_rwlock locks, this > line doesn't exist anymore. I gave Manfred's patch a try on my 4-way Xeon system with Tom's test_script.sql files. I ran 4 processes of test_script.sql against 8.0beta3 (without any patches) and, from my observations with top, the CPU utilization between processors was pretty erratic. They'd jump anywhere from 30% to 70%. With the futex patches that Neil and Gavin have been working on, I'd see the processors evenly utilized at about 50% each. With just Manfred's patch, I think there might be a problem somewhere with the patch, or something else, as only one processor is doing anything at a time, and it is 100% utilized. Here are some other details, per Manfred's request: Linux 2.6.8.1 (on a gentoo distro) gcc 3.3.4 glibc 2.3.3.20040420 Mark
Forgive my naivete, but do futexes implement some priority algorithm for deciding which process gets control? One of the problems, as I understand it, is that Linux does (did?) not implement a priority algorithm, so it is possible for the context which just gave up control to be the next context woken up, which of course is a complete waste of time. --dc-- Tom Lane wrote: > Manfred Spraul <manfred@colorfullife.com> writes: >> Tom Lane wrote: >>> The bigger problem here is that the SMP locking bottlenecks we are >>> currently seeing are *hardware* issues (AFAICT anyway). The only way >>> that futexes can offer a performance win is if they have a smarter way >>> of executing the basic atomic-test-and-set sequence than we do; >> lwlocks operations are not a basic atomic-test-and-set sequence. They >> are spinlock, several nonatomic operations, spin_unlock. > Right, and it is the spinlock that is the problem. See discussions a > few months back: at least on Intel SMP machines, most of the problem > seems to have to do with trading the spinlock's cache line back and > forth between CPUs. It's difficult to see how a futex is going to avoid > that. > regards, tom lane -- Dave Cramer www.postgresintl.com 519 939 0336 ICQ#14675561
On Thu, Oct 21, 2004 at 07:45:53AM +0200, Manfred Spraul wrote: > Mark Wong wrote: > > Here are some other details, per Manfred's request: > > Linux 2.6.8.1 (on a gentoo distro) > > How complicated are Tom's test scripts? His immediate reply was that I > should retest with Fedora, to rule out any gentoo bugs. > > I have a dual-cpu system with RH FC, I could use it for testing. Pretty simple: one to load the database and one to query it. I'll attach them. Mark
Josh Berkus wrote: > Tom, > >> The bigger problem here is that the SMP locking bottlenecks we are >> currently seeing are *hardware* issues (AFAICT anyway). The only way >> that futexes can offer a performance win is if they have a smarter way >> of executing the basic atomic-test-and-set sequence than we do; >> and if so, we could read their code and adopt that method without having >> to buy into any large reorganization of our code. > > Well, initial results from Gavin/Neil's patch seem to indicate that, while > futexes do not cure the CSStorm bug, they do lessen its effects in terms of > real performance loss. I proposed weeks ago to see how the CSStorm is affected by sticking each backend to one processor (the one where the process was born) using the CPU-affinity capability (kernel 2.6); is this proposal completely out of mind? Regards Gaetano Mendola
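For reference, pinning a process to a CPU on kernel 2.6 can be done from inside the process with sched_setaffinity(); the sketch below is illustrative only (it is not part of any posted patch, the function name is made up, and the exact glibc prototype varied a bit across 2.6-era versions):

/* Sketch only: pin the calling process to a single CPU on Linux 2.6.
 * Not part of any posted patch; error handling kept minimal. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int
pin_to_cpu(int cpu)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)  /* 0 = this process */
    {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

Each backend would call something like this once after fork, with the CPU chosen by whatever policy seems sensible (for example, round-robin over the available processors).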
Gaetano Mendola <mendola@bigfoot.com> writes: > I proposed weeks ago to see how the CSStorm is affected by stick each > backend in one processor ( where the process was born ) using the > cpu-affinity capability ( kernel 2.6 ), is this proposal completely > out of mind ? That was investigated long ago. See for instance http://archives.postgresql.org/pgsql-performance/2004-04/msg00313.php regards, tom lane
Tom Lane wrote: > Gaetano Mendola <mendola@bigfoot.com> writes: >> I proposed weeks ago to see how the CSStorm is affected by sticking each >> backend to one processor (where the process was born) using the >> cpu-affinity capability (kernel 2.6); is this proposal completely >> out of mind? > That was investigated long ago. See for instance > http://archives.postgresql.org/pgsql-performance/2004-04/msg00313.php If I read that correctly, it does help with the CSStorm; I would guess it could also help performance. Unfortunately I do not have any kernel 2.6 running on SMP to give it a try. Regards Gaetano Mendola
Josh Berkus wrote: | Gaetano, | |> I proposed weeks ago to see how the CSStorm is affected by sticking each |> backend to one processor (where the process was born) using the |> cpu-affinity capability (kernel 2.6); is this proposal completely out of |> mind? | | I don't see how that would help. The problem is not backends switching | processors, it's the BufMgrLock needing to be swapped between processors. This is not clear to me. What happens if, during a spinlock, a backend is moved away from one processor to another one? Regards Gaetano Mendola
Tom Lane wrote: >Manfred Spraul <manfred@colorfullife.com> writes: > > >>Has anyone tried to replace the whole lwlock implementation with >>pthread_rwlock? At least for Linux with recent glibcs, pthread_rwlock is >>implemented with futexes, i.e. we would get a fast lock handling without >>os specific hacks. >> >> > >"At least for Linux" does not strike me as equivalent to "without >OS-specific hacks". > > > For me, "at least for Linux" means that I have tested the patch with Linux. I'd expect that the patch works on most recent unices (pthread_rwlock_t is probably mandatory for Unix98 compatibility). You and others on this mailing list have access to other systems - my patch should be seen as a call for testers, not as a proposal for merging. I expect that Linux is not the only OS with fast user space semaphores, and if an OS has such objects, then the pthread_ locking functions are hopefully implemented by using them. IMHO it's better to support the standard function instead of trying to use the native (and OS specific) fast semaphore functions. >The bigger problem here is that the SMP locking bottlenecks we are >currently seeing are *hardware* issues (AFAICT anyway). The only way >that futexes can offer a performance win is if they have a smarter way >of executing the basic atomic-test-and-set sequence than we do; > > lwlocks operations are not a basic atomic-test-and-set sequence. They are spinlock, several nonatomic operations, spin_unlock. -- Manfred
Mark Wong wrote: >Here are some other details, per Manfred's request: > >Linux 2.6.8.1 (on a gentoo distro) > > How complicated are Tom's test scripts? His immediate reply was that I should retest with Fedora, to rule out any gentoo bugs. I have a dual-cpu system with RH FC, I could use it for testing. -- Manfred
Tom Lane wrote: > Manfred Spraul <manfred@colorfullife.com> writes: >> Tom Lane wrote: >>> The bigger problem here is that the SMP locking bottlenecks we are >>> currently seeing are *hardware* issues (AFAICT anyway). The only way >>> that futexes can offer a performance win is if they have a smarter way >>> of executing the basic atomic-test-and-set sequence than we do; >> lwlocks operations are not a basic atomic-test-and-set sequence. They >> are spinlock, several nonatomic operations, spin_unlock. > Right, and it is the spinlock that is the problem. See discussions a > few months back: at least on Intel SMP machines, most of the problem > seems to have to do with trading the spinlock's cache line back and > forth between CPUs.

I'd disagree: cache line bouncing is one problem. If that happens then there is only one solution: the number of changes to that cache line must be reduced. The tools used in the Linux kernel are:

- Hashing. An emergency approach if there is no other solution. I think Red Hat used it for the buffer cache in RH AS: instead of one buffer cache, there were lots of smaller buffer caches with individual locks. The cache was chosen based on the file position (probably mixed with some pointers to avoid overloading cache 0).

- For read-heavy loads: sequence locks. A reader reads a counter value and then accesses the data structure. At the end it checks whether the counter was modified; if it's still the same value it can continue, otherwise it must retry. Writers acquire a normal spinlock and then modify the counter value. (A minimal sketch of this reader/writer protocol appears below.) RCU is the second option, but there are patents - please be careful before using that tool.

- Complete rewrites that avoid the global lock. I think the global buffer cache is now gone; everything is handled per-file. I think there is a global list for buffer replacement, but at the top of the buffer replacement strategy is a simple clock algorithm. That means simple lookups/accesses just set a (local) referenced bit and don't have to acquire a global lock. I know that this is the total opposite of ARC, but perhaps it's the only scalable solution. ARC could be used as the second-level strategy.

But: according to the descriptions, the problem is a context switch storm. I don't see how cache line bouncing can cause a context switch storm. What causes the context switch storm? If it's the pg_usleep in s_lock, then my patch should help a lot: with pthread_rwlock locks, that line doesn't exist anymore.

-- Manfred
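Here is a minimal sketch of the sequence lock described above, as a user-space analogue (illustrative only: the real Linux seqlock_t adds the memory barriers this sketch omits, so don't treat it as production code):

#include <pthread.h>

/* Sketch of a sequence lock: readers never block, they just retry if a
 * writer got in the way.  The counter is odd while a write is in
 * progress.  Memory barriers are deliberately omitted for brevity. */
typedef struct
{
    volatile unsigned sequence;    /* odd = writer active */
    pthread_mutex_t   write_lock;  /* serializes writers only */
} seqcount_sketch;

static unsigned
read_begin(seqcount_sketch *s)
{
    unsigned seq;

    do {
        seq = s->sequence;
    } while (seq & 1);             /* writer in progress: wait it out */
    return seq;
}

static int
read_retry(seqcount_sketch *s, unsigned seq)
{
    return s->sequence != seq;     /* data changed while we were reading */
}

static void
write_begin(seqcount_sketch *s)
{
    pthread_mutex_lock(&s->write_lock);
    s->sequence++;                 /* now odd: readers will spin/retry */
}

static void
write_end(seqcount_sketch *s)
{
    s->sequence++;                 /* even again: readers may proceed */
    pthread_mutex_unlock(&s->write_lock);
}

A reader wraps its accesses in do { seq = read_begin(&s); ... copy the data ... } while (read_retry(&s, seq)); so the hot read path never dirties the lock's cache line.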
Mark Wong wrote: > Pretty simple: one to load the database and one to query it. I'll > attach them.

I've tested it on my dual-cpu computer:

- It works; both CPUs run within the postmaster. It seems something in your gentoo setup is broken.

- The number of context switches is down slightly, but not significantly: the glibc implementation is more or less identical to the implementation right now in lwlock.c: a spinlock that protects a few variables that are used to implement the actual mutex, and several wait queues: one for spinlock busy, one or two for the actual mutex code. Around 25% of the context switches are from spinlock collisions; the rest are from actual mutex collisions.

It might be possible to get rid of the spinlock collisions by writing a special, futex-based semaphore function that only supports exclusive access [like sem_wait/sem_post], but I don't think it's worth the effort: 75% of the context switches would remain. What's needed is a buffer manager that can do lookups without a global lock. -- Manfred
Mark Wong wrote: >I've heard that simply linking to the pthreads libraries, regardless of >whether you're using them or not creates a significant overhead. Has >anyone tried it for kicks? > > > That depends on the OS and the functions that are used. The typical worst case is buffered IO of single characters: The single threaded implementation is just copy and update buffer status, the multi threaded implementation contains full locking. For most other functions there is no difference at all. -- Manfred
Manfred, > How complicated are Tom's test scripts? His immediate reply was that I > should retest with Fedora, to rule out any gentoo bugs. We've done some testing on other Linux distributions. Linking in pthreads reduced CSes by < 15%, which had no appreciable impact on real performance. Gavin/Neil's full futex patch was of greater benefit; while it did not reduce CSes very much (25%), somehow the real performance benefit was greater. -- Josh Berkus Aglio Database Solutions San Francisco
Manfred Spraul <manfred@colorfullife.com> writes: > But: According to the descriptions the problem is a context switch > storm. I don't see that cache line bouncing can cause a context switch > storm. What causes the context switch storm? As best I can tell, the CS storm arises because the backends get into some sort of lockstep timing that makes it far more likely than you'd expect for backend A to try to enter the bufmgr when backend B is already holding the BufMgrLock. In the profiles we were looking at back in April, it seemed that about 10% of the time was spent inside bufmgr (which is bad enough in itself) but the odds of LWLock collision were much higher than 10%, leading to many context swaps. This is not totally surprising given that they are running identical queries and so are running through loops of the same length, but still it seems like there must be some effect driving their timing to converge instead of diverge away from the point of conflict. What I think (and here is where it's a leap of logic, cause I can't prove it) is that the excessive time spent passing the spinlock cache line back and forth is exactly the factor causing that convergence. Somehow, the delay caused when a processor has to wait to get the cache line contributes to keeping the backend loops in lockstep. It is critical to understand that the CS storm is associated with LWLock contention not spinlock contention: what we saw was a lot of semop()s not a lot of select()s. > If it's the pg_usleep in s_lock, then my patch should help a lot: with > pthread_rwlock locks, this line doesn't exist anymore. The profiles showed that s_lock() is hardly entered at all, and the select() delay is reached even more seldom. So changes in that area will make exactly zero difference. This is the surprising and counterintuitive thing: oprofile clearly shows that very large fractions of the CPU time are being spent at the initial TAS instructions in LWLockAcquire and LWLockRelease, and yet those TASes hardly ever fail, as proven by the fact that oprofile shows s_lock() is seldom entered. So as far as the Postgres code can tell, there isn't any contention worth mentioning for the spinlock. This is indeed the way it was designed to be, but when so much time is going to the TAS instructions, you'd think there'd be more software-visible contention for the spinlock. It could be that I'm all wet and there is no relationship between the cache line thrashing and the seemingly excessive BufMgrLock contention. They are after all occurring at two very different levels of abstraction. But I think there is some correlation that we just don't understand yet. regards, tom lane
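For readers following along, the pattern being described looks schematically like the sketch below. This is not the actual s_lock.c / lwlock.c code, just an outline of the spin-then-sleep shape: the point is that the xchg-based TAS takes the lock's cache line exclusively even when the lock is free and the fallback path is never entered.

#include <unistd.h>

/* Schematic of the TAS + delay pattern under discussion; NOT the real
 * PostgreSQL s_lock.c code. */
static int
tas(volatile char *lock)
{
    char res = 1;

    /* xchg with a memory operand is implicitly locked on x86, so this
     * single instruction is the atomic test-and-set. */
    __asm__ __volatile__("xchgb %0,%1"
                         : "+q"(res), "+m"(*lock)
                         :
                         : "memory");
    return res;                 /* 0 = acquired, 1 = was already held */
}

static void
spin_lock(volatile char *lock)
{
    while (tas(lock))
    {
        /* Slow path: spin reading the lock, then eventually sleep.  The
         * profiles discussed here show this path is almost never taken,
         * yet the TAS above still eats the CPU time -- which is the
         * cache line argument. */
        int spins = 0;

        while (*lock != 0)
        {
            if (++spins > 1000)
            {
                usleep(10000);  /* stand-in for the s_lock delay logic */
                spins = 0;
            }
        }
    }
}

The numbers (1000 spins, 10 ms sleep) are placeholders, not the values PostgreSQL actually uses.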
Manfred Spraul <manfred@colorfullife.com> writes: > Tom Lane wrote: >> It could be that I'm all wet and there is no relationship between the >> cache line thrashing and the seemingly excessive BufMgrLock contention. >> > Is it important? The fix is identical in both cases: per-bucket locks > for the hash table and a buffer aging strategy that doesn't need one > global lock that must be acquired for every lookup. Reducing BufMgrLock contention is a good idea, but it's not really my idea of a fix for this issue. In the absence of a full understanding, we may be fixing the wrong thing. It's worth remembering that when we first hit this issue, I made some simple changes that approximately halved the number of BufMgrLock acquisitions by joining ReleaseBuffer and ReadBuffer calls into ReleaseAndReadBuffer in all the places in the test case's loop. This made essentially no change in the CS storm behavior :-(. So I do not know how much contention we have to get rid of to get the problem to go away, or even whether this is the right path to take. (I am unconvinced that either of those specific suggestions is The Right Way to break up the bufmgrlock, either, but that's a different thread.) regards, tom lane
Tom Lane wrote: >It could be that I'm all wet and there is no relationship between the >cache line thrashing and the seemingly excessive BufMgrLock contention. > > Is it important? The fix is identical in both cases: per-bucket locks for the hash table and a buffer aging strategy that doesn't need one global lock that must be acquired for every lookup. -- Manfred
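To make the per-bucket idea concrete, here is a hedged sketch of a hash table with one lock per bucket; it is illustrative only, with made-up names, and not a proposal for how the PostgreSQL buffer tables should actually be restructured:

#include <pthread.h>
#include <stddef.h>

/* Sketch: lookups that hash to different buckets never touch the same
 * lock, so they never fight over the same lock cache line either. */
#define NUM_BUCKETS 1024

typedef struct entry
{
    unsigned      key;
    void         *value;
    struct entry *next;
} entry;

typedef struct
{
    pthread_mutex_t lock;        /* protects only this bucket's chain */
    entry          *chain;
} bucket;

static bucket table[NUM_BUCKETS];

static void
table_init(void)
{
    int i;

    for (i = 0; i < NUM_BUCKETS; i++)
    {
        pthread_mutex_init(&table[i].lock, NULL);
        table[i].chain = NULL;
    }
}

static void *
lookup(unsigned key)
{
    bucket *b = &table[key % NUM_BUCKETS];
    void   *result = NULL;
    entry  *e;

    pthread_mutex_lock(&b->lock);    /* contention is per bucket, not global */
    for (e = b->chain; e != NULL; e = e->next)
    {
        if (e->key == key)
        {
            result = e->value;
            break;
        }
    }
    pthread_mutex_unlock(&b->lock);
    return result;
}

The buffer aging strategy mentioned above is the harder half of the problem; per-bucket locks only cover the lookup side.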