Thread: futex results with dbt-3
Hi guys, I have some DBT-3 (decision support) results using Gavin's original futex patch fix. It's on our 8-way Pentium III Xeon systems in our STP environment. We definitely see some overall throughput improvement on the tests, about a 15% increase, but no change with respect to the number of context switches. Perhaps it doesn't really address what's causing the incredible number of context switches, but it still helps. I think I'm seeing what Gavin has seen: it seems to solve some concurrency problems on x86 platforms. Here are the results without futexes: http://khack.osdl.org/stp/298114/ Results with futexes: http://khack.osdl.org/stp/298115/ Mark
On Thu, 2004-10-14 at 04:57, Mark Wong wrote: > I have some DBT-3 (decision support) results using Gavin's original > futex patch fix. I sent an initial description of the futex patch to the mailing lists last week, but it never appeared (from talking to Marc I believe it exceeded the size limit on -performance). In any case, the "futex patch" uses the Linux 2.6 futex API to implement PostgreSQL spinlocks. The hope is that using futexes will lead to better performance when there is contention for spinlocks (e.g. on a busy SMP system). The original patch was written by Stephen Hemminger at OSDL; Gavin and I have done a bunch of additional bugfixing and optimization, as well as added IA64 support. I've attached a WIP copy of the patch to this email (it supports x86, x86-64 (untested) and IA64 -- more architectures can be added on request). I'll post a longer writeup when I submit the patch to -patches. > Definitely see some overall throughput performance on the tests, about > 15% increase, but no change with respect to the number of context > switches. I'm glad to see that there is a performance improvement; in my own testing on an 8-way P3 system provided by OSDL, I saw a similar improvement in pgbench performance (50 concurrent clients, 1000 transactions each, scale factor 75; without the patch, TPS was between 180 and 185; with the patch, TPS was between 200 and 215). As for context switching, there was some earlier speculation that the patch might improve or even resolve the "CS storm" issue that some people have experienced with SMP Xeon P4 systems. I don't think we have enough evidence to answer that one way or the other at this point. -Neil
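For readers who haven't seen the futex API before, the basic idea looks roughly like the sketch below. This is not the patch itself: the function names are illustrative, it uses GCC atomic builtins for brevity, and error handling is omitted. The point is only that the uncontended path is pure user-space atomics, and the kernel (via the futex syscall) is entered only when a process actually has to sleep or be woken.

/* Minimal sketch of a futex-backed exclusive lock, NOT the patch itself.
 * Assumes Linux 2.6 and GCC atomic builtins; error handling omitted.
 * Lock word: 0 = free, 1 = locked, 2 = locked with waiters. */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static void
futex_lock(volatile int *lock)
{
    int c = __sync_val_compare_and_swap(lock, 0, 1);

    while (c != 0)
    {
        /* Mark the lock contended, then sleep until it might be free. */
        if (c == 2 || __sync_val_compare_and_swap(lock, 1, 2) != 0)
            syscall(SYS_futex, lock, FUTEX_WAIT, 2, NULL, NULL, 0);
        c = __sync_val_compare_and_swap(lock, 0, 2);
    }
}

static void
futex_unlock(volatile int *lock)
{
    if (__sync_fetch_and_sub(lock, 1) != 1)  /* someone may be sleeping */
    {
        *lock = 0;
        syscall(SYS_futex, lock, FUTEX_WAKE, 1, NULL, NULL, 0);
    }
}

The lock word lives in ordinary shared memory, so (unlike a SysV semaphore) no per-lock kernel object has to be created up front.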
Neil wrote:
> In any case, the "futex patch" uses the Linux 2.6 futex API to implement
> PostgreSQL spinlocks.

Has anyone tried to replace the whole lwlock implementation with pthread_rwlock? At least for Linux with recent glibcs, pthread_rwlock is implemented with futexes, i.e. we would get fast lock handling without OS-specific hacks. Perhaps other OSes contain user-space pthread locks, too. Attached is an old patch. I tested it on a uniprocessor system a year ago and it didn't provide much difference, but perhaps the scalability is better. You'll have to add -lpthread to the library list for linking.

Regarding Neil's patch:

>! /*
>!  * XXX: is there a more efficient way to write this? Perhaps using
>!  * decl...?
>!  */
>! static __inline__ slock_t
>! atomic_dec(volatile slock_t *ptr)
>! {
>!     slock_t prev = -1;
>!
>!     __asm__ __volatile__(
>!         " lock \n"
>!         " xadd %0,%1 \n"
>!         :"=q"(prev)
>!         :"m"(*ptr), "0"(prev)
>!         :"memory", "cc");
>!
>!     return prev;
>! }

xadd is not supported by original 80386 CPUs; it was added with the 80486. There is no 80386 instruction that atomically decrements an integer and retrieves the old value. The only options are atomic_dec_test_zero or atomic_dec_test_negative - those can be implemented by looking at the zero/sign flag. Depending on what you want, this may be enough. Or make the futex code conditional on > 80386 CPUs.

-- Manfred

--- p7.3.3.orig/src/backend/storage/lmgr/lwlock.c  2002-09-25 22:31:40.000000000 +0200
+++ postgresql-7.3.3/src/backend/storage/lmgr/lwlock.c  2003-09-06 14:15:01.000000000 +0200
@@ -26,6 +26,28 @@
 #include "storage/proc.h"
 #include "storage/spin.h"
 
+#define USE_PTHREAD_LOCKS
+
+#ifdef USE_PTHREAD_LOCKS
+
+#include <pthread.h>
+#include <errno.h>
+typedef pthread_rwlock_t LWLock;
+
+inline static void
+InitLWLock(LWLock *p)
+{
+	pthread_rwlockattr_t rwattr;
+	int i;
+
+	pthread_rwlockattr_init(&rwattr);
+	pthread_rwlockattr_setpshared(&rwattr, PTHREAD_PROCESS_SHARED);
+	i=pthread_rwlock_init(p, &rwattr);
+	pthread_rwlockattr_destroy(&rwattr);
+	if (i)
+		elog(FATAL, "pthread_rwlock_init failed");
+}
+#else
 
 typedef struct LWLock
 {
@@ -38,6 +60,17 @@
 	/* tail is undefined when head is NULL */
 } LWLock;
 
+inline static void
+InitLWLock(LWLock *lock)
+{
+	SpinLockInit(&lock->mutex);
+	lock->releaseOK = true;
+	lock->exclusive = 0;
+	lock->shared = 0;
+	lock->head = NULL;
+	lock->tail = NULL;
+}
+#endif
 /*
  * This points to the array of LWLocks in shared memory. Backends inherit
  * the pointer by fork from the postmaster. LWLockIds are indexes into
@@ -61,7 +94,7 @@
 static LWLockId held_lwlocks[MAX_SIMUL_LWLOCKS];
 
 
-#ifdef LOCK_DEBUG
+#if defined(LOCK_DEBUG) && !defined(USE_PTHREAD_LOCKS)
 bool Trace_lwlocks = false;
 
 inline static void
@@ -153,12 +186,7 @@
 	 */
 	for (id = 0, lock = LWLockArray; id < numLocks; id++, lock++)
 	{
-		SpinLockInit(&lock->mutex);
-		lock->releaseOK = true;
-		lock->exclusive = 0;
-		lock->shared = 0;
-		lock->head = NULL;
-		lock->tail = NULL;
+		InitLWLock(lock);
 	}
 
 	/*
@@ -185,7 +213,116 @@
 	return (LWLockId) (LWLockCounter[0]++);
 }
 
+#ifdef USE_PTHREAD_LOCKS
+/*
+ * LWLockAcquire - acquire a lightweight lock in the specified mode
+ *
+ * If the lock is not available, sleep until it is.
+ *
+ * Side effect: cancel/die interrupts are held off until lock release.
+ */
+void
+LWLockAcquire(LWLockId lockid, LWLockMode mode)
+{
+	int i;
+	PRINT_LWDEBUG("LWLockAcquire", lockid, &LWLockArray[lockid]);
+
+	/*
+	 * We can't wait if we haven't got a PGPROC. This should only occur
+	 * during bootstrap or shared memory initialization. Put an Assert
+	 * here to catch unsafe coding practices.
+	 */
+	Assert(!(proc == NULL && IsUnderPostmaster));
+
+	/*
+	 * Lock out cancel/die interrupts until we exit the code section
+	 * protected by the LWLock. This ensures that interrupts will not
+	 * interfere with manipulations of data structures in shared memory.
+	 */
+	HOLD_INTERRUPTS();
+
+	if (mode == LW_EXCLUSIVE) {
+		i = pthread_rwlock_wrlock(&LWLockArray[lockid]);
+	} else {
+		i = pthread_rwlock_rdlock(&LWLockArray[lockid]);
+	}
+	if (i)
+		elog(FATAL, "Unexpected error from pthread_rwlock.");
+
+	/* Add lock to list of locks held by this backend */
+	Assert(num_held_lwlocks < MAX_SIMUL_LWLOCKS);
+	held_lwlocks[num_held_lwlocks++] = lockid;
+}
+
+/*
+ * LWLockConditionalAcquire - acquire a lightweight lock in the specified mode
+ *
+ * If the lock is not available, return FALSE with no side-effects.
+ *
+ * If successful, cancel/die interrupts are held off until lock release.
+ */
+bool
+LWLockConditionalAcquire(LWLockId lockid, LWLockMode mode)
+{
+	int i;
+	PRINT_LWDEBUG("LWLockConditionalAcquire", lockid, &LWLockArray[lockid]);
+
+	HOLD_INTERRUPTS();
+
+	if (mode == LW_EXCLUSIVE) {
+		i = pthread_rwlock_trywrlock(&LWLockArray[lockid]);
+	} else {
+		i = pthread_rwlock_tryrdlock(&LWLockArray[lockid]);
+	}
+	switch(i) {
+	case 0:
+		/* Add lock to list of locks held by this backend */
+		Assert(num_held_lwlocks < MAX_SIMUL_LWLOCKS);
+		held_lwlocks[num_held_lwlocks++] = lockid;
+		return true;
+	case EBUSY:
+		RESUME_INTERRUPTS();
+		return false;
+	default:
+		elog(FATAL, "Unexpected error from pthread_rwlock_try.");
+		return false;
+	}
+}
+
+/*
+ * LWLockRelease - release a previously acquired lock
+ */
+void
+LWLockRelease(LWLockId lockid)
+{
+	int i;
+
+	/*
+	 * Remove lock from list of locks held. Usually, but not always, it
+	 * will be the latest-acquired lock; so search array backwards.
+	 */
+	for (i = num_held_lwlocks; --i >= 0;)
+	{
+		if (lockid == held_lwlocks[i])
+			break;
+	}
+	if (i < 0)
+		elog(ERROR, "LWLockRelease: lock %d is not held", (int) lockid);
+	num_held_lwlocks--;
+	for (; i < num_held_lwlocks; i++)
+		held_lwlocks[i] = held_lwlocks[i + 1];
+
+	i = pthread_rwlock_unlock(&LWLockArray[lockid]);
+	if (i)
+		elog(FATAL, "Unexpected error from pthread_rwlock_unlock.");
+
+	/*
+	 * Now okay to allow cancel/die interrupts.
+	 */
+	RESUME_INTERRUPTS();
+}
+#else
 /*
  * LWLockAcquire - acquire a lightweight lock in the specified mode
  *
  * If the lock is not available, sleep until it is.
  *
  * Side effect: cancel/die interrupts are held off until lock release.
  */
@@ -499,6 +636,7 @@
 	RESUME_INTERRUPTS();
 }
 
+#endif
 
 /*
  * LWLockReleaseAll - release all currently-held locks
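To make the 80386 alternative concrete, here is a hedged sketch of an atomic decrement that reports the sign of the result via the sign flag, using only lock/decl and SETcc (all available on the 386). It is not taken from either patch, and the function name is illustrative:

/* Sketch only, not from either patch: 80386-compatible atomic decrement
 * that tells you whether the result went negative, instead of returning
 * the previous value the way xadd does. */
static __inline__ int
atomic_dec_test_negative(volatile int *ptr)
{
    unsigned char negative;

    __asm__ __volatile__(
        " lock        \n"
        " decl %0     \n"   /* atomically decrement *ptr */
        " sets %1     \n"   /* negative = 1 if the new value is < 0 */
        : "+m"(*ptr), "=q"(negative)
        :
        : "memory", "cc");
    return negative;
}

Whether something like this is enough depends on whether the futex code actually needs the old value, or only needs to know when the count drops below zero.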
Manfred Spraul <manfred@colorfullife.com> writes: > Has anyone tried to replace the whole lwlock implementation with > pthread_rwlock? At least for Linux with recent glibcs, pthread_rwlock is > implemented with futexes, i.e. we would get a fast lock handling without > os specific hacks. "At least for Linux" does not strike me as equivalent to "without OS-specific hacks". The bigger problem here is that the SMP locking bottlenecks we are currently seeing are *hardware* issues (AFAICT anyway). The only way that futexes can offer a performance win is if they have a smarter way of executing the basic atomic-test-and-set sequence than we do; and if so, we could read their code and adopt that method without having to buy into any large reorganization of our code. regards, tom lane
Tom, > The bigger problem here is that the SMP locking bottlenecks we are > currently seeing are *hardware* issues (AFAICT anyway). The only way > that futexes can offer a performance win is if they have a smarter way > of executing the basic atomic-test-and-set sequence than we do; > and if so, we could read their code and adopt that method without having > to buy into any large reorganization of our code. Well, initial results from Gavin/Neil's patch seem to indicate that, while futexes do not cure the CSStorm bug, they do lessen its effects in terms of real performance loss. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Josh Berkus <josh@agliodbs.com> writes: >> The bigger problem here is that the SMP locking bottlenecks we are >> currently seeing are *hardware* issues (AFAICT anyway). > Well, initial results from Gavin/Neil's patch seem to indicate that, while > futexes do not cure the CSStorm bug, they do lessen its effects in terms of > real performance loss. It would be reasonable to expect that futexes would have a somewhat more efficient code path in the case where you have to block (mainly because SysV semaphores have such a heavyweight API, much more complex than we really need). However, the code path that is killing us is the one where you *don't* actually need to block. If we had a proper fix for the problem then the context swap storm itself would go away, and whatever advantage you might be able to measure now for futexes likewise would go away. In other words, I'm not real excited about a wholesale replacement of code in order to speed up a code path that I don't want to be taking in the first place; especially not if that replacement puts a fence between me and working on the code path that I do care about. regards, tom lane
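For what it's worth, the API weight difference Tom mentions is easy to see side by side. The sketch below is illustrative only (not PostgreSQL's actual semaphore code; the names are made up): it shows the SysV setup needed before anyone can block, whereas with a futex the wait is a single syscall on an int that already lives in shared memory, as in the sketch earlier in the thread.

/* Illustrative only -- not PostgreSQL's pg_sema code.  Getting one
 * blocking "wait" out of SysV semaphores requires a kernel object
 * (semget), an initial value (semctl), and a semop() per operation. */
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

union semun
{
    int              val;
    struct semid_ds *buf;
    unsigned short  *array;
};

static int
sysv_sema_create(void)
{
    union semun arg;
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);

    if (semid >= 0)
    {
        arg.val = 0;                      /* start "locked" */
        semctl(semid, 0, SETVAL, arg);
    }
    return semid;
}

static void
sysv_sema_wait(int semid)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };

    semop(semid, &op, 1);                 /* EINTR handling omitted */
}

As Tom says, though, this only speeds up the path where a backend really has to sleep; it does nothing about the path that is actually hot.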
On Sun, Oct 17, 2004 at 09:39:33AM +0200, Manfred Spraul wrote: > Neil wrote: > > In any case, the "futex patch" > > uses the Linux 2.6 futex API to implement PostgreSQL spinlocks. > > Has anyone tried to replace the whole lwlock implementation with > pthread_rwlock? At least for Linux with recent glibcs, pthread_rwlock is > implemented with futexes, i.e. we would get fast lock handling without > OS-specific hacks. Perhaps other OSes contain user-space pthread locks, too. > Attached is an old patch. I tested it on a uniprocessor system a year > ago and it didn't provide much difference, but perhaps the scalability > is better. You'll have to add -lpthread to the library list for linking. I've heard that simply linking to the pthreads libraries, regardless of whether you're using them or not, creates a significant overhead. Has anyone tried it, just for kicks? Mark
Manfred Spraul <manfred@colorfullife.com> writes: > Tom Lane wrote: >> The bigger problem here is that the SMP locking bottlenecks we are >> currently seeing are *hardware* issues (AFAICT anyway). The only way >> that futexes can offer a performance win is if they have a smarter way >> of executing the basic atomic-test-and-set sequence than we do; >> > lwlocks operations are not a basic atomic-test-and-set sequence. They > are spinlock, several nonatomic operations, spin_unlock. Right, and it is the spinlock that is the problem. See discussions a few months back: at least on Intel SMP machines, most of the problem seems to have to do with trading the spinlock's cache line back and forth between CPUs. It's difficult to see how a futex is going to avoid that. regards, tom lane
On Wed, Oct 20, 2004 at 07:39:13PM +0200, Manfred Spraul wrote: > But: According to the descriptions the problem is a context switch > storm. I don't see that cache line bouncing can cause a context switch > storm. What causes the context switch storm? If it's the pg_usleep in > s_lock, then my patch should help a lot: with pthread_rwlock locks, this > line doesn't exist anymore. I gave Manfred's patch a try on my 4-way Xeon system with Tom's test_script.sql files. I ran 4 processes of test_script.sql against 8.0beta3 (without any patches) and, from my observations with top, the CPU utilization between processors was pretty erratic. They'd jump anywhere from 30% to 70%. With the futex patches that Neil and Gavin have been working on, I'd see the processors evenly utilized at about 50% each. With just Manfred's patch, I think there might be a problem somewhere with the patch, or something else, as only one processor is doing anything at a time, and it is 100% utilized. Here are some other details, per Manfred's request: Linux 2.6.8.1 (on a gentoo distro) gcc 3.3.4 glibc 2.3.3.20040420 Mark
Forgive my naivete, but do futexes implement some priority algorithm for deciding which process gets control? One of the problems, as I understand it, is that Linux does (did?) not implement a priority algorithm, so it is possible for the context which just gave up control to be the next context woken up, which of course is a complete waste of time. --dc-- Tom Lane wrote: > Manfred Spraul <manfred@colorfullife.com> writes: >> Tom Lane wrote: >>> The bigger problem here is that the SMP locking bottlenecks we are >>> currently seeing are *hardware* issues (AFAICT anyway). The only way >>> that futexes can offer a performance win is if they have a smarter way >>> of executing the basic atomic-test-and-set sequence than we do; >> lwlocks operations are not a basic atomic-test-and-set sequence. They >> are spinlock, several nonatomic operations, spin_unlock. > Right, and it is the spinlock that is the problem. See discussions a > few months back: at least on Intel SMP machines, most of the problem > seems to have to do with trading the spinlock's cache line back and > forth between CPUs. It's difficult to see how a futex is going to avoid > that. > regards, tom lane -- Dave Cramer www.postgresintl.com 519 939 0336 ICQ#14675561
On Thu, Oct 21, 2004 at 07:45:53AM +0200, Manfred Spraul wrote: > Mark Wong wrote: > > Here are some other details, per Manfred's request: > > Linux 2.6.8.1 (on a gentoo distro) > > How complicated are Tom's test scripts? His immediate reply was that I > should retest with Fedora, to rule out any gentoo bugs. > > I have a dual-cpu system with RH FC, I could use it for testing. Pretty simple: one to load the database and one to query it. I'll attach them. Mark
Josh Berkus wrote: > Tom, > >> The bigger problem here is that the SMP locking bottlenecks we are >> currently seeing are *hardware* issues (AFAICT anyway). The only way >> that futexes can offer a performance win is if they have a smarter way >> of executing the basic atomic-test-and-set sequence than we do; >> and if so, we could read their code and adopt that method without having >> to buy into any large reorganization of our code. > > Well, initial results from Gavin/Neil's patch seem to indicate that, while > futexes do not cure the CSStorm bug, they do lessen its effects in terms of > real performance loss. I proposed weeks ago to see how the CSStorm is affected by sticking each backend to one processor (the one where the process was born) using the CPU-affinity capability (kernel 2.6); is this proposal completely out of mind? Regards Gaetano Mendola
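For reference, pinning a process to a CPU on kernel 2.6 can be done from inside the process with sched_setaffinity(); the sketch below is illustrative only (it is not part of any posted patch, the function name is made up, and the exact glibc prototype varied a bit across 2.6-era versions):

/* Sketch only: pin the calling process to a single CPU on Linux 2.6.
 * Not part of any posted patch; error handling kept minimal. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int
pin_to_cpu(int cpu)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)  /* 0 = this process */
    {
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

Each backend would call something like this once after fork, with the CPU chosen by whatever policy seems sensible (for example, round-robin over the available processors).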
Gaetano Mendola <mendola@bigfoot.com> writes: > I proposed weeks ago to see how the CSStorm is affected by stick each > backend in one processor ( where the process was born ) using the > cpu-affinity capability ( kernel 2.6 ), is this proposal completely > out of mind ? That was investigated long ago. See for instance http://archives.postgresql.org/pgsql-performance/2004-04/msg00313.php regards, tom lane
Tom Lane wrote: > Gaetano Mendola <mendola@bigfoot.com> writes: >> I proposed weeks ago to see how the CSStorm is affected by sticking each >> backend to one processor (where the process was born) using the >> cpu-affinity capability (kernel 2.6); is this proposal completely >> out of mind? > That was investigated long ago. See for instance > http://archives.postgresql.org/pgsql-performance/2004-04/msg00313.php If I read that correctly, it does help with the CSStorm; I would guess it could also help performance. Unfortunately I do not have any kernel 2.6 running on SMP to give it a try. Regards Gaetano Mendola
Josh Berkus wrote: | Gaetano, | |> I proposed weeks ago to see how the CSStorm is affected by sticking each |> backend to one processor (where the process was born) using the |> cpu-affinity capability (kernel 2.6); is this proposal completely out of |> mind? | | I don't see how that would help. The problem is not backends switching | processors, it's the BufMgrLock needing to be swapped between processors. This is not clear to me. What happens if, during a spinlock, a backend is moved away from one processor to another one? Regards Gaetano Mendola
Tom Lane wrote: >Manfred Spraul <manfred@colorfullife.com> writes: > > >>Has anyone tried to replace the whole lwlock implementation with >>pthread_rwlock? At least for Linux with recent glibcs, pthread_rwlock is >>implemented with futexes, i.e. we would get a fast lock handling without >>os specific hacks. >> >> > >"At least for Linux" does not strike me as equivalent to "without >OS-specific hacks". > > > For me, "at least for Linux" means that I have tested the patch with Linux. I'd expect that the patch works on most recent unices (pthread_rwlock_t is probably mandatory for Unix98 compatibility). You and others on this mailing list have access to other systems - my patch should be seen as a call for testers, not as a proposal for merging. I expect that Linux is not the only OS with fast user space semaphores, and if an OS has such objects, then the pthread_ locking functions are hopefully implemented by using them. IMHO it's better to support the standard function instead of trying to use the native (and OS specific) fast semaphore functions. >The bigger problem here is that the SMP locking bottlenecks we are >currently seeing are *hardware* issues (AFAICT anyway). The only way >that futexes can offer a performance win is if they have a smarter way >of executing the basic atomic-test-and-set sequence than we do; > > lwlocks operations are not a basic atomic-test-and-set sequence. They are spinlock, several nonatomic operations, spin_unlock. -- Manfred
Mark Wong wrote: >Here are some other details, per Manfred's request: > >Linux 2.6.8.1 (on a gentoo distro) > > How complicated are Tom's test scripts? His immediate reply was that I should retest with Fedora, to rule out any gentoo bugs. I have a dual-cpu system with RH FC, I could use it for testing. -- Manfred
Tom Lane wrote: > Manfred Spraul <manfred@colorfullife.com> writes: >> Tom Lane wrote: >>> The bigger problem here is that the SMP locking bottlenecks we are >>> currently seeing are *hardware* issues (AFAICT anyway). The only way >>> that futexes can offer a performance win is if they have a smarter way >>> of executing the basic atomic-test-and-set sequence than we do; >> lwlocks operations are not a basic atomic-test-and-set sequence. They >> are spinlock, several nonatomic operations, spin_unlock. > Right, and it is the spinlock that is the problem. See discussions a > few months back: at least on Intel SMP machines, most of the problem > seems to have to do with trading the spinlock's cache line back and > forth between CPUs.

I'd disagree: cache line bouncing is one problem. If that happens then there is only one solution: the number of changes to that cache line must be reduced. The tools used in the Linux kernel are:

- Hashing. An emergency approach if there is no other solution. I think Red Hat used it for the buffer cache in RH AS: instead of one buffer cache, there were lots of smaller buffer caches with individual locks. The cache was chosen based on the file position (probably mixed with some pointers to avoid overloading cache 0).

- For read-heavy loads: sequence locks. A reader reads a counter value and then accesses the data structure. At the end it checks whether the counter was modified; if it's still the same value it can continue, otherwise it must retry. Writers acquire a normal spinlock and then modify the counter value. (A minimal sketch of this reader/writer protocol appears below.) RCU is the second option, but there are patents - please be careful before using that tool.

- Complete rewrites that avoid the global lock. I think the global buffer cache is now gone; everything is handled per-file. I think there is a global list for buffer replacement, but at the top of the buffer replacement strategy is a simple clock algorithm. That means simple lookups/accesses just set a (local) referenced bit and don't have to acquire a global lock. I know that this is the total opposite of ARC, but perhaps it's the only scalable solution. ARC could be used as the second-level strategy.

But: according to the descriptions, the problem is a context switch storm. I don't see how cache line bouncing can cause a context switch storm. What causes the context switch storm? If it's the pg_usleep in s_lock, then my patch should help a lot: with pthread_rwlock locks, that line doesn't exist anymore.

-- Manfred
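Here is a minimal sketch of the sequence lock described above, as a user-space analogue (illustrative only: the real Linux seqlock_t adds the memory barriers this sketch omits, so don't treat it as production code):

#include <pthread.h>

/* Sketch of a sequence lock: readers never block, they just retry if a
 * writer got in the way.  The counter is odd while a write is in
 * progress.  Memory barriers are deliberately omitted for brevity. */
typedef struct
{
    volatile unsigned sequence;    /* odd = writer active */
    pthread_mutex_t   write_lock;  /* serializes writers only */
} seqcount_sketch;

static unsigned
read_begin(seqcount_sketch *s)
{
    unsigned seq;

    do {
        seq = s->sequence;
    } while (seq & 1);             /* writer in progress: wait it out */
    return seq;
}

static int
read_retry(seqcount_sketch *s, unsigned seq)
{
    return s->sequence != seq;     /* data changed while we were reading */
}

static void
write_begin(seqcount_sketch *s)
{
    pthread_mutex_lock(&s->write_lock);
    s->sequence++;                 /* now odd: readers will spin/retry */
}

static void
write_end(seqcount_sketch *s)
{
    s->sequence++;                 /* even again: readers may proceed */
    pthread_mutex_unlock(&s->write_lock);
}

A reader wraps its accesses in do { seq = read_begin(&s); ... copy the data ... } while (read_retry(&s, seq)); so the hot read path never dirties the lock's cache line.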
Mark Wong wrote: > Pretty simple: one to load the database and one to query it. I'll > attach them.

I've tested it on my dual-cpu computer:

- It works; both CPUs run within the postmaster. It seems something in your gentoo setup is broken.

- The number of context switches is down slightly, but not significantly: the glibc implementation is more or less identical to the implementation right now in lwlock.c: a spinlock that protects a few variables that are used to implement the actual mutex, and several wait queues: one for spinlock busy, one or two for the actual mutex code. Around 25% of the context switches are from spinlock collisions; the rest are from actual mutex collisions.

It might be possible to get rid of the spinlock collisions by writing a special, futex-based semaphore function that only supports exclusive access [like sem_wait/sem_post], but I don't think it's worth the effort: 75% of the context switches would remain. What's needed is a buffer manager that can do lookups without a global lock. -- Manfred
Mark Wong wrote: >I've heard that simply linking to the pthreads libraries, regardless of >whether you're using them or not creates a significant overhead. Has >anyone tried it for kicks? > > > That depends on the OS and the functions that are used. The typical worst case is buffered IO of single characters: The single threaded implementation is just copy and update buffer status, the multi threaded implementation contains full locking. For most other functions there is no difference at all. -- Manfred
Manfred, > How complicated are Tom's test scripts? His immediate reply was that I > should retest with Fedora, to rule out any gentoo bugs. We've done some testing on other Linux distributions. Linking in pthreads reduced CSes by < 15%, which had no appreciable impact on real performance. Gavin/Neil's full futex patch was of greater benefit; while it did not reduce CSes very much (25%), somehow the real performance benefit was greater. -- Josh Berkus Aglio Database Solutions San Francisco
Manfred Spraul <manfred@colorfullife.com> writes: > But: According to the descriptions the problem is a context switch > storm. I don't see that cache line bouncing can cause a context switch > storm. What causes the context switch storm? As best I can tell, the CS storm arises because the backends get into some sort of lockstep timing that makes it far more likely than you'd expect for backend A to try to enter the bufmgr when backend B is already holding the BufMgrLock. In the profiles we were looking at back in April, it seemed that about 10% of the time was spent inside bufmgr (which is bad enough in itself) but the odds of LWLock collision were much higher than 10%, leading to many context swaps. This is not totally surprising given that they are running identical queries and so are running through loops of the same length, but still it seems like there must be some effect driving their timing to converge instead of diverge away from the point of conflict. What I think (and here is where it's a leap of logic, cause I can't prove it) is that the excessive time spent passing the spinlock cache line back and forth is exactly the factor causing that convergence. Somehow, the delay caused when a processor has to wait to get the cache line contributes to keeping the backend loops in lockstep. It is critical to understand that the CS storm is associated with LWLock contention not spinlock contention: what we saw was a lot of semop()s not a lot of select()s. > If it's the pg_usleep in s_lock, then my patch should help a lot: with > pthread_rwlock locks, this line doesn't exist anymore. The profiles showed that s_lock() is hardly entered at all, and the select() delay is reached even more seldom. So changes in that area will make exactly zero difference. This is the surprising and counterintuitive thing: oprofile clearly shows that very large fractions of the CPU time are being spent at the initial TAS instructions in LWLockAcquire and LWLockRelease, and yet those TASes hardly ever fail, as proven by the fact that oprofile shows s_lock() is seldom entered. So as far as the Postgres code can tell, there isn't any contention worth mentioning for the spinlock. This is indeed the way it was designed to be, but when so much time is going to the TAS instructions, you'd think there'd be more software-visible contention for the spinlock. It could be that I'm all wet and there is no relationship between the cache line thrashing and the seemingly excessive BufMgrLock contention. They are after all occurring at two very different levels of abstraction. But I think there is some correlation that we just don't understand yet. regards, tom lane
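For readers following along, the pattern being described looks schematically like the sketch below. This is not the actual s_lock.c / lwlock.c code, just an outline of the spin-then-sleep shape: the point is that the xchg-based TAS takes the lock's cache line exclusively even when the lock is free and the fallback path is never entered.

#include <unistd.h>

/* Schematic of the TAS + delay pattern under discussion; NOT the real
 * PostgreSQL s_lock.c code. */
static int
tas(volatile char *lock)
{
    char res = 1;

    /* xchg with a memory operand is implicitly locked on x86, so this
     * single instruction is the atomic test-and-set. */
    __asm__ __volatile__("xchgb %0,%1"
                         : "+q"(res), "+m"(*lock)
                         :
                         : "memory");
    return res;                 /* 0 = acquired, 1 = was already held */
}

static void
spin_lock(volatile char *lock)
{
    while (tas(lock))
    {
        /* Slow path: spin reading the lock, then eventually sleep.  The
         * profiles discussed here show this path is almost never taken,
         * yet the TAS above still eats the CPU time -- which is the
         * cache line argument. */
        int spins = 0;

        while (*lock != 0)
        {
            if (++spins > 1000)
            {
                usleep(10000);  /* stand-in for the s_lock delay logic */
                spins = 0;
            }
        }
    }
}

The numbers (1000 spins, 10 ms sleep) are placeholders, not the values PostgreSQL actually uses.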
Manfred Spraul <manfred@colorfullife.com> writes: > Tom Lane wrote: >> It could be that I'm all wet and there is no relationship between the >> cache line thrashing and the seemingly excessive BufMgrLock contention. >> > Is it important? The fix is identical in both cases: per-bucket locks > for the hash table and a buffer aging strategy that doesn't need one > global lock that must be acquired for every lookup. Reducing BufMgrLock contention is a good idea, but it's not really my idea of a fix for this issue. In the absence of a full understanding, we may be fixing the wrong thing. It's worth remembering that when we first hit this issue, I made some simple changes that approximately halved the number of BufMgrLock acquisitions by joining ReleaseBuffer and ReadBuffer calls into ReleaseAndReadBuffer in all the places in the test case's loop. This made essentially no change in the CS storm behavior :-(. So I do not know how much contention we have to get rid of to get the problem to go away, or even whether this is the right path to take. (I am unconvinced that either of those specific suggestions is The Right Way to break up the bufmgrlock, either, but that's a different thread.) regards, tom lane
Tom Lane wrote: >It could be that I'm all wet and there is no relationship between the >cache line thrashing and the seemingly excessive BufMgrLock contention. > > Is it important? The fix is identical in both cases: per-bucket locks for the hash table and a buffer aging strategy that doesn't need one global lock that must be acquired for every lookup. -- Manfred
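To make the per-bucket idea concrete, here is a hedged sketch of a hash table with one lock per bucket; it is illustrative only, with made-up names, and not a proposal for how the PostgreSQL buffer tables should actually be restructured:

#include <pthread.h>
#include <stddef.h>

/* Sketch: lookups that hash to different buckets never touch the same
 * lock, so they never fight over the same lock cache line either. */
#define NUM_BUCKETS 1024

typedef struct entry
{
    unsigned      key;
    void         *value;
    struct entry *next;
} entry;

typedef struct
{
    pthread_mutex_t lock;        /* protects only this bucket's chain */
    entry          *chain;
} bucket;

static bucket table[NUM_BUCKETS];

static void
table_init(void)
{
    int i;

    for (i = 0; i < NUM_BUCKETS; i++)
    {
        pthread_mutex_init(&table[i].lock, NULL);
        table[i].chain = NULL;
    }
}

static void *
lookup(unsigned key)
{
    bucket *b = &table[key % NUM_BUCKETS];
    void   *result = NULL;
    entry  *e;

    pthread_mutex_lock(&b->lock);    /* contention is per bucket, not global */
    for (e = b->chain; e != NULL; e = e->next)
    {
        if (e->key == key)
        {
            result = e->value;
            break;
        }
    }
    pthread_mutex_unlock(&b->lock);
    return result;
}

The buffer aging strategy mentioned above is the harder half of the problem; per-bucket locks only cover the lookup side.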