Thread: spinlock contention
On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote: > On Jun12, 2011, at 23:39 , Robert Haas wrote: >> So, the majority (60%) of the excess spinning appears to be due to >> SInvalReadLock. A good chunk are due to ProcArrayLock (25%). > > Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32. > However, cache lines are 64 bytes large on recent Intel CPUs AFAIK, > so I guess that two adjacent LWLocks currently share one cache line. > > Currently, the ProcArrayLock has index 4 while SInvalReadLock has > index 5, so if I'm not mistaken exactly the two locks where you saw > the largest contention on are on the same cache line... > > Might make sense to try and see if these numbers change if you > either make LWLockPadded 64bytes or arrange the locks differently... I fooled around with this a while back and saw no benefit. It's possible a more careful test would turn up something, but I think the only real way forward here is going to be to eliminate some of that locking altogether. SInvalReadLock is a good example of a lock which seems barely to be necessary at all. Except on extraordinarily rare occasions, it is only ever taken in share mode. Only SICleanupQueue() needs to lock out readers, and in the (probably common) case where all backends are reasonably close to being caught up, it wouldn't even matter if the determination of minMsgNum were a little bit fuzzy. I've been straining my brain trying to figure out whether there's some way to rewrite this logic so that it can run concurrently with readers; then we could get rid of SInvalReadLock altogether. I have a feeling if I were 25% smarter I could do it, but I'm not. So my fallback position is to try to find some way to optimize the common case where there are no messages to be read. We're already allowing readers and writers to run in parallel, so it must not be important for a reader to see a message that a writer is in the process of writing, but has not yet finished writing. I believe the idea here is that sinval messages should be emitted before releasing the related locks, so if we've successfully acquired a lock on an object, then any pertinent sinval messages must already be completely written. What I'm thinking we could do is just keep a global counter indicating how many messages have been written, and each backend could keep a counter indicating the highest-numbered message it has so far read. When a message is written, the global counter is bumped. When a backend goes to AcceptInvalidationMessages(), it compares the global counter to its own counter, and if they're the same, then it doesn't bother taking SInvalReadLock and just exits quickly. There is an obvious (potential) memory-ordering problem here. If it's possible for a backend doing AcceptInvalidationMessages() to see a stale version of the counter, then it might fail to read messages from the queue that it really ought to have seen. Protecting the global counter with a spinlock defeats the purpose of doing any of this in the first place, because the whole problem is too many people fighting over the spinlock protecting SInvalReadLock. It should be sufficient, though, for the writing backend to execute a memory-barrier operation just after bumping the global counter and the reading backend to execute one just before. As it happens, LWLockRelease() is such a barrier, since it must acquire and release a spinlock, so we should be able to just bump the counter right before releasing the LWLock and call it good. 
On the reading side, the immediately-preceding lock acquisition seems like it ought to be sufficient - surely it can't be possible to acquire a heavyweight lock on an object without seeing memory updates that some other process made before releasing the same heavyweight lock. A possible objection here is that a 32-bit counter might wrap around, giving a false negative, and a read from a 64-bit counter might not be atomic. In practice, the chances of a problem here seem remote, but I think we can guard against it easily enough. We can put each backend's counter of messages read into shared memory, protected by a per-backend lock (probably the same LWLock the fast relation lock patch already introduces). When a backend compares its counter to the global counter, it does so while holding this lock. When SICleanupQueue() resets a backend, it grabs the lock and sets that backend's counter to a special value (like 0) that we guarantee will never match the global counter (which we can cause to wrap from 2^32-1 to 1). So then AcceptInvalidationMessages() - in the common case where there are no invalidation messages to process - will only need the per-backend LWLock rather than fighting over SInvalLock. If anyone has read this far, you have my undying gratitude.... especially if you respond with comments. ProcArrayLock looks like a tougher nut to crack - there's simply no way, with the system we have right now, that you can take a snapshot without locking the list of running processes. I'm not sure what to do about that, but we're probably going to have to come up with something, because it seems clear that once we eliminate the lock manager LWLock contention, this is a major bottleneck. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
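To make the fast path concrete, here is a minimal sketch of the counter scheme described above, using a C11 acquire load to stand in for the barrier implied by LWLockRelease(); the struct, field, and function names (SInvalCounters, maxMsgNum, NoNewInvalMessages) are invented for illustration and are not existing PostgreSQL APIs.

#include <stdbool.h>
#include <stdint.h>
#include <stdatomic.h>

/* Hypothetical shared-memory state; all names here are illustrative only. */
typedef struct SInvalCounters
{
    _Atomic uint32_t maxMsgNum;     /* bumped by a writer just before it
                                     * releases the sinval write lock */
} SInvalCounters;

static SInvalCounters *sharedCounters;   /* points into shared memory */
static uint32_t        lastMsgNumRead;   /* this backend's private copy */

/*
 * Returns true when there is provably nothing new to read, so the caller
 * can skip SInvalReadLock entirely.  The acquire load pairs with the
 * barrier implied by the writer's LWLockRelease().
 */
static bool
NoNewInvalMessages(void)
{
    uint32_t global = atomic_load_explicit(&sharedCounters->maxMsgNum,
                                           memory_order_acquire);

    return global == lastMsgNumRead;
}

When the counters differ, the backend would fall back to the existing SInvalReadLock path and update lastMsgNumRead only after draining the queue.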
On 23.06.2011 18:42, Robert Haas wrote: > ProcArrayLock looks like a tougher nut to crack - there's simply no > way, with the system we have right now, that you can take a snapshot > without locking the list of running processes. I'm not sure what to > do about that, but we're probably going to have to come up with > something, because it seems clear that once we eliminate the lock > manager LWLock contention, this is a major bottleneck. ProcArrayLock is currently held for a relatively long period of time when a snapshot is taken, especially if there's a lot of backends. There are three operations to the procarray: 1. Advertising a new xid belonging to my backend. 2. Ending a transaction. 3. Getting a snapshot. Advertising a new xid is currently done without a lock, assuming that setting an xid in shared memory is atomic. To end a transaction, you acquire ProcArrayLock in exclusive mode. To get a snapshot, you acquire ProcArrayLock in shared mode, and scan the whole procarray. I wonder if it would be a better tradeoff to keep a "materialized" snapshot in shared memory that's updated every time a new xid is assigned or a transaction ends. Getting a snapshot would become much cheaper, as you could just memcpy the ready-built snapshot from shared memory into local memory. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
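A rough sketch of what the "materialized snapshot" idea might look like; the type and field names are invented, a fixed-size array stands in for the real structures, and the LWLock calls are left as comments since the point is only that the reader's side shrinks to a memcpy() under a briefly-held shared lock.

#include <string.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define MAX_CACHED_XIDS 1024            /* illustrative bound only */

/* Hypothetical ready-built snapshot kept in shared memory. */
typedef struct MaterializedSnapshot
{
    TransactionId xmin;
    TransactionId xmax;
    int           xcnt;
    TransactionId xip[MAX_CACHED_XIDS];
} MaterializedSnapshot;

static MaterializedSnapshot *sharedSnap;    /* rebuilt whenever an xid is
                                             * assigned or a transaction ends */

/* Taking a snapshot becomes a short critical section plus one copy. */
static void
CopySnapshot(MaterializedSnapshot *dst)
{
    /* LWLockAcquire(ProcArrayLock, LW_SHARED);   -- held only for the copy */
    memcpy(dst, sharedSnap, sizeof(MaterializedSnapshot));
    /* LWLockRelease(ProcArrayLock); */
}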
On Jun23, 2011, at 17:42 , Robert Haas wrote: > On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote: >> On Jun12, 2011, at 23:39 , Robert Haas wrote: >>> So, the majority (60%) of the excess spinning appears to be due to >>> SInvalReadLock. A good chunk are due to ProcArrayLock (25%). >> >> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32. >> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK, >> so I guess that two adjacent LWLocks currently share one cache line. >> >> Currently, the ProcArrayLock has index 4 while SInvalReadLock has >> index 5, so if I'm not mistaken exactly the two locks where you saw >> the largest contention on are on the same cache line... >> >> Might make sense to try and see if these numbers change if you >> either make LWLockPadded 64bytes or arrange the locks differently... > > I fooled around with this a while back and saw no benefit. It's > possible a more careful test would turn up something, but I think the > only real way forward here is going to be to eliminate some of that > locking altogether. It seems hard to believe that there isn't some effect of two locks sharing a cache line. There are architectures (even some of the Intel architectures, I believe) where cache lines are 32 bit, though - so measuring this certainly isn't easy. Also, I'm absolute no expert for this, so it might very well be that' I'm missing something fundamental. > SInvalReadLock is a good example of a lock which seems barely to be > necessary at all. Except on extraordinarily rare occasions, it is > only ever taken in share mode. This is the case we should optimize for, I think. Currently, there's contention even between two SHARE lockers of a LWLock because our LWLock implementation uses a spinlock underneath. I've googled around a bit, and found this: http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git It's an userspace rwlock implementation based on atomic instructions and futexes (for blocking) written by Linus Torwalds as a response to the following thread http://lwn.net/Articles/284741/ It seems to require only a single atomic increment instruction to acquire a share lock in the uncontested case, and a single compare- and-exchange instruction to acquire an exlcusive lock (again in the uncontested case). The code doesn't seem to have a license attached, and seems to have been written over a weekend and never touched since, so we might not want (or be able to) to use this directly. But it'd still be interesting to see if replacing our LWLocks with this implementation affects throughput. > Only SICleanupQueue() needs to lock > out readers, and in the (probably common) case where all backends are > reasonably close to being caught up, it wouldn't even matter if the > determination of minMsgNum were a little bit fuzzy. I've been > straining my brain trying to figure out whether there's some way to > rewrite this logic so that it can run concurrently with readers; then > we could get rid of SInvalReadLock altogether. I have a feeling if I > were 25% smarter I could do it, but I'm not. This sounds like something which could be done with RCU (Read-Copy-Update) in a lockless way. The idea is to copy the data structure (or parts) of it before updating it, and to "commit" the change afterwards by adjusting a single pointer to point to the new structure (kinda like we do atomic commits by an atomic update of one bit in the clog). 
The hard part is usually to figure when it's OK to clean up or reuse the old version of the data structure - for that, you need to know that all previous readers have completed. Here's a rough description of how that could work for the invalidation queue. We'd have two queue structures in shared memory, each containing a version number. We'd also have a "current" pointer pointing to the most recent one. Each reader would publish the version number of the queue he's about to read (the "current" one) in shared memory (and issue a write barrier). SICleanupQueue() would check whether the non-current queue structure is unused by comparing its version to the published versions of all backends (after issuing a read barrier, and after taking a lock that prevents concurrent SICleanupQueue runs of course). If there still are users of that version, SICleanupQueue() out-waits them. Afterwards, it simply copies the current structure over the old one, does the necessary cleanups, increments the version number to be larger than the one of the "current" structure, and *then* updates the "current" pointer. > So my fallback position is to try to find some way to optimize the > common case where there are no messages to be read. We're already > allowing readers and writers to run in parallel, so it must not be > important for a reader to see a message that a writer is in the > process of writing, but has not yet finished writing. I believe the > idea here is that sinval messages should be emitted before releasing > the related locks, so if we've successfully acquired a lock on an > object, then any pertinent sinval messages must already be completely > written. What I'm thinking we could do is just keep a global counter > indicating how many messages have been written, and each backend could > keep a counter indicating the highest-numbered message it has so far > read. When a message is written, the global counter is bumped. When > a backend goes to AcceptInvalidationMessages(), it compares the global > counter to its own counter, and if they're the same, then it doesn't > bother taking SInvalReadLock and just exits quickly. Sounds sensible. I do fear, however, that if we start to micro-optimize around every single LWLock we're in for quite a maintenance burden in the future. Even if each single modification is relatively simple, the complexity still adds ups. So I still believe that it'd be better to start with optimizing LWLocks in general, and to only touch the LWLocks which are still hot spots afterwards. > There is an obvious (potential) memory-ordering problem here. If it's > possible for a backend doing AcceptInvalidationMessages() to see a > stale version of the counter, then it might fail to read messages from > the queue that it really ought to have seen. Protecting the global > counter with a spinlock defeats the purpose of doing any of this in > the first place, because the whole problem is too many people fighting > over the spinlock protecting SInvalReadLock. It should be sufficient, > though, for the writing backend to execute a memory-barrier operation > just after bumping the global counter and the reading backend to > execute one just before. As it happens, LWLockRelease() is such a > barrier, since it must acquire and release a spinlock, so we should be > able to just bump the counter right before releasing the LWLock and > call it good. 
> On the reading side, the immediately-preceding lock > acquisition seems like it ought to be sufficient - surely it can't be > possible to acquire a heavyweight lock on an object without seeing > memory updates that some other process made before releasing the same > heavyweight lock.

If we start doing lockless algorithms, I really think we should add explicit barrier primitives. We could start out with something simple such as

#define MemoryBarrierWrite \
do { slock_t l; S_UNLOCK(&l); } while (0)
#define MemoryBarrierRead \
do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); } while (0)
#define MemoryBarrierReadWrite \
do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); S_UNLOCK(&l); } while (0)

But yeah, it seems that in the case of the sinval queue, the surrounding LWLocks actually provide the necessary barriers.

> A possible objection here is that a 32-bit counter might wrap around, > giving a false negative, and a read from a 64-bit counter might not be > atomic. In practice, the chances of a problem here seem remote, but I > think we can guard against it easily enough. We can put each > backend's counter of messages read into shared memory, protected by a > per-backend lock (probably the same LWLock the fast relation lock > patch already introduces). When a backend compares its counter to the > global counter, it does so while holding this lock. When > SICleanupQueue() resets a backend, it grabs the lock and sets that > backend's counter to a special value (like 0) that we guarantee will > never match the global counter (which we can cause to wrap from 2^32-1 > to 1). So then AcceptInvalidationMessages() - in the common case > where there are no invalidation messages to process - will only need > the per-backend LWLock rather than fighting over SInvalLock.

OTOH, don't all architectures that are interesting targets of performance tuning provide atomic access to 64-bit values? Even i386 has an 8-byte compare-and-exchange instruction I think...

best regards, Florian Pflug
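A sketch of how the two-copy scheme above might be expressed, with invented names and with C11 atomics standing in for the read and write barriers; a real implementation would also need the lock that serializes SICleanupQueue() runs and the out-waiting logic, which are omitted here.

#include <stdint.h>
#include <stdbool.h>
#include <stdatomic.h>

#define MAX_BACKENDS 128                            /* illustrative */

typedef struct SIQueue
{
    uint64_t version;
    /* ... message slots ... */
} SIQueue;

static SIQueue            queues[2];                /* the two copies */
static _Atomic(SIQueue *) currentQueue;             /* the "current" pointer */
static _Atomic uint64_t   readingVersion[MAX_BACKENDS];  /* 0 = not reading */

/* Reader: publish the version we are about to read, then re-check. */
static SIQueue *
SIQueueBeginRead(int backendId)
{
    for (;;)
    {
        SIQueue *q = atomic_load_explicit(&currentQueue, memory_order_acquire);

        atomic_store_explicit(&readingVersion[backendId], q->version,
                              memory_order_seq_cst);       /* write barrier */
        if (atomic_load_explicit(&currentQueue, memory_order_acquire) == q)
            return q;                               /* still current: safe */
        atomic_store_explicit(&readingVersion[backendId], 0,
                              memory_order_release); /* lost a race, retry */
    }
}

static void
SIQueueEndRead(int backendId)
{
    atomic_store_explicit(&readingVersion[backendId], 0, memory_order_release);
}

/* Cleanup side: may the non-current copy (version "old") be overwritten yet? */
static bool
SIQueueVersionUnused(uint64_t old)
{
    for (int i = 0; i < MAX_BACKENDS; i++)
        if (atomic_load_explicit(&readingVersion[i],
                                 memory_order_acquire) == old)
            return false;
    return true;
}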
On Thu, Jun 23, 2011 at 12:19 PM, Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> wrote: > On 23.06.2011 18:42, Robert Haas wrote: >> ProcArrayLock looks like a tougher nut to crack - there's simply no >> way, with the system we have right now, that you can take a snapshot >> without locking the list of running processes. I'm not sure what to >> do about that, but we're probably going to have to come up with >> something, because it seems clear that once we eliminate the lock >> manager LWLock contention, this is a major bottleneck. > > ProcArrayLock is currently held for a relatively long period of time when a > snapshot is taken, especially if there's a lot of backends. There are three > operations to the procarray: > > 1. Advertising a new xid belonging to my backend. > 2. Ending a transaction. > 3. Getting a snapshot. > > Advertising a new xid is currently done without a lock, assuming that > setting an xid in shared memory is atomic. The issue isn't just whether it's atomic, but whether some other backend scanning the ProcArray just afterwards might fail to see the update due to memory-ordering effects. But I think even if they do, there's no real harm done. Since we're holding XidGenLock, we know that the XID we are advertising is higher than any XID previously advertised, and in particular it must be higher than latestCompletedXid, so GetSnapshotdata() will ignore it anyway. I am rather doubtful that this code is strictly safe: myproc->subxids.xids[nxids] = xid; myproc->subxids.nxids = nxids+ 1; In practice, the window for a race condition is minimal, because (a) in many cases, nxids will be on the same cache line as the xid being set; (b) even if it's not, we almost immediately release XidGenLock, which acts as a memory barrier; (c) on many common architectures, such as x86, stores are guaranteed to become visible in execution order; (d) if, on some architecture, you manged to see the stores out of order and thus loaded a garbage XID from the end of the array, it might happen to be zero (which would be ignored, as a non-normal transaction ID) or the transaction might happen never to examine a tuple with that XID anyway. > To end a transaction, you acquire > ProcArrayLock in exclusive mode. To get a snapshot, you acquire > ProcArrayLock in shared mode, and scan the whole procarray. > > I wonder if it would be a better tradeoff to keep a "materialized" snapshot > in shared memory that's updated every time a new xid is assigned or a > transaction ends. Getting a snapshot would become much cheaper, as you could > just memcpy the ready-built snapshot from shared memory into local memory. At least in the case of the pgbench -S workload, I doubt that would help at all. The problem isn't that the LWLocks are being held for too long -- they are all being held in shared mode and don't conflict anyway. The problem is that everyone has to acquire and release the spinlock in order to acquire the LWLock, and again to release it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
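For the two stores quoted above, the safety argument depends on them becoming visible in order. A hedged sketch of how a write barrier would close that window, using a C11 release store since PostgreSQL had no portable barrier primitives at the time; the struct is a simplified stand-in for the real per-backend cache.

#include <stdint.h>
#include <stdatomic.h>

typedef uint32_t TransactionId;

/* Simplified stand-in for the per-backend subtransaction xid cache. */
typedef struct SubXidCache
{
    TransactionId xids[64];
    _Atomic int   nxids;
} SubXidCache;

/*
 * Publish a new subtransaction xid.  The release store on nxids guarantees
 * that any reader who sees the new count (via an acquire load) also sees
 * the xid written just before it, closing the reordering window discussed
 * above.
 */
static void
PublishSubXid(SubXidCache *cache, TransactionId xid)
{
    int nxids = atomic_load_explicit(&cache->nxids, memory_order_relaxed);

    cache->xids[nxids] = xid;                           /* plain store */
    atomic_store_explicit(&cache->nxids, nxids + 1,     /* store with release */
                          memory_order_release);
}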
On Thu, Jun 23, 2011 at 1:34 PM, Florian Pflug <fgp@phlo.org> wrote: > On Jun23, 2011, at 17:42 , Robert Haas wrote: >> On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote: >>> On Jun12, 2011, at 23:39 , Robert Haas wrote: >>>> So, the majority (60%) of the excess spinning appears to be due to >>>> SInvalReadLock. A good chunk are due to ProcArrayLock (25%). >>> >>> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32. >>> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK, >>> so I guess that two adjacent LWLocks currently share one cache line. >>> >>> Currently, the ProcArrayLock has index 4 while SInvalReadLock has >>> index 5, so if I'm not mistaken exactly the two locks where you saw >>> the largest contention on are on the same cache line... >>> >>> Might make sense to try and see if these numbers change if you >>> either make LWLockPadded 64bytes or arrange the locks differently... >> >> I fooled around with this a while back and saw no benefit. It's >> possible a more careful test would turn up something, but I think the >> only real way forward here is going to be to eliminate some of that >> locking altogether. > > It seems hard to believe that there isn't some effect of two locks > sharing a cache line. There are architectures (even some of the > Intel architectures, I believe) where cache lines are 32 bit, though - > so measuring this certainly isn't easy. Also, I'm absolute no > expert for this, so it might very well be that' I'm missing something > fundamental. > >> SInvalReadLock is a good example of a lock which seems barely to be >> necessary at all. Except on extraordinarily rare occasions, it is >> only ever taken in share mode. > > This is the case we should optimize for, I think. Currently, there's > contention even between two SHARE lockers of a LWLock because our > LWLock implementation uses a spinlock underneath. I've googled around > a bit, and found this: > > http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git > > It's an userspace rwlock implementation based on atomic instructions > and futexes (for blocking) written by Linus Torwalds as a response > to the following thread > > http://lwn.net/Articles/284741/ > > It seems to require only a single atomic increment instruction to > acquire a share lock in the uncontested case, and a single compare- > and-exchange instruction to acquire an exlcusive lock (again in > the uncontested case). > > The code doesn't seem to have a license attached, and seems to have > been written over a weekend and never touched since, so we might > not want (or be able to) to use this directly. But it'd still be > interesting to see if replacing our LWLocks with this implementation > affects throughput. > >> Only SICleanupQueue() needs to lock >> out readers, and in the (probably common) case where all backends are >> reasonably close to being caught up, it wouldn't even matter if the >> determination of minMsgNum were a little bit fuzzy. I've been >> straining my brain trying to figure out whether there's some way to >> rewrite this logic so that it can run concurrently with readers; then >> we could get rid of SInvalReadLock altogether. I have a feeling if I >> were 25% smarter I could do it, but I'm not. > > This sounds like something which could be done with RCU (Read-Copy-Update) > in a lockless way. 
The idea is to copy the data structure (or parts) > of it before updating it, and to "commit" the change afterwards by > adjusting a single pointer to point to the new structure (kinda like > we do atomic commits by an atomic update of one bit in the clog). > > The hard part is usually to figure when it's OK to clean up or > reuse the old version of the data structure - for that, you need to > know that all previous readers have completed. > > Here's a rough description of how that could work for the invalidation > queue. We'd have two queue structures in shared memory, each containing > a version number. We'd also have a "current" pointer pointing to the > most recent one. Each reader would publish the version number of the > queue he's about to read (the "current" one) in shared memory > (and issue a write barrier). SICleanupQueue() would check whether the > non-current queue structure is unused by comparing its version to the > published versions of all backends (after issuing a read barrier, and > after taking a lock that prevents concurrent SICleanupQueue runs of > course). If there still are users of that version, SICleanupQueue() > out-waits them. Afterwards, it simply copies the current structure > over the old one, does the necessary cleanups, increments the version > number to be larger than the one of the "current" structure, > and *then* updates the "current" pointer. > >> So my fallback position is to try to find some way to optimize the >> common case where there are no messages to be read. We're already >> allowing readers and writers to run in parallel, so it must not be >> important for a reader to see a message that a writer is in the >> process of writing, but has not yet finished writing. I believe the >> idea here is that sinval messages should be emitted before releasing >> the related locks, so if we've successfully acquired a lock on an >> object, then any pertinent sinval messages must already be completely >> written. What I'm thinking we could do is just keep a global counter >> indicating how many messages have been written, and each backend could >> keep a counter indicating the highest-numbered message it has so far >> read. When a message is written, the global counter is bumped. When >> a backend goes to AcceptInvalidationMessages(), it compares the global >> counter to its own counter, and if they're the same, then it doesn't >> bother taking SInvalReadLock and just exits quickly. > > Sounds sensible. I do fear, however, that if we start to micro-optimize > around every single LWLock we're in for quite a maintenance burden in > the future. Even if each single modification is relatively simple, the > complexity still adds ups. So I still believe that it'd be better to > start with optimizing LWLocks in general, and to only touch the LWLocks > which are still hot spots afterwards. > >> There is an obvious (potential) memory-ordering problem here. If it's >> possible for a backend doing AcceptInvalidationMessages() to see a >> stale version of the counter, then it might fail to read messages from >> the queue that it really ought to have seen. Protecting the global >> counter with a spinlock defeats the purpose of doing any of this in >> the first place, because the whole problem is too many people fighting >> over the spinlock protecting SInvalReadLock. It should be sufficient, >> though, for the writing backend to execute a memory-barrier operation >> just after bumping the global counter and the reading backend to >> execute one just before. 
As it happens, LWLockRelease() is such a >> barrier, since it must acquire and release a spinlock, so we should be >> able to just bump the counter right before releasing the LWLock and >> call it good. On the reading side, the immediately-preceding lock >> acquisition seems like it ought to be sufficient - surely it can't be >> possible to acquire a heavyweight lock on an object without seeing >> memory updates that some other process made before releasing the same >> heavyweight lock. > > If we start doing lockless algorithms, I really think we should add > explicit barrier primitives. We could start out with something > simple such as > > #define MemoryBarrierWrite \ > do { slock_t l; S_UNLOCK(l); } while (0) > #define MemoryBarrierRead \ > do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); } > #define MemoryBarrierReadWrite \ > do { slock_t; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); } > > But yeah, it seems that in the case of the sinval queue, the surrounding > LWLocks actually provide the necessary barriers. > >> A possible objection here is that a 32-bit counter might wrap around, >> giving a false negative, and a read from a 64-bit counter might not be >> atomic. In practice, the chances of a problem here seem remote, but I >> think we can guard against it easily enough. We can put each >> backend's counter of messages read into shared memory, protected by a >> per-backend lock (probably the same LWLock the fast relation lock >> patch already introduces). When a backend compares its counter to the >> global counter, it does so while holding this lock. When >> SICleanupQueue() resets a backend, it grabs the lock and sets that >> backend's counter to a special value (like 0) that we guarantee will >> never match the global counter (which we can cause to wrap from 2^32-1 >> to 1). So then AcceptInvalidationMessages() - in the common case >> where there are no invalidation messages to process - will only need >> the per-backend LWLock rather than fighting over SInvalLock. > > OTOH, don't all architectures that are interesting targets of > performance tuning provide atomic access to 64-bit values? Even i386 > has an 8-byte compare-and-exchange instruction I think... yup, 'lock cmpxchg8b'. see: http://www.niallryan.com/node/137 for a way to do it via fpu without a loop. the 'lock' is a bus lock. merlin
On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug <fgp@phlo.org> wrote: > It seems hard to believe that there isn't some effect of two locks > sharing a cache line. There are architectures (even some of the > Intel architectures, I believe) where cache lines are 32 bit, though - > so measuring this certainly isn't easy. Also, I'm absolute no > expert for this, so it might very well be that' I'm missing something > fundamental. Well, I'm sure there is some effect, but my experiments seem to indicate that it's not a very important one. Again, please feel free to provide contrary evidence. I think the basic issue is that - in the best possible case - padding the LWLocks so that you don't have two locks sharing a cache line can reduce contention on the busier lock by at most 2x. (The less busy lock may get a larger reduction but that may not help you much.) If you what you really need is for the contention to decrease by 1000x, you're just not really moving the needle. That's why the basic fast-relation-lock patch helps so much: it replaces a system where every lock request results in contention with a system where NONE of them do. >> SInvalReadLock is a good example of a lock which seems barely to e >> necessary at all. Except on extraordinarily rare occasions, it is >> only ever taken in share mode. > > This is the case we should optimize for, I think. Currently, there's > contention even between two SHARE lockers of a LWLock because our > LWLock implementation uses a spinlock underneath. I've googled around > a bit, and found this: > > http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git > > It's an userspace rwlock implementation based on atomic instructions > and futexes (for blocking) written by Linus Torwalds as a response > to the following thread > > http://lwn.net/Articles/284741/ > > It seems to require only a single atomic increment instruction to > acquire a share lock in the uncontested case, and a single compare- > and-exchange instruction to acquire an exlcusive lock (again in > the uncontested case). > > The code doesn't seem to have a license attached, and seems to have > been written over a weekend and never touched since, so we might > not want (or be able to) to use this directly. But it'd still be > interesting to see if replacing our LWLocks with this implementation > affects throughput. I tried rewriting the LWLocks using CAS. It actually seems to make things slightly worse on the tests I've done so far, perhaps because I didn't make it respect spins_per_delay. Perhaps fetch-and-add would be better, but I'm not holding my breath. Everything I'm seeing so far leads me to the belief that we need to get rid of the contention altogether, not just contend more quickly. >> Only SICleanupQueue() needs to lock >> out readers, and in the (probably common) case where all backends are >> reasonably close to being caught up, it wouldn't even matter if the >> determination of minMsgNum were a little bit fuzzy. I've been >> straining my brain trying to figure out whether there's some way to >> rewrite this logic so that it can run concurrently with readers; then >> we could get rid of SInvalReadLock altogether. I have a feeling if I >> were 25% smarter I could do it, but I'm not. > > This sounds like something which could be done with RCU (Read-Copy-Update) > in a lockless way. 
> The idea is to copy the data structure (or parts) > of it before updating it, and to "commit" the change afterwards by > adjusting a single pointer to point to the new structure (kinda like > we do atomic commits by an atomic update of one bit in the clog). > > The hard part is usually to figure when it's OK to clean up or > reuse the old version of the data structure - for that, you need to > know that all previous readers have completed. > > Here's a rough description of how that could work for the invalidation > queue. We'd have two queue structures in shared memory, each containing > a version number. We'd also have a "current" pointer pointing to the > most recent one. Each reader would publish the version number of the > queue he's about to read (the "current" one) in shared memory > (and issue a write barrier). SICleanupQueue() would check whether the > non-current queue structure is unused by comparing its version to the > published versions of all backends (after issuing a read barrier, and > after taking a lock that prevents concurrent SICleanupQueue runs of > course). If there still are users of that version, SICleanupQueue() > out-waits them. Afterwards, it simply copies the current structure > over the old one, does the necessary cleanups, increments the version > number to be larger than the one of the "current" structure, > and *then* updates the "current" pointer.

Interesting idea.

>> So my fallback position is to try to find some way to optimize the >> common case where there are no messages to be read. We're already >> allowing readers and writers to run in parallel, so it must not be >> important for a reader to see a message that a writer is in the >> process of writing, but has not yet finished writing. I believe the >> idea here is that sinval messages should be emitted before releasing >> the related locks, so if we've successfully acquired a lock on an >> object, then any pertinent sinval messages must already be completely >> written. What I'm thinking we could do is just keep a global counter >> indicating how many messages have been written, and each backend could >> keep a counter indicating the highest-numbered message it has so far >> read. When a message is written, the global counter is bumped. When >> a backend goes to AcceptInvalidationMessages(), it compares the global >> counter to its own counter, and if they're the same, then it doesn't >> bother taking SInvalReadLock and just exits quickly. > > Sounds sensible. I do fear, however, that if we start to micro-optimize > around every single LWLock we're in for quite a maintenance burden in > the future. Even if each single modification is relatively simple, the > complexity still adds ups. So I still believe that it'd be better to > start with optimizing LWLocks in general, and to only touch the LWLocks > which are still hot spots afterwards.

I was hoping to do it that way, too, but I have so far been unable to demonstrate a performance benefit from that approach.

>> There is an obvious (potential) memory-ordering problem here. If it's >> possible for a backend doing AcceptInvalidationMessages() to see a >> stale version of the counter, then it might fail to read messages from >> the queue that it really ought to have seen. Protecting the global >> counter with a spinlock defeats the purpose of doing any of this in >> the first place, because the whole problem is too many people fighting >> over the spinlock protecting SInvalReadLock.
It should be sufficient, >> though, for the writing backend to execute a memory-barrier operation >> just after bumping the global counter and the reading backend to >> execute one just before. As it happens, LWLockRelease() is such a >> barrier, since it must acquire and release a spinlock, so we should be >> able to just bump the counter right before releasing the LWLock and >> call it good. On the reading side, the immediately-preceding lock >> acquisition seems like it ought to be sufficient - surely it can't be >> possible to acquire a heavyweight lock on an object without seeing >> memory updates that some other process made before releasing the same >> heavyweight lock. > > If we start doing lockless algorithms, I really think we should add > explicit barrier primitives. We could start out with something > simple such as > > #define MemoryBarrierWrite \ > do { slock_t l; S_UNLOCK(l); } while (0) > #define MemoryBarrierRead \ > do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); } > #define MemoryBarrierReadWrite \ > do { slock_t; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); } > > But yeah, it seems that in the case of the sinval queue, the surrounding > LWLocks actually provide the necessary barriers. Yes, a set of general memory barrier primitives would be handy. Unfortunately, it would probably be quite a bit of work to develop this and verify that it works properly on all supported platforms. >> A possible objection here is that a 32-bit counter might wrap around, >> giving a false negative, and a read from a 64-bit counter might not be >> atomic. In practice, the chances of a problem here seem remote, but I >> think we can guard against it easily enough. We can put each >> backend's counter of messages read into shared memory, protected by a >> per-backend lock (probably the same LWLock the fast relation lock >> patch already introduces). When a backend compares its counter to the >> global counter, it does so while holding this lock. When >> SICleanupQueue() resets a backend, it grabs the lock and sets that >> backend's counter to a special value (like 0) that we guarantee will >> never match the global counter (which we can cause to wrap from 2^32-1 >> to 1). So then AcceptInvalidationMessages() - in the common case >> where there are no invalidation messages to process - will only need >> the per-backend LWLock rather than fighting over SInvalLock. > > OTOH, don't all architectures that are interesting targets of > performance tuning provide atomic access to 64-bit values? Even i386 > has an 8-byte compare-and-exchange instruction I think... Maybe so... but if I can get by without needing that, so much the better. I don't really want to have to have a separate code path for architectures that don't support that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jun23, 2011, at 22:15 , Robert Haas wrote: > On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug <fgp@phlo.org> wrote: >> It seems hard to believe that there isn't some effect of two locks >> sharing a cache line. There are architectures (even some of the >> Intel architectures, I believe) where cache lines are 32 bit, though - >> so measuring this certainly isn't easy. Also, I'm absolute no >> expert for this, so it might very well be that' I'm missing something >> fundamental. > > Well, I'm sure there is some effect, but my experiments seem to > indicate that it's not a very important one. Again, please feel free > to provide contrary evidence. I think the basic issue is that - in > the best possible case - padding the LWLocks so that you don't have > two locks sharing a cache line can reduce contention on the busier > lock by at most 2x. (The less busy lock may get a larger reduction > but that may not help you much.) If you what you really need is for > the contention to decrease by 1000x, you're just not really moving the > needle. Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate the most heavily contested LWLocks for the others might still be worthwhile. > That's why the basic fast-relation-lock patch helps so much: > it replaces a system where every lock request results in contention > with a system where NONE of them do. > > I tried rewriting the LWLocks using CAS. It actually seems to make > things slightly worse on the tests I've done so far, perhaps because I > didn't make it respect spins_per_delay. Perhaps fetch-and-add would > be better, but I'm not holding my breath. Everything I'm seeing so > far leads me to the belief that we need to get rid of the contention > altogether, not just contend more quickly. Is there a patch available? How did you do the slow path (i.e. the case where there's contention and you need to block)? It seems to me that without some kernel support like futexes it's impossible to do better than LWLocks already do, because any simpler scheme like while (atomic_inc_post(lock) > 0) { atomic_dec(lock); block(lock); } for the shared-locker case suffers from a race condition (the lock might be released before you actually block()). >>> There is an obvious (potential) memory-ordering problem here. If it's >>> possible for a backend doing AcceptInvalidationMessages() to see a >>> stale version of the counter, then it might fail to read messages from >>> the queue that it really ought to have seen. Protecting the global >>> counter with a spinlock defeats the purpose of doing any of this in >>> the first place, because the whole problem is too many people fighting >>> over the spinlock protecting SInvalReadLock. It should be sufficient, >>> though, for the writing backend to execute a memory-barrier operation >>> just after bumping the global counter and the reading backend to >>> execute one just before. As it happens, LWLockRelease() is such a >>> barrier, since it must acquire and release a spinlock, so we should be >>> able to just bump the counter right before releasing the LWLock and >>> call it good. On the reading side, the immediately-preceding lock >>> acquisition seems like it ought to be sufficient - surely it can't be >>> possible to acquire a heavyweight lock on an object without seeing >>> memory updates that some other process made before releasing the same >>> heavyweight lock. >> >> If we start doing lockless algorithms, I really think we should add >> explicit barrier primitives. 
We could start out with something >> simple such as >> >> #define MemoryBarrierWrite \ >> do { slock_t l; S_UNLOCK(l); } while (0) >> #define MemoryBarrierRead \ >> do { slock_t l; S_INIT_LOCK(l); S_LOCK(l); } >> #define MemoryBarrierReadWrite \ >> do { slock_t; S_INIT_LOCK(l); S_LOCK(l); S_UNLOCK(l); } >> >> But yeah, it seems that in the case of the sinval queue, the surrounding >> LWLocks actually provide the necessary barriers. > > Yes, a set of general memory barrier primitives would be handy. > Unfortunately, it would probably be quite a bit of work to develop > this and verify that it works properly on all supported platforms. The idea would be to start out with something trivial like the above. Maybe with an #if for compilers which have something like GCC's __sync_synchronize(). We could then gradually add implementations for specific architectures, hopefully done by people who actually own the hardware and can test. best regards, Florian Pflug
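A sketch of the kind of starting point being proposed: use the GCC intrinsic where it exists and fall back to a dummy spinlock cycle elsewhere, on the assumption (made throughout this thread) that S_LOCK/S_UNLOCK act as full barriers. The macro names and the exact fallback are illustrative, not an existing PostgreSQL header; slock_t and the S_* macros come from s_lock.h.

/* Illustrative only; needs s_lock.h for slock_t, S_INIT_LOCK, S_LOCK, S_UNLOCK. */
#if defined(__GNUC__)
#define MemoryBarrierReadWrite()    __sync_synchronize()
#define MemoryBarrierRead()         __sync_synchronize()
#define MemoryBarrierWrite()        __sync_synchronize()
#else
#define MemoryBarrierReadWrite() \
    do { \
        volatile slock_t dummy_; \
        S_INIT_LOCK(&dummy_); \
        S_LOCK(&dummy_); \
        S_UNLOCK(&dummy_); \
    } while (0)
#define MemoryBarrierRead()         MemoryBarrierReadWrite()
#define MemoryBarrierWrite()        MemoryBarrierReadWrite()
#endif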
On Thu, Jun 23, 2011 at 5:35 PM, Florian Pflug <fgp@phlo.org> wrote: >> Well, I'm sure there is some effect, but my experiments seem to >> indicate that it's not a very important one. Again, please feel free >> to provide contrary evidence. I think the basic issue is that - in >> the best possible case - padding the LWLocks so that you don't have >> two locks sharing a cache line can reduce contention on the busier >> lock by at most 2x. (The less busy lock may get a larger reduction >> but that may not help you much.) If you what you really need is for >> the contention to decrease by 1000x, you're just not really moving the >> needle. > > Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate > the most heavily contested LWLocks for the others might still be > worthwhile. Hey, if we can show that it works, sign me up. >> That's why the basic fast-relation-lock patch helps so much: >> it replaces a system where every lock request results in contention >> with a system where NONE of them do. >> >> I tried rewriting the LWLocks using CAS. It actually seems to make >> things slightly worse on the tests I've done so far, perhaps because I >> didn't make it respect spins_per_delay. Perhaps fetch-and-add would >> be better, but I'm not holding my breath. Everything I'm seeing so >> far leads me to the belief that we need to get rid of the contention >> altogether, not just contend more quickly. > > Is there a patch available? How did you do the slow path (i.e. the > case where there's contention and you need to block)? It seems to > me that without some kernel support like futexes it's impossible > to do better than LWLocks already do, because any simpler scheme > like > while (atomic_inc_post(lock) > 0) { > atomic_dec(lock); > block(lock); > } > for the shared-locker case suffers from a race condition (the lock > might be released before you actually block()). Attached... > The idea would be to start out with something trivial like the above. > Maybe with an #if for compilers which have something like GCC's > __sync_synchronize(). We could then gradually add implementations > for specific architectures, hopefully done by people who actually > own the hardware and can test. Yes. But if we go that route, then we have to also support a code path for architectures for which we don't have that support. That's going to be more work, so I don't want to do it until we have a case where there is a good, clear benefit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
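The patch itself is not reproduced here. Purely as an illustration of the kind of compare-and-swap fast path under discussion - not the actual patch, whose lock-word layout and slow path will differ - a shared-mode acquire attempt might look like this:

#include <stdbool.h>
#include <stdint.h>
#include <stdatomic.h>

/*
 * Invented lock word: the high bit marks an exclusive holder, the low bits
 * count shared holders.  This only illustrates a CAS fast path; it is not
 * the layout used by the patch posted in this thread.
 */
#define LW_EXCLUSIVE_FLAG   0x80000000u

typedef struct CASLWLock
{
    _Atomic uint32_t state;
} CASLWLock;

/* Try to grab the lock in shared mode without touching any spinlock. */
static bool
LWLockCondAcquireShared(CASLWLock *lock)
{
    uint32_t old = atomic_load_explicit(&lock->state, memory_order_relaxed);

    while ((old & LW_EXCLUSIVE_FLAG) == 0)
    {
        if (atomic_compare_exchange_weak_explicit(&lock->state,
                                                  &old, old + 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return true;        /* got it; "old" is refreshed on failure */
    }
    return false;               /* exclusive holder present: take slow path */
}

When the CAS fails because an exclusive holder is present, the caller still has to fall back to a queue-and-sleep slow path, which is where the real complexity (and, as Robert notes, the lack of measured benefit so far) lives.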
On Thu, Jun 23, 2011 at 4:42 PM, Robert Haas <robertmhaas@gmail.com> wrote: > ProcArrayLock looks like a tougher nut to crack - there's simply no > way, with the system we have right now, that you can take a snapshot > without locking the list of running processes. I'm not sure what to > do about that, but we're probably going to have to come up with > something, because it seems clear that once we eliminate the lock > manager LWLock contention, this is a major bottleneck. Well as Tom observed earlier the kernel of a snapshot is actually a LSN. A snapshot contains a set of xids which all committed before some LSN and none which committed after it. So if we had a record of what log sequence number the commit record for any given transaction is we could build the snapshot at our leisure without any exclusive lock. In fact we could even build it lazily as a kind of cache only when we actually are interested in a given xid. -- greg
On Sat, Jun 25, 2011 at 8:26 PM, Greg Stark <stark@mit.edu> wrote: > On Thu, Jun 23, 2011 at 4:42 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> ProcArrayLock looks like a tougher nut to crack - there's simply no >> way, with the system we have right now, that you can take a snapshot >> without locking the list of running processes. I'm not sure what to >> do about that, but we're probably going to have to come up with >> something, because it seems clear that once we eliminate the lock >> manager LWLock contention, this is a major bottleneck. > > Well as Tom observed earlier the kernel of a snapshot is actually a > LSN. A snapshot contains a set of xids which all committed before some > LSN and none which committed after it. > > So if we had a record of what log sequence number the commit record > for any given transaction is we could build the snapshot at our > leisure without any exclusive lock. In fact we could even build it > lazily as a kind of cache only when we actually are interested in a > given xid. Yeah, I've been thinking about that. I think what we might do is set up a new SLRU that works like CLOG, but each entry is an LSN rather than just two bits. When a transaction commits, we save the commit LSN under the entry for that XID. We truncate away SLRU pages that contain no still-running XIDs. When we need to check whether an XID is visible to our snapshot, we just look up the commit LSN and compare it with our snapshot LSN. If it's before and non-zero, we can see it.If it's after or all-zeroes, we can't. But I'm not sure how much this would really help. It might (subject to working out the details) make the actual process of taking a snapshot faster. But it's not clear that taking snapshots more quickly will actually help anything, because the problem is not the amount of time spending taking the snapshot. The problem is rather that doing so requires acquiring and releasing an LWLock, and each of those operations requires taking and releasing a spinlock. And it is the spinlock contention that is killing us. That makes me think we need a way to take a snapshot without taking a spinlock. Or if we must take spinlocks, we at least have to avoid every backend that needs a snapshot lusting after the *same* spinlock. What I've been thinking about this weekend is whether it might be possible to create a sort of lock-free queue infrastructure. When a backend starts up, it would add an entry to the queue saying "I'm running". When it commits, it would add an entry to the queue saying "I'm committed". All entries would be added at the *end* of the queue, so a backend scanning the queue to build up a snapshot wouldn't ever be able to see commits out of order. We would need some memory barrier operations on weak-memory-ordering machines to make sure that the queue writes became visible before the end-of-queue pointer bump. The trick is figuring out how to clean up the queue. Since "commit" entries exist only to guard against "running" entries earlier in the queue, the start-of-queue pointer can be advanced whenever it points to a "commit" entry. Also, if it points to a "running" entry for which there is a later "commit" entry, then the start-of-queue pointer can be advanced over that as well. However, just because we can advance the point at which backends start reading doesn't mean that we can actually recycle space, because while we know that new scans needn't worry about those entries, we *don't* know that there isn't already a scan in flight that still needs them. 
Furthermore, if a transaction runs for a long time, we can never advance the start-of-queue pointer past the "running" entry for its XID, which is obviously problematic since the queue would get very long. To work around that problem, I think we could use Florian's idea upthread of an RCU system. We keep two copies of the queue around, an A copy and a B copy. When the currently active copy fills up, we rewrite it into the other queue, omitting all "committed" entries and any "running" entries that have matching "committed" entries, and then tell everyone to start looking at that copy instead. We would need some kind of gymnastics to make sure that we don't flip from the A copy to the B copy and back to the A copy while some laggardly backend is still hoping to scan the old A copy. A simple algorithm (there's probably a smarter one) would be to have each backend take a spinlock while it's scanning either copy, and to have the backend that is doing the rewrite take and release all of those spinlocks one at a time before beginning the rewrite, thus guaranteeing that any scans still in progress when the rewrite is requested have completed before it's actually performed. Any new scans started in the meanwhile will certainly be looking at the current copy rather than the old copy we're about to overwrite. We would still need a lock around the operation of adding new items to the queue; if two backends try to do that at the same time, chaos will ensue. But if that lock gets busy, it would be possible to have backends use some sort of system of elimination buffers to combine their requests. If ten backends each want to advertise a new XID, it will be far faster to have one backend acquire the lock, write in all ten entries, and wake everyone up than to have each backend insert its entry and wake up the next one. So what we might do is doing a condition-acquire on the lock to insert data into the buffer and, if we can't get it immediately, go find another backend with the same problem and elect a leader to do all the insertions. If we're the leader, sleep on the queue-insertion lock. If not, sleep until the leader wakes us up. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
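A sketch of the visibility test that the commit-LSN idea implies; GetCommitLSN() is a hypothetical lookup into the SLRU-of-LSNs described above, and the type definitions are simplified stand-ins.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;
typedef uint64_t XLogRecPtr;        /* an LSN */

/*
 * Hypothetical: return the LSN of the commit record for the given xid from
 * the SLRU sketched above, or 0 if the transaction has not committed.
 */
extern XLogRecPtr GetCommitLSN(TransactionId xid);

/*
 * With the snapshot reduced to a single LSN, visibility becomes: committed,
 * and committed before the snapshot was taken.
 */
static bool
XidVisibleInSnapshot(TransactionId xid, XLogRecPtr snapshotLSN)
{
    XLogRecPtr commitLSN = GetCommitLSN(xid);

    return commitLSN != 0 && commitLSN < snapshotLSN;
}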
On Jun23, 2011, at 23:40 , Robert Haas wrote: >>> I tried rewriting the LWLocks using CAS. It actually seems to make >>> things slightly worse on the tests I've done so far, perhaps because I >>> didn't make it respect spins_per_delay. Perhaps fetch-and-add would >>> be better, but I'm not holding my breath. Everything I'm seeing so >>> far leads me to the belief that we need to get rid of the contention >>> altogether, not just contend more quickly.

I got curious and put a tool together to benchmark this. The code can be found at https://github.com/fgp/lockbench. The tool starts N workers who each acquire a lock in SHARE mode, increment a per-worker counter and release the lock again, rinse, repeat. That is done for T seconds, after which it reports the sum of the individual counters, the elapsed (wall) time and the sums of the user and system times consumed by the workers. The lock implementations currently supported are:

atomicinc: RW-Lock using atomic increment/decrement instructions (locked xaddl) to acquire and release the lock in SHARE mode in the non-contested case. The contested case is unimplemented.

cmpxchng: Like atomicinc, but uses a "locked cmpxchng" loop instead of "locked xaddl".

spin: RW-Lock built around a simple cmpxchng-based spinlock (i.e., to lock/unlock, the spinlock is taken, the shared-lockers count is incremented/decremented, and the spinlock is released). Similar to pg_lwlock, but without the sleep() in the contested path of the spinlock. The contested case of the RW-Lock is again unimplemented.

none: No locking.

pg_lwlock: Postgres LWLock implementation. The contested case doesn't work because the workers currently don't have a valid MyProc.

pg_lwlock_cas: Like pg_lwlock but with your CAS patch applied.

On a 4-core Intel Xeon system (Intel Xeon X3430, 2.4 GHz, 8MB cache) I get the following results:

              |    1 worker     |    2 workers    |    3 workers    |    4 workers    |
              | wallsec usersec | wallsec usersec | wallsec usersec | wallsec usersec |
              | /cycle  /cycle  | /cycle  /cycle  | /cycle  /cycle  | /cycle  /cycle  |
--------------+-----------------+-----------------+-----------------+-----------------+
none          | 2.5e-09 2.5e-09 | 1.3e-09 2.5e-09 | 8.5e-10 2.5e-09 | 6.8e-10 2.5e-09 |
atomicinc     | 1.8e-08 1.8e-08 | 2.8e-08 5.6e-08 | 2.7e-08 8.1e-08 | 2.9e-08 1.1e-07 |
cmpxchng      | 3.0e-08 3.0e-08 | 7.1e-08 1.4e-07 | 6.9e-08 2.0e-07 | 7.2e-08 2.9e-07 |
spin          | 5.0e-08 5.0e-08 | 3.0e-07 5.9e-07 | 3.8e-07 1.1e-06 | 4.0e-07 1.5e-06 |
pg_lwlock     | 2.8e-08 2.8e-08 | 2.9e-08 3.0e-08 | 3.0e-08 3.2e-08 | 2.9e-08 3.1e-08 |
pg_lwlock_cas | 2.6e-08 2.6e-08 | 6.2e-08 1.2e-07 | 7.7e-08 2.3e-07 | 7.8e-08 3.1e-07 |
--------------+-----------------+-----------------+-----------------+-----------------+

These results allow the following conclusions to be drawn.

First, our current pg_lwlock is quite good at preserving resources, i.e. at not spinning excessively - at least for up to four workers. It's the only lock implementation that consistently uses about 30ns of processor time for one acquire/release cycle. Only atomicinc with one worker (i.e. no cacheline fighting) manages to beat that. It does, however, effectively serialize execution with this workload - the consumed user time is approximately equal to the elapsed wall clock time, even in the case of 4 workers, meaning that most of the time 3 of those 4 workers are sleeping.

Secondly, pg_lwlock_cas and cmpxchng perform similarly, which comes as no surprise, since they use effectively the same algorithm. One could expect spin and pg_lwlock to also perform similarly, but these two don't.
The reason is probably that spin uses a rather brain-dead spinning method, while pg_lwlock uses "rep; nop".

Now to atomicinc, which is (the fast-path of) the most optimized RW-lock possible, at least on x86 and x86_64. One interesting result here is that the wall time seconds / cycle are suspiciously similar for atomicinc and pg_lwlock. This proves, I think, that the limiting factor in both of these tests is the speed at which cache line invalidation can occur. It doesn't seem to matter whether the other cores are idle (as in the pg_lwlock case), or trying to obtain ownership of the cache line themselves (as in the atomicinc case).

Finally, the no-locking test (none) shows that, even though the counter is written to shared memory after every increment, memory bandwidth isn't an issue here, since performance scaled nearly linearly with the number of workers.

Here's the result of another run, this time with the per-worker counter being incremented 100 times in each acquire/release cycle.

-------------------------- 100 Increments per Cycle ---------------------------------
              |    1 worker     |    2 workers    |    3 workers    |    4 workers    |
              | wallsec usersec | wallsec usersec | wallsec usersec | wallsec usersec |
              | /cycle  /cycle  | /cycle  /cycle  | /cycle  /cycle  | /cycle  /cycle  |
--------------+-----------------+-----------------+-----------------+-----------------+
none          | 2.5e-07 2.5e-07 | 1.3e-07 2.5e-07 | 8.4e-08 2.5e-07 | 6.5e-08 2.6e-07 |
atomicinc     | 2.5e-07 2.5e-07 | 1.3e-07 2.6e-07 | 8.6e-08 2.5e-07 | 6.5e-08 2.6e-07 |
cmpxchng      | 2.5e-07 2.5e-07 | 1.3e-07 2.5e-07 | 8.6e-08 2.6e-07 | 7.8e-08 3.1e-07 |
spin          | 3.0e-07 3.0e-07 | 1.5e-07 3.0e-07 | 1.1e-07 3.2e-07 | 3.4e-07 1.4e-06 |
pg_lwlock     | 2.5e-07 2.5e-07 | 1.3e-07 2.5e-07 | 8.4e-08 2.5e-07 | 1.2e-07 4.8e-07 |
pg_lwlock_cas | 2.6e-07 2.6e-07 | 1.6e-07 3.1e-07 | 1.1e-07 3.2e-07 | 8.7e-08 3.4e-07 |
--------------+-----------------+-----------------+-----------------+-----------------+

This makes the overhead of the locking disappear in the noise in the single-worker case. With 4 workers, however, differences materialize. The no-locking case still shows approximately linear speedup though, which I think again rules out memory bandwidth effects. Now, atomicinc is the only implementation which doesn't perform significantly worse than the no-lock case. The closest competitors are the CAS-based implementations cmpxchng and pg_lwlock_cas, with 20% and 33% slowdown respectively (of both wall seconds and user seconds per cycle). The current LWLock implementation falls behind now, taking about twice as many seconds per cycle compared to atomicinc. In other words, with this workload an atomicinc-based RW-lock, taken in SHARE mode, should add virtually no overhead, while every other implementation adds at least 20%.

It'd be interesting to get benchmark results also on machines with more than four cores, but I don't have access to any such machine ATM. So if someone could run that on an 8- or 16-core beast, that'd be great. The code compiles on Mac OS X and Ubuntu 10.04. It's fairly portable; to make the postgres parts compile on other platforms, it might be necessary to adjust pg_stub/pg_config_manual.h.

best regards, Florian Pflug
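For reference, the uncontested share-lock path that the atomicinc case measures comes down to one locked add; the following is a hedged sketch using GCC __sync builtins with an invented lock-word layout, and may differ in detail from the actual lockbench code.

#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: the high bit marks an exclusive holder, the low bits
 * count shared holders.  One __sync_fetch_and_add ("locked xaddl" on x86)
 * is all the uncontested share-lock path needs.
 */
#define RW_EXCLUSIVE    0x80000000u

typedef struct AtomicRWLock
{
    volatile uint32_t state;
} AtomicRWLock;

static bool
rw_lock_shared_fast(AtomicRWLock *lock)
{
    uint32_t old = __sync_fetch_and_add(&lock->state, 1);

    if ((old & RW_EXCLUSIVE) == 0)
        return true;                        /* fast path: shared lock taken */

    __sync_fetch_and_sub(&lock->state, 1);  /* back off... */
    return false;                           /* ...and let the caller block */
}

static void
rw_unlock_shared(AtomicRWLock *lock)
{
    __sync_fetch_and_sub(&lock->state, 1);
}

As Florian pointed out earlier in the thread, the hard part is what to do once the fast path fails; that slow path is deliberately left out here, just as the contested case is unimplemented in the benchmark.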
On Tue, Jun 28, 2011 at 2:33 PM, Florian Pflug <fgp@phlo.org> wrote: > [ testing of various spinlock implementations ]

I set T=30 and N="1 2 4 8 16 32" and tried this out on a 32-core loaner from Nate Boley:

100 counter increments per cycle

worker              1               2               4               8              16              32
time             wall    user    wall    user    wall    user    wall    user    wall    user    wall    user
none          2.8e-07 2.8e-07 1.5e-07 3.0e-07 8.0e-08 3.2e-07 4.2e-08 3.3e-07 2.1e-08 3.3e-07 1.1e-08 3.4e-07
atomicinc     3.6e-07 3.6e-07 2.6e-07 5.1e-07 1.4e-07 5.5e-07 1.4e-07 1.1e-06 1.5e-07 2.3e-06 1.5e-07 4.9e-06
cmpxchng      3.6e-07 3.6e-07 3.4e-07 6.9e-07 3.2e-07 1.3e-06 2.9e-07 2.3e-06 4.2e-07 6.6e-06 4.5e-07 1.4e-05
spin          4.1e-07 4.1e-07 2.8e-07 5.7e-07 1.6e-07 6.3e-07 1.2e-06 9.4e-06 3.8e-06 6.1e-05 1.4e-05 4.3e-04
pg_lwlock     3.8e-07 3.8e-07 2.7e-07 5.3e-07 1.5e-07 6.2e-07 3.9e-07 3.1e-06 1.6e-06 2.5e-05 6.4e-06 2.0e-04
pg_lwlock_cas 3.7e-07 3.7e-07 2.8e-07 5.6e-07 1.4e-07 5.8e-07 1.4e-07 1.1e-06 1.9e-07 3.0e-06 2.4e-07 7.5e-06

I wrote a little script to reorganize this data in a possibly-easier-to-understand format - ordering each column from lowest to highest, and showing each algorithm as a multiple of the cheapest value for that column:

wall-1: none(1.0),atomicinc(1.3),cmpxchng(1.3),pg_lwlock_cas(1.3),pg_lwlock(1.4),spin(1.5)
user-1: none(1.0),atomicinc(1.3),cmpxchng(1.3),pg_lwlock_cas(1.3),pg_lwlock(1.4),spin(1.5)
wall-2: none(1.0),atomicinc(1.7),pg_lwlock(1.8),spin(1.9),pg_lwlock_cas(1.9),cmpxchng(2.3)
user-2: none(1.0),atomicinc(1.7),pg_lwlock(1.8),pg_lwlock_cas(1.9),spin(1.9),cmpxchng(2.3)
wall-4: none(1.0),atomicinc(1.7),pg_lwlock_cas(1.7),pg_lwlock(1.9),spin(2.0),cmpxchng(4.0)
user-4: none(1.0),atomicinc(1.7),pg_lwlock_cas(1.8),pg_lwlock(1.9),spin(2.0),cmpxchng(4.1)
wall-8: none(1.0),atomicinc(3.3),pg_lwlock_cas(3.3),cmpxchng(6.9),pg_lwlock(9.3),spin(28.6)
user-8: none(1.0),atomicinc(3.3),pg_lwlock_cas(3.3),cmpxchng(7.0),pg_lwlock(9.4),spin(28.5)
wall-16: none(1.0),atomicinc(7.1),pg_lwlock_cas(9.0),cmpxchng(20.0),pg_lwlock(76.2),spin(181.0)
user-16: none(1.0),atomicinc(7.0),pg_lwlock_cas(9.1),cmpxchng(20.0),pg_lwlock(75.8),spin(184.8)
wall-32: none(1.0),atomicinc(13.6),pg_lwlock_cas(21.8),cmpxchng(40.9),pg_lwlock(581.8),spin(1272.7)
user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7)

Here's the result of the same calculation applied to your second set of data:

wall-1: none(1.0),atomicinc(1.0),cmpxchng(1.0),pg_lwlock(1.0),pg_lwlock_cas(1.0),spin(1.2)
user-1: none(1.0),atomicinc(1.0),cmpxchng(1.0),pg_lwlock(1.0),pg_lwlock_cas(1.0),spin(1.2)
wall-2: none(1.0),atomicinc(1.0),cmpxchng(1.0),pg_lwlock(1.0),spin(1.2),pg_lwlock_cas(1.2)
user-2: none(1.0),cmpxchng(1.0),pg_lwlock(1.0),atomicinc(1.0),spin(1.2),pg_lwlock_cas(1.2)
wall-3: none(1.0),pg_lwlock(1.0),atomicinc(1.0),cmpxchng(1.0),spin(1.3),pg_lwlock_cas(1.3)
user-3: none(1.0),atomicinc(1.0),pg_lwlock(1.0),cmpxchng(1.0),spin(1.3),pg_lwlock_cas(1.3)
wall-4: none(1.0),atomicinc(1.0),cmpxchng(1.2),pg_lwlock_cas(1.3),pg_lwlock(1.8),spin(5.2)
user-4: none(1.0),atomicinc(1.0),cmpxchng(1.2),pg_lwlock_cas(1.3),pg_lwlock(1.8),spin(5.4)

There seems to be something a bit funky in your 3-core data, but overall I read this data to indicate that 4 cores aren't really enough to see a severe problem with spinlock contention. I'm not even sure that you can see this problem on a real workload on a 16-core system.
But as you add cores the problem gets rapidly worse, because the more cores you have, the more likely it is that someone else is already holding the spinlock (or has just very recently held the spinlock) that you want. And, of course, the problem multiplies itself, because once your lock acquisitions start taking longer, they become a larger percentage of your execution time, which makes it proportionately more likely that you'll find someone already there when you go to grab the lock. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote: > user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7) I may not be following all this correctly, but doesn't this suggest a huge potential upside for the cas based patch you posted upthread when combined with your earlier patches that were bogging down on spinlock contention? merlin
On Tue, Jun 28, 2011 at 5:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7) > > I may not be following all this correctly, but doesn't this suggest a > huge potential upside for the cas based patch you posted upthread when > combined with your earlier patches that were bogging down on spinlock > contentionl? Well, you'd think so, but in fact that patch makes it slower. Don't ask me why, 'cuz I dunno. :-( -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jun28, 2011, at 23:48 , Robert Haas wrote: > On Tue, Jun 28, 2011 at 5:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >> On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7) >> >> I may not be following all this correctly, but doesn't this suggest a >> huge potential upside for the cas based patch you posted upthread when >> combined with your earlier patches that were bogging down on spinlock >> contentionl? > > Well, you'd think so, but in fact that patch makes it slower. Don't > ask me why, 'cuz I dunno. :-( Huh? Where do you see your CAS patch being slower than unpatched LWLocks in these results? Or are you referring to pgbench runs you made, not to these artificial benchmarks? best regards, Florian Pflug
On Tue, Jun 28, 2011 at 5:55 PM, Florian Pflug <fgp@phlo.org> wrote: > On Jun28, 2011, at 23:48 , Robert Haas wrote: >> On Tue, Jun 28, 2011 at 5:33 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >>> On Tue, Jun 28, 2011 at 3:18 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>>> user-32: none(1.0),atomicinc(14.4),pg_lwlock_cas(22.1),cmpxchng(41.2),pg_lwlock(588.2),spin(1264.7) >>> >>> I may not be following all this correctly, but doesn't this suggest a >>> huge potential upside for the cas based patch you posted upthread when >>> combined with your earlier patches that were bogging down on spinlock >>> contentionl? >> >> Well, you'd think so, but in fact that patch makes it slower. Don't >> ask me why, 'cuz I dunno. :-( > > Huh? Where do you see your CAS patch being slower than unpatched LWLocks > in these results? Or are you referring to pgbench runs you made, not > to these artificial benchmarks? pgbench -S -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jun28, 2011, at 22:18 , Robert Haas wrote: > On Tue, Jun 28, 2011 at 2:33 PM, Florian Pflug <fgp@phlo.org> wrote: >> [ testing of various spinlock implementations ] > > I set T=30 and N="1 2 4 8 16 32" and tried this out on a 32-core > loaner from Nate Boley:

Cool, thanks!

> 100 counter increments per cycle > worker 1 2 4 8 16 32 > time wall user wall user wall user wall user wall user wall user > none 2.8e-07 2.8e-07 1.5e-07 3.0e-07 8.0e-08 3.2e-07 4.2e-08 3.3e-07 2.1e-08 3.3e-07 1.1e-08 3.4e-07 > atomicinc 3.6e-07 3.6e-07 2.6e-07 5.1e-07 1.4e-07 5.5e-07 1.4e-07 1.1e-06 1.5e-07 2.3e-06 1.5e-07 4.9e-06 > cmpxchng 3.6e-07 3.6e-07 3.4e-07 6.9e-07 3.2e-07 1.3e-06 2.9e-07 2.3e-06 4.2e-07 6.6e-06 4.5e-07 1.4e-05 > spin 4.1e-07 4.1e-07 2.8e-07 5.7e-07 1.6e-07 6.3e-07 1.2e-06 9.4e-06 3.8e-06 6.1e-05 1.4e-05 4.3e-04 > pg_lwlock 3.8e-07 3.8e-07 2.7e-07 5.3e-07 1.5e-07 6.2e-07 3.9e-07 3.1e-06 1.6e-06 2.5e-05 6.4e-06 2.0e-04 > pg_lwlock_cas 3.7e-07 3.7e-07 2.8e-07 5.6e-07 1.4e-07 5.8e-07 1.4e-07 1.1e-06 1.9e-07 3.0e-06 2.4e-07 7.5e-06

Here's the same table, formatted with spaces.

worker              1               2               4               8              16              32
time             wall    user    wall    user    wall    user    wall    user    wall    user    wall    user
none          2.8e-07 2.8e-07 1.5e-07 3.0e-07 8.0e-08 3.2e-07 4.2e-08 3.3e-07 2.1e-08 3.3e-07 1.1e-08 3.4e-07
atomicinc     3.6e-07 3.6e-07 2.6e-07 5.1e-07 1.4e-07 5.5e-07 1.4e-07 1.1e-06 1.5e-07 2.3e-06 1.5e-07 4.9e-06
cmpxchng      3.6e-07 3.6e-07 3.4e-07 6.9e-07 3.2e-07 1.3e-06 2.9e-07 2.3e-06 4.2e-07 6.6e-06 4.5e-07 1.4e-05
spin          4.1e-07 4.1e-07 2.8e-07 5.7e-07 1.6e-07 6.3e-07 1.2e-06 9.4e-06 3.8e-06 6.1e-05 1.4e-05 4.3e-04
pg_lwlock     3.8e-07 3.8e-07 2.7e-07 5.3e-07 1.5e-07 6.2e-07 3.9e-07 3.1e-06 1.6e-06 2.5e-05 6.4e-06 2.0e-04
pg_lwlock_cas 3.7e-07 3.7e-07 2.8e-07 5.6e-07 1.4e-07 5.8e-07 1.4e-07 1.1e-06 1.9e-07 3.0e-06 2.4e-07 7.5e-06

And here's the throughput table calculated from your results, i.e. the wall time per cycle relative to the wall time per cycle for 1 worker.

workers           2     4     8    16    32
none            1.9   3.5   6.7    13    26
atomicinc       1.4   2.6   2.6   2.4   2.4
cmpxchng        1.1   1.1   1.2   0.9   0.8
spin            1.5   2.6   0.3   0.1   0.0
pg_lwlock       1.4   2.5   1.0   0.2   0.1
pg_lwlock_cas   1.3   2.6   2.6   1.9   1.5

Hm, so in the best case we get 2.6x the throughput of a single core, and that only for 4 and 8 workers (1.4e-7 seconds/cycle vs. 3.6e-7). In that case, there also seems to be little difference between pg_lwlock{_cas} and atomicinc. atomicinc again manages to at least sustain that throughput when the worker count is increased, while for the others the throughput actually *decreases*.

What totally puzzles me is that your results don't show any trace of a decreased system load for the pg_lwlock implementation, which I'd have expected due to the sleep() in the contested path. Here are the user vs. wall time ratios - I'd have expected to see values significantly below the number of workers for pg_lwlock.

workers           1     2     4     8    16    32
none            1.0   2.0   4.0   7.9    16    31
atomicinc       1.0   2.0   3.9   7.9    15    33
cmpxchng        1.0   2.0   4.1   7.9    16    31
spin            1.0   2.0   3.9   7.8    16    31
pg_lwlock       1.0   2.0   4.1   7.9    16    31
pg_lwlock_cas   1.0   2.0   4.1   7.9    16    31

> I wrote a little script to show to reorganize this data in a > possibly-easier-to-understand format - ordering each column from > lowest to highest, and showing each algorithm as a multiple of the > cheapest value for that column:

If you're OK with that, I'd like to add that to the lockbench repo.

> There seems to be something a bit funky in your 3-core data, but > overall I read this data to indicate that 4 cores aren't really enough > to see a severe problem with spinlock contention.
Hm, it starts to show if you lower the counter increments per cycle (the D constant in run.sh). But yeah, it's never as bad as the 32-core results above... best regards, Florian Pflug
On Tue, Jun 28, 2011 at 8:11 PM, Florian Pflug <fgp@phlo.org> wrote: >> I wrote a little script to show to reorganize this data in a >> possibly-easier-to-understand format - ordering each column from >> lowest to highest, and showing each algorithm as a multiple of the >> cheapest value for that column: > > If you're OK with that, I'd like to add that to the lockbench > repo. It's a piece of crap, but you're welcome to try to fix it up. See attached. With respect to the apparent lack of impact of the sleep loop, I can only speculate that this test case doesn't really mimic real-world conditions inside the backend very well. There is a lot more going on there - multiple locks, lots of memory access, etc. But I don't really understand it either. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jun 23, 2011 at 11:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote: >> On Jun12, 2011, at 23:39 , Robert Haas wrote: >>> So, the majority (60%) of the excess spinning appears to be due to >>> SInvalReadLock. A good chunk are due to ProcArrayLock (25%). >> >> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32. >> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK, >> so I guess that two adjacent LWLocks currently share one cache line. >> >> Currently, the ProcArrayLock has index 4 while SInvalReadLock has >> index 5, so if I'm not mistaken exactly the two locks where you saw >> the largest contention on are on the same cache line... >> >> Might make sense to try and see if these numbers change if you >> either make LWLockPadded 64bytes or arrange the locks differently... > > I fooled around with this a while back and saw no benefit. It's > possible a more careful test would turn up something, but I think the > only real way forward here is going to be to eliminate some of that > locking altogether.

I did some benchmarking, on the 32-core system from Nate Boley, with pgbench -n -S -c 80 -j 80. With master+fastlock+lazyvxid, I can hit 180-200k TPS in the 32-client range. But at 80 clients throughput starts to fall off quite a bit, dropping down to about 80k TPS. So then, just for giggles, I inserted a "return;" statement at the top of AcceptInvalidationMessages(), turning it into a noop. Performance at 80 clients shot up to 210k TPS. I also tried inserting an acquire-and-release cycle on MyProc->backendLock, which was just as good. That seems like a pretty encouraging result, since there appear to be several ways of reimplementing SIGetDataEntries() that would work with a per-backend lock rather than a global one.

I did some other experiments, too. In the hopes of finding a general way to reduce spinlock contention, I implemented a set of "elimination buffers", where backends that can't get a spinlock go and try to combine their requests and then send a designated representative to acquire the spinlock and acquire shared locks simultaneously for all group members. It took me a while to debug the code, and I still can't get it to do anything other than reduce performance, which may mean that I haven't found all the bugs yet, but I'm not feeling encouraged. Some poking around suggests that the problem isn't that spinlocks are routinely contended - it seems that we nearly always get the spinlock right off the bat.

I'm wondering if the problem may be not so much that we have continuous spinlock contention, but rather that every once in a while a process gets time-sliced out while it holds a spinlock. If it's an important spinlock (like the one protecting SInvalReadLock), the system will quickly evolve into a state where every single processor is doing nothing but trying to get that spinlock. Even after the hapless lock-holder gets to run again and lets go of the lock, you have a whole pile of other backends who are sitting there firing off lock xchgb in a tight loop, and they can only get it one at a time, so you have ferocious cache line contention until the backlog clears. Then things are OK again for a bit until the same thing happens to some other backend. This is just a theory, I might be totally wrong... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jul7, 2011, at 03:35 , Robert Haas wrote: > On Thu, Jun 23, 2011 at 11:42 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp@phlo.org> wrote: >>> On Jun12, 2011, at 23:39 , Robert Haas wrote: >>>> So, the majority (60%) of the excess spinning appears to be due to >>>> SInvalReadLock. A good chunk are due to ProcArrayLock (25%). >>> >>> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32. >>> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK, >>> so I guess that two adjacent LWLocks currently share one cache line. >>> >>> Currently, the ProcArrayLock has index 4 while SInvalReadLock has >>> index 5, so if I'm not mistaken exactly the two locks where you saw >>> the largest contention on are on the same cache line... >>> >>> Might make sense to try and see if these numbers change if you >>> either make LWLockPadded 64bytes or arrange the locks differently... >> >> I fooled around with this a while back and saw no benefit. It's >> possible a more careful test would turn up something, but I think the >> only real way forward here is going to be to eliminate some of that >> locking altogether. > > I also tried inserting an > acquire-and-release cycle on MyProc->backendLock, which was just as > good. That seems like a pretty encouraging result, since there appear > to be several ways of reimplementing SIGetDataEntries() that would > work with a per-backend lock rather than a global one.

I did some research and found "TLRW: Return of the Read-Write Lock" by D. Dice and N. Shavit. The PDF can be found here: http://transact09.cs.washington.edu/19_paper.pdf

They describe a read-write lock called "bytelock" with a set of "slots" for shared lockers, where each shared locker only needs to increment his slot, and thus doesn't interfere with other concurrent shared locking attempts. The price is that, after incrementing its slot, a shared locker must issue a fencing instruction and then verify that no writer has taken the lock in the meanwhile. In their application, they distinguish between "slotted" threads - those who own a designated slot - and "unslotted" threads - those who operate on the normal shared counter.

I implemented that idea, but with a few modifications. First, I don't distinguish between "slotted" and "unslotted" threads, but allow multiple backends to share a slot. This means my implementation cannot use a simple unlocked increment to update the slot, but instead uses "locked xadd". I also moved the slots out of the lock itself, and into a separate set of LWLockPart structures. Each such structure contains the counters for one slot id and multiple locks.

In effect, the resulting thing is an LWLock with a partitioned shared counter. The partition one backend operates on for shared locks is determined by its backend id.

I've added the implementation to the lock benchmarking tool at https://github.com/fgp/lockbench and also pushed a patched version of postgres to https://github.com/fgp/postgres/tree/lwlock_part

The number of shared counter partitions is currently 4, but can easily be adjusted in lwlock.h. The code uses GCC's atomic fetch-and-add intrinsic if available, otherwise it falls back to using a per-partition spin lock.

Benchmarking results look promising so far. This is the throughput I get, in acquisition/release cycles per second, for the LWLock implementations on a 4-core machine. pg_lwlock_part,N,X is the partitioned lock with N lock partitions, using LOCKED XADD if X == yes.
The numbers are normalized to the throughput in the single-worker case.

---------- Cycles / Second (Wall-Time) ---------------------
     1      2      4      8     16     worker
  wall   wall   wall   wall   wall     time
     1    1.9    3.9    3.9    3.9     none
     1    0.2    0.9    0.8    0.6     pg_lwlock_orig
     1    0.4    0.3    0.3    0.3     pg_lwlock_cas
     1    0.3    1.0    0.8    0.6     pg_lwlock_part,1,no
     1    0.4    0.4    0.4    0.4     pg_lwlock_part,1,yes
     1    2.0    0.4    0.4    0.3     pg_lwlock_part,2,no
     1    2.0    0.7    0.8    0.7     pg_lwlock_part,2,yes
     1    2.0    4.1    2.1    1.0     pg_lwlock_part,4,no
     1    2.1    3.8    1.8    2.2     pg_lwlock_part,4,yes

These numbers show that as long as the number of workers does not exceed the number of lock partitions, throughput increases approximately linearly. It drops off afterwards, but I think that's to be expected - the test executes LWLockAcquire/Release in a tight loop, so once there's any kind of contention (even cache-line bouncing), performance drops. This is also why the original LWLock beats CAS and the partitioned lock with fewer than 4 partitions here - effectively, at least half of the workers are sleeping at any given time, as the following user vs. wall time numbers show.

---------- User-Time / Wall-Time ----------------------------
     1      2      4      8     16     worker
   1.0    1.9    3.9    3.9    3.9     none
   1.0    1.9    1.1    1.3    1.8     pg_lwlock_orig
   1.0    2.0    4.0    4.0    4.0     pg_lwlock_cas
   1.0    1.9    1.1    1.4    1.8     pg_lwlock_part,1,no
   1.0    2.0    4.0    3.9    4.0     pg_lwlock_part,1,yes
   1.0    2.0    3.6    3.8    3.6     pg_lwlock_part,2,no
   1.0    2.0    3.8    3.9    3.9     pg_lwlock_part,2,yes
   1.0    2.0    4.1    4.1    3.9     pg_lwlock_part,4,no
   1.0    2.0    4.0    3.9    3.9     pg_lwlock_part,4,yes

I lack the hardware to produce any meaningful benchmark results with pgbench for this, so again it'd be very cool if someone (Robert? ;-) who has access to that kind of hardware could give the patched version from github a spin with pgbench. The patched version is currently set to 4 shared counter partitions, and should use LOCKED XADD if compiled with GCC >= 4.1 (or ICC).

> I'm wondering if the problem may be > not so much that we have continuous spinlock contention, but rather > than every once in a while a process gets time-sliced out while it > holds a spinlock. If it's an important spinlock (like the one > protecting SInvalReadLock), the system will quickly evolve into a > state where every single processor is doing nothing but trying to get > that spinlock. Even after the hapless lock-holder gets to run again > and lets go of the lock, you have a whole pile of other backends who > are sitting there firing of lock xchgb in a tight loop, and they can > only get it one at a time, so you have ferocious cache line contention > until the backlog clears. Then things are OK again for a bit until > the same thing happens to some other backend. This is just a theory, > I might be totally wrong...

Hm, but one would expect most of the spin lock contesters to have given up and gone to sleep by the time the interrupted lock holder resumes. If that is not what happens, then maybe we should revisit the logic in SpinLockAcquire()...

best regards, Florian Pflug
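To make the partitioned-shared-counter idea concrete, here is a heavily simplified standalone sketch. It is not the code from the lwlock_part branch - the partition count, the names, the memory ordering (everything via __sync builtins, hence full barriers) and the spin-wait slow path are all assumptions chosen for brevity. What it tries to show is the essential structure: a shared locker touches only its own partition with a fetch-and-add and then re-checks for a writer, while an exclusive locker claims a writer flag and waits for every partition to drain.

    #include <sched.h>
    #include <stdio.h>

    #define NUM_PARTITIONS 4
    #define CACHE_LINE     64

    typedef struct
    {
        /* one shared-holder counter per partition, each padded to its own
         * cache line so that shared lockers don't fight over a single line */
        struct
        {
            volatile int count;
            char         pad[CACHE_LINE - sizeof(int)];
        } shared[NUM_PARTITIONS];

        volatile int exclusive;     /* 0 = no writer, 1 = writer holds the lock */
    } PartitionedLWLock;

    static void
    lock_shared(PartitionedLWLock *lock, int backend_id)
    {
        int part = backend_id % NUM_PARTITIONS;

        for (;;)
        {
            /* announce ourselves on our own partition only ("lock xadd");
             * the builtin implies a full memory barrier ... */
            __sync_fetch_and_add(&lock->shared[part].count, 1);

            /* ... so this re-check cannot be reordered before the increment */
            if (!lock->exclusive)
                return;

            /* a writer is present: back out and wait (the slow path here is
             * a plain spin, not the real LWLock sleep path) */
            __sync_fetch_and_sub(&lock->shared[part].count, 1);
            while (lock->exclusive)
                sched_yield();
        }
    }

    static void
    unlock_shared(PartitionedLWLock *lock, int backend_id)
    {
        __sync_fetch_and_sub(&lock->shared[backend_id % NUM_PARTITIONS].count, 1);
    }

    static void
    lock_exclusive(PartitionedLWLock *lock)
    {
        int i;

        /* claim the writer flag ... */
        while (!__sync_bool_compare_and_swap(&lock->exclusive, 0, 1))
            sched_yield();

        /* ... then wait until every shared partition has drained */
        for (i = 0; i < NUM_PARTITIONS; i++)
            while (lock->shared[i].count > 0)
                sched_yield();
    }

    static void
    unlock_exclusive(PartitionedLWLock *lock)
    {
        __sync_lock_release(&lock->exclusive);      /* sets exclusive back to 0 */
    }

    int
    main(void)
    {
        static PartitionedLWLock lock;              /* zero-initialized */

        lock_shared(&lock, 7);                      /* two readers on ... */
        lock_shared(&lock, 12);                     /* ... two different partitions */
        unlock_shared(&lock, 12);
        unlock_shared(&lock, 7);

        lock_exclusive(&lock);                      /* writer must scan all partitions */
        unlock_exclusive(&lock);

        puts("ok");
        return 0;
    }

The cost side of the trade-off is plainly visible in lock_exclusive(): an exclusive locker has to touch every partition's cache line, which is exactly the concern raised for locks like ProcArrayLock in the follow-up below.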
On Thu, Jul 7, 2011 at 5:54 AM, Florian Pflug <fgp@phlo.org> wrote: > In effect, the resulting thing is an LWLock with a partitioned shared > counter. The partition one backend operates on for shared locks is > determined by its backend id. > > I've added the implementation to the lock benchmarking tool at > https://github.com/fgp/lockbench > and also pushed a patched version of postgres to > https://github.com/fgp/postgres/tree/lwlock_part > > The number of shared counter partitions is current 4, but can easily > be adjusted in lwlock.h. The code uses GCC's atomic fetch and add > intrinsic if available, otherwise it falls back to using a per-partition > spin lock. I think this is probably a good trade-off for locks that are most frequently taken in shared mode (like SInvalReadLock), but it seems like it could be a very bad trade-off for locks that are frequently taken in both shared and exclusive mode (e.g. ProcArrayLock, BufMappingLocks). I don't want to fiddle with your git repo, but if you attach a patch that applies to the master branch I'll give it a spin if I have time. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jul7, 2011, at 18:09 , Robert Haas wrote: > On Thu, Jul 7, 2011 at 5:54 AM, Florian Pflug <fgp@phlo.org> wrote: >> In effect, the resulting thing is an LWLock with a partitioned shared >> counter. The partition one backend operates on for shared locks is >> determined by its backend id. >> >> I've added the implementation to the lock benchmarking tool at >> https://github.com/fgp/lockbench >> and also pushed a patched version of postgres to >> https://github.com/fgp/postgres/tree/lwlock_part >> >> The number of shared counter partitions is current 4, but can easily >> be adjusted in lwlock.h. The code uses GCC's atomic fetch and add >> intrinsic if available, otherwise it falls back to using a per-partition >> spin lock. > > I think this is probably a good trade-off for locks that are most > frequently taken in shared mode (like SInvalReadLock), but it seems > like it could be a very bad trade-off for locks that are frequently > taken in both shared and exclusive mode (e.g. ProcArrayLock, > BufMappingLocks).

That's definitely a concern. One idea I had to alleviate that is to add a per-partition "used" flag to the LWLock struct, and set that to true (if not already set) before incrementing a partition's shared counter. Exclusive lockers would then only need to inspect those partitions for which the flag is set, and would clear all flags after having acquired the lock. I actually tried to do that initially, but figuring out where and how to safely clear the flag again turned out to be a bit hairy. So I decided to keep it simple first, and wait to see whether that problem manifests itself in practice.

> I don't want to fiddle with your git repo, but if you attach a patch > that applies to the master branch I'll give it a spin if I have time.

Patch attached. Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition spin lock instead of "locked xadd" to increment the shared counters.

best regards, Florian Pflug
Florian Pflug <fgp@phlo.org> writes: > Patch attached. > Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition > spin lock instead of "locked xadd" to increment the shared counters. That's already sufficient reason to reject the patch. Not everyone uses gcc, let alone very recent versions of gcc. regards, tom lane
On Jul8, 2011, at 16:21 , Tom Lane wrote: > Florian Pflug <fgp@phlo.org> writes: >> Patch attached. > >> Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition >> spin lock instead of "locked xadd" to increment the shared counters. > > That's already sufficient reason to reject the patch. Not everyone > uses gcc, let alone very recent versions of gcc.

This is a WIP version meant for testing, not a finished patch! Spending time on making this work on every conceivable compiler before we even know whether or not the approach is worthwhile at all seems ludicrous to me.

A finished version would use inline assembly to avoid the GCC version dependency, and would support as many additional compilers as there are people with access to these compilers who offer to help... But yeah, that will very probably leave some compilers unsupported (in the "fall back to spin lock per partition" sense. Which, if the patch proves worthwhile at all, probably still provides a benefit over the current code).

If that is reason enough to reject the patch, i.e. if the policy is "we don't want it for any if we cannot have it for all", then consider it withdrawn.

best regards, Florian Pflug
On 07/08/2011 04:21 PM, Tom Lane wrote: > Florian Pflug <fgp@phlo.org> writes: >> Patch attached. > >> Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition >> spin lock instead of "locked xadd" to increment the shared counters. > > That's already sufficient reason to reject the patch. Not everyone > uses gcc, let alone very recent versions of gcc.

hmm - 4.1.0 was released in February 2006, which will be much older than even the 5-year support policy we have on pg when 9.2 is released; not sure how much it will matter if we don't support a specific optimization on a gcc that old...

Stefan
On Jul8, 2011, at 22:27 , Stefan Kaltenbrunner wrote: > On 07/08/2011 04:21 PM, Tom Lane wrote: >> Florian Pflug <fgp@phlo.org> writes: >>> Patch attached. >> >>> Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition >>> spin lock instead of "locked xadd" to increment the shared counters. >> >> That's already sufficient reason to reject the patch. Not everyone >> uses gcc, let alone very recent versions of gcc. > > hmm - 4.1.0 was released in february 2006, which will be much older than > even the 5 year support policy we have on pg when 9.2 will be released, > not sure how much it will matter if we don't support as specific > optimization on a gcc that old..

Still, it's not really hard to support older versions, at least on x86 and x86-64. All it takes is some inline assembly. I just don't want to put effort into this until we know whether or not the whole approach is worthwhile.

Should someone want to test this patch, but can't because of the GCC version restriction, please speak up.

best regards, Florian Pflug
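For what it's worth, the inline assembly being talked about here is small. A sketch of an x86/x86-64 fetch-and-add fallback (not taken from the patch; names invented for illustration) could look like the following, and would work with essentially any gcc-compatible compiler regardless of whether the __sync builtins are available:

    #include <stdio.h>

    /* Atomically add 'add' to *ptr and return the previous value.
     * Uses x86 "lock xadd"; the "memory" clobber also acts as a compiler
     * barrier, mirroring the full-barrier semantics of __sync_fetch_and_add. */
    static inline int
    my_atomic_fetch_add(volatile int *ptr, int add)
    {
        __asm__ __volatile__(
            "lock; xaddl %0, %1"
            : "+r" (add), "+m" (*ptr)
            :
            : "memory", "cc");
        return add;                 /* xadd left the old value in the register */
    }

    int
    main(void)
    {
        volatile int shared_count = 0;

        printf("previous = %d\n", my_atomic_fetch_add(&shared_count, 1));   /* 0 */
        printf("previous = %d\n", my_atomic_fetch_add(&shared_count, 1));   /* 1 */
        printf("count    = %d\n", shared_count);                            /* 2 */
        return 0;
    }

Equivalents exist for other compilers as well (e.g. MSVC's _InterlockedExchangeAdd), which is presumably what a finished patch would hide behind a single macro.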
On Jul7, 2011, at 03:35 , Robert Haas wrote: > Some poking around suggests that the problem isn't that > spinlocks are routinely contended - it seems that we nearly always get > the spinlock right off the bat. I'm wondering if the problem may be > not so much that we have continuous spinlock contention, but rather > than every once in a while a process gets time-sliced out while it > holds a spinlock. If it's an important spinlock (like the one > protecting SInvalReadLock), the system will quickly evolve into a > state where every single processor is doing nothing but trying to get > that spinlock. Even after the hapless lock-holder gets to run again > and lets go of the lock, you have a whole pile of other backends who > are sitting there firing of lock xchgb in a tight loop, and they can > only get it one at a time, so you have ferocious cache line contention > until the backlog clears.

Pondering this some more, I came up with another idea, pretty much orthogonal to the shared counter partitioning I posted a patch for. If indeed the problem isn't spin lock contention, but rather losing control of the CPU while holding a spin lock, then why not try to get rid of the spin lock entirely?

On Linux, that's easy - this is exactly what futexes are about. But it occurred to me that kernel support isn't actually needed to do that - futexes can effectively be emulated in user-space using just a semaphore, and without doing a syscall except when there's contention.

The algorithm is quite straightforward, if one assumes a lock-free implementation of a queue (More on that below).

LockAcquire:
  (1) CAS the lock state to increment the reader count or set the exclusive
      bit, in a loop, while the lock looks free. If successful, we're done;
      otherwise we continue with (2).
  (2) Add ourself to the wait queue.
      [ Since we've added ourself to the queue, we *must* now decrement the
        semaphore no matter what, to keep the increment/decrement calls
        balanced. We're careful to maintain that invariant below. ]
  (3) Fence (AKA issue a full memory barrier).
  (4) Re-check if the lock still looks taken. If it does, we decrement the
      semaphore (PGSemaphoreLock), and (upon wake-up) restart at (1).
      Otherwise, continue with (5).
  (5) Remove the first waiter from the queue and increment her semaphore.
      Rinse and repeat until we either removed ourself from the queue or
      the queue is empty.
  (6) Decrement our semaphore.
      [ (6) is necessary in the general case, see the remark below (2). But
        we can of course detect the case where we'd increment our own
        semaphore in (5) only to decrement it again in (6), and skip both
        operations. ]

LockRelease:
  (1) Set the lock state to 0, i.e. release the lock.
  (2) Fence (AKA issue a full memory barrier).
  (3) If the lock still looks free, remove the first waiter from the queue
      and increment her semaphore. Rinse and repeat while the lock looks
      free and the queue is non-empty.
      [ From a correctness POV, we only have to wake up one waiter here,
        and that only if there isn't one who has been woken up but hasn't
        yet retried to take the lock. In reality, the decision if and how
        many waiters to wake up would depend on their desired lock level,
        and some variant of what we currently call releaseOK. ]

The fencing steps (LockAcquire:3) and (LockRelease:2) guarantee that if we block in LockAcquire() a lock holder will see our queue entry and thus will wake us up eventually.
Because we use a semaphore and not, say, a simple signal, we don't have to worry about the precise ordering of block and unblock operations - we just need to ensure they're balanced.

Now to the enqueue() and dequeue() primitives that the algorithm above depends on. There are multiple published algorithms for lock-free queues. Some googling turned up "An optimistic approach to lock-free FIFO queues", E.L. Mozes, N. Shavit, DOI 10.1007/s00446-007-0050-0, and "CAS-Based Lock-Free Algorithm for Shared Deques", M.M. Michael.

The second one looks promising, since it only requires a single CAS to enqueue and dequeue entries in the common case. Thus, it should be just as efficient as our current spin-lock-based queue in the uncontended case (and much better in the contested case, of course).

[ Our case is, in fact, simpler than the generic setting that these algorithms support. We only ever enqueue our *own* proc array entry (so allocation and re-use of queue entries isn't much of an issue), and always *process* the queue after enqueuing something - either directly in LockAcquire or later in LockRelease. We thus don't really need to support concurrent removal of queue entries, but might get away with simply skipping queue processing if we detect that somebody else is in the process of doing so. I think I have an idea for a simpler lock-less queue that fits our needs, but I haven't yet ironed out all the details (as it turns out, removing the very last entry from the queue is the hard case). ]

Thoughts, comments and criticism are all very welcome!

best regards, Florian Pflug
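The fence-and-recheck handshake the proposal relies on can be illustrated without a lock-free queue at all. The standalone sketch below is an illustration, not an implementation of the proposal: the lock is exclusive-only, and the wait queue is shrunk to one wait flag plus one POSIX semaphore per thread. What it shows are the essential invariants - a blocking acquirer publishes its flag before re-checking the lock, a releaser clears the lock before scanning the flags, and semaphore ups and downs stay balanced, so either the acquirer sees the lock become free or the releaser sees the flag and posts, and no wakeup is lost.

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTHREADS 4

    static atomic_int lock_state;               /* 0 = free, 1 = held (exclusive only) */
    static atomic_int wants_wakeup[NTHREADS];   /* crude stand-in for the wait queue */
    static sem_t      wakeup[NTHREADS];

    static void
    lock_acquire(int me)
    {
        int expected = 0;

        while (!atomic_compare_exchange_strong(&lock_state, &expected, 1))
        {
            expected = 0;

            /* (2) publish the fact that we are about to sleep */
            atomic_store(&wants_wakeup[me], 1);

            /* (3)+(4) the seq_cst store/load pair acts as the fence; re-check */
            if (atomic_load(&lock_state) != 0)
            {
                sem_wait(&wakeup[me]);          /* still taken: sleep until posted */
            }
            else if (atomic_exchange(&wants_wakeup[me], 0) == 0)
            {
                /* lock looked free, but a releaser already claimed our flag
                 * and will post us - consume that post to stay balanced */
                sem_wait(&wakeup[me]);
            }
            /* otherwise we cleared our own flag: no post, no wait; just retry */
        }
    }

    static void
    lock_release(void)
    {
        /* (1) release, (2) fence via seq_cst, (3) wake one visible waiter */
        atomic_store(&lock_state, 0);

        for (int i = 0; i < NTHREADS; i++)
            if (atomic_exchange(&wants_wakeup[i], 0) == 1)
            {
                sem_post(&wakeup[i]);
                break;
            }
    }

    static long counter;

    static void *
    worker(void *arg)
    {
        int me = (int) (long) arg;

        for (int i = 0; i < 100000; i++)
        {
            lock_acquire(me);
            counter++;                          /* protected by the lock */
            lock_release();
        }
        return NULL;
    }

    int
    main(void)
    {
        pthread_t threads[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
            sem_init(&wakeup[i], 0, 0);
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&threads[i], NULL, worker, (void *) (long) i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(threads[i], NULL);

        printf("counter = %ld (expected %d)\n", counter, NTHREADS * 100000);
        return 0;
    }

The real proposal would replace the per-thread flag array with the lock-free queue and add the shared/exclusive state handling on top of this skeleton.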
On Jul 12, 2011, at 8:03 AM, Florian Pflug <fgp@phlo.org> wrote: > The algorithm is quite straight forward, if one assumes a lock-free > implementation of a queue (More on that below)

This is similar to the CAS-based LWLocks I played around with, though I didn't use a lock-free queue. I think I probably need to devote some time to figuring out why that didn't work out to a win...

...Robert
On Jul13, 2011, at 00:10 , Robert Haas wrote: > On Jul 12, 2011, at 8:03 AM, Florian Pflug <fgp@phlo.org> wrote: >> The algorithm is quite straight forward, if one assumes a lock-free >> implementation of a queue (More on that below) > > This is similar to the CAS-based LWLocks I played around with, though > I didn't use a lock-free queue. I think I probably need to devote some > time to figuring out why that didn't work out to a win...

Yeah, the non-waitqueue-related parts are mostly identical. The only significant difference there is that the waiters-present bit is replaced by fence-and-recheck.

What motivated me to research this was your theory that the problem is not so much spin-lock contention, but rather those unlucky processes who lose their time-slice while holding the lock. If that is true, then maybe the problem with your CAS patch is that once the LWLock is contested heavily, the waiters-present bit will be set pretty much all the time, and the code will thus fall back to using the spin-lock.

*Reading the code again*

Hm, I just realized that you only clear the waiters-present bit after emptying the queue completely. With rising contention, you might reach a point where you never empty the queue completely (unless the lock is only ever acquired in share mode, that is). As it stands, the CAS patch is effectively neutralized at this point. It'll even have an adverse effect due to the higher number of atomic operations compared to the unpatched version.

I wonder if clearing the waiters-present bit only upon clearing the queue completely is necessary for correctness. Wouldn't it be OK to clear the bit after waking up at least one locker? That'd still guarantee that you cannot end up with a blocked process but no lock holder, I believe.

best regards, Florian Pflug
On Jul 12, 2011, at 8:10 PM, Florian Pflug <fgp@phlo.org> wrote: > On Jul13, 2011, at 00:10 , Robert Haas wrote: >> On Jul 12, 2011, at 8:03 AM, Florian Pflug <fgp@phlo.org> wrote: >>> The algorithm is quite straight forward, if one assumes a lock-free >>> implementation of a queue (More on that below) >> >> This is similar to the CAS-based LWLocks I played around with, though >> I didn't use a lock-free queue. I think I probably need to devote some >> time to figuring out why that didn't work out to a win... > > Yeah, the non-waitqueue-related parts are mostly identical. The only > significant difference there is that the waiters-present bit is replaced > by fence-and-recheck. > > What motivated me to research this was your theory that the problem is > not so much spin-lock contention, but rather those unlucky processes who > lose their time-slice while holding the lock. > > If that is true, then maybe the problem with your CAS patch is that > once the LWLocks is contested heavily, the waiters-present bit will be > set pretty much all the time, and the code will thus fall back to > using the spin-lock. > > *Reading the code again* > > Hm, I just realized that you only clear the waiters-present bit after > emptying the queue completely. With rising contention, you might reach a > point where you never empty the queue completely (unless the lock is > only ever acquired in share mode, that is). As it stands, the CAS patch > is effectively neutralized at this point. It'll even have an adverse > effect due to the higher number of atomic operations compared to the > unpatched version. Hmm, yeah. > I wonder if clearing the waiters-present bit only upon clearing the > queue completely is necessary for correctness. Wouldn't it be OK > to clear the bit after waking up at least one locker? That'd still > guarantee that you cannot end up with a blocked process but no lock > holder, I believe. I don't think so - how does the next process that releases the lock know to release waiters? ...Robert
On Jul13, 2011, at 22:04 , Robert Haas wrote: > On Jul 12, 2011, at 8:10 PM, Florian Pflug <fgp@phlo.org> wrote: >> I wonder if clearing the waiters-present bit only upon clearing the >> queue completely is necessary for correctness. Wouldn't it be OK >> to clear the bit after waking up at least one locker? That'd still >> guarantee that you cannot end up with a blocked process but no lock >> holder, I believe. > > I don't think so - how does the next process that releases the lock > know to release waiters?

It won't :-(. Not that easily, at least. The idea stemmed from the fact that my shared counter partitioning patch gets away with something similar. But that only works because it always re-checks for possible interference after doing the fast path, so the idea isn't directly applicable to your patch, it seems.

Porting that idea to your CAS patch would probably make it nearly identical to the shared-counter partitioning patch with the number of partitions set to 1, so I see little point in that.

BTW, I think I got a workable atomic queue that suits our needs sketched up - I'll post the details once I've convinced myself that it's actually correct. So should you believe that approach to be unworkable for some reason, please speak up and save me the effort.

best regards, Florian Pflug
On Fri, Jul 8, 2011 at 6:02 AM, Florian Pflug <fgp@phlo.org> wrote: >> I don't want to fiddle with your git repo, but if you attach a patch >> that applies to the master branch I'll give it a spin if I have time. > > Patch attached. > > Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition > spin lock instead of "locked xadd" to increment the shared counters. [ Back from vacation, catching up on email. ] gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5) pgbench -n -S -T 180 -c 32 -j 32 with lwlock-part patch: tps = 36974.644091 (including connections establishing) unpatched cd34647c666be867f95ef8fc0492c30356043f10: tps = 39432.064293 (including connections establishing) And with -c 8 -j 8: tps = 26946.202428 (including connections establishing) tps = 27206.507424 (including connections establishing) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jul18, 2011, at 04:36 , Robert Haas wrote: > On Fri, Jul 8, 2011 at 6:02 AM, Florian Pflug <fgp@phlo.org> wrote: >>> I don't want to fiddle with your git repo, but if you attach a patch >>> that applies to the master branch I'll give it a spin if I have time. >> >> Patch attached. >> >> Beware that it needs at least GCC 4.1, otherwise it'll use a per-partition >> spin lock instead of "locked xadd" to increment the shared counters. > > [ Back from vacation, catching up on email. ] > > gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu5) > > pgbench -n -S -T 180 -c 32 -j 32 > > with lwlock-part patch: > tps = 36974.644091 (including connections establishing) > > unpatched cd34647c666be867f95ef8fc0492c30356043f10: > tps = 39432.064293 (including connections establishing) > > And with -c 8 -j 8: > > tps = 26946.202428 (including connections establishing) > tps = 27206.507424 (including connections establishing)

:-( That's disappointing, to say the least. I also completely fail to understand what the heck is going on there.

I mean, you did conclusively prove that commenting out the SInval stuff made a huge difference. There's also supposed to be hardly any invalidation going on during a pgbench -S run. So, since the patch removes two of the three spin-lock acquisitions from SIGetDataEntries() (so long as there are no exclusive lockers of SInvalReadLock), there should be some effect. Or so I'd think, at least...

If anyone has a theory, I'd love to hear it.

best regards, Florian Pflug