Thread: LWLock cache line alignment
Following some advice from Intel,
http://www.intel.com/cd/ids/developer/asmo-na/eng/technologies/threading/20469.htm?page=2
I'm looking at whether the LWLock data structures may be within the same
cache line. Intel uses 128 byte cache lines on its high end processors.

slru.c uses BUFFERALIGN, which is currently hardcoded in pg_config_manual.h as

        #define ALIGNOF_BUFFER  32

which seems to be the wrong setting for the Intel CPUs, and possibly others.

In slru.c we have this code fragment:

        /* Release shared lock, grab per-buffer lock instead */
        LWLockRelease(shared->ControlLock);
        LWLockAcquire(shared->buffer_locks[slotno], LW_EXCLUSIVE);

The purpose of this is to reduce contention by holding finer grained locks.
ISTM what this does is drop one lock, then take another lock by accessing an
array (buffer_locks) which will be in the same cache line for all locks, then
access the LWLock data structure, which again will be all within the same
cache line. ISTM that we have fine grained LWLocks, but not fine grained
cache lines.

That means that all Clog and Subtrans locks would be affected, since we have
8 of each. For other global LWLocks, the same thing applies, so BufMgrLock
and many other locks are effectively all the same from the cache's
perspective.

...and BTW, what is MMCacheLock?? Is that an attempt at padding already?

It looks like padding out the LWLock struct would ensure that each of those
were in separate cache lines?

Any views?

Best Regards, Simon Riggs
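A minimal sketch of the padding idea being proposed here. The field layout is
paraphrased from src/backend/storage/lmgr/lwlock.c; the 128-byte figure, the
LWLockPadded name and the simplified typedefs are assumptions for
illustration only, not existing code:

        /*
         * Sketch of a padded LWLock.  Fields paraphrased from lwlock.c;
         * the typedefs below are simplified stand-ins for the real
         * definitions in s_lock.h and storage/proc.h.
         */
        typedef unsigned char slock_t;      /* stand-in for port-specific type */
        typedef struct PGPROC PGPROC;       /* opaque here */

        #define LWLOCK_PADDED_SIZE  128     /* assumed Intel cache line size */

        typedef struct LWLock
        {
            slock_t     mutex;              /* protects the fields below */
            char        exclusive;          /* # of exclusive holders (0 or 1) */
            int         shared;             /* # of shared holders */
            PGPROC     *head;               /* head of list of waiting backends */
            PGPROC     *tail;               /* tail of list of waiting backends */
        } LWLock;

        /* Pad each lock to a full cache line so no two locks share one. */
        typedef union LWLockPadded
        {
            LWLock      lock;
            char        pad[LWLOCK_PADDED_SIZE];
        } LWLockPadded;

The shared array would then be allocated as LWLockPadded[NumLWLocks] on a
128-byte boundary, instead of the current tightly packed LWLock[NumLWLocks].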
"Simon Riggs" <simon@2ndquadrant.com> writes: > It looks like padding out LWLock struct would ensure that each of those > were in separate cache lines? I've looked at this before and I think it's a nonstarter; increasing the size of a spinlock to 128 bytes is just not reasonable. (Remember there are two per buffer.) Also, there's no evidence it would actually help anything, because the contention we have been able to measure is on only one particular lock (BufMgrLock) anyway. But feel free to try it to see if you can see a difference. regards, tom lane
Simon Riggs wrote:
> ...and BTW, what is MMCacheLock?? Is that an attempt at padding already?

One would hope not, as it would be a totally braindead attempt :) It appears
to have been formerly used by smgr/mm.c; when that was removed, the
MMCacheLock should have been removed but was not. Barring any objections,
I'll remove it from HEAD tomorrow.

> It looks like padding out the LWLock struct would ensure that each of those
> were in separate cache lines?

Sounds like it might be worth trying; since it would be trivial to implement
for testing purposes, I'd be curious to see if this improves performance,
and/or has any effect on the CS storm issue.

-Neil
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > It looks like padding out the LWLock struct would ensure that each of
> > those were in separate cache lines?
>
> I've looked at this before and I think it's a nonstarter; increasing the
> size of a spinlock to 128 bytes is just not reasonable. (Remember there
> are two per buffer.)

Well, the performance is unreasonably poor, so it's time to do something,
which, if it is unreasonable for the general case, might need to be port
specific.

> Also, there's no evidence it would actually help anything, because the
> contention we have been able to measure is on only one particular lock
> (BufMgrLock) anyway. But feel free to try it to see if you can see a
> difference.

Well, the Weird Context Switching issue isn't normal contention, which I
agree is stacked up against BufMgrLock.

Overlapping cache lines don't cause measurable user space contention, just
poor performance when the delay for cache-spoil and refetch is required many
times more than the minimum (ideal).

I'm thinking that the 128 byte cache line on Intel is sufficiently larger
than the 64 byte cache line on AMD to tip us into different behaviour at
runtime.

Best Regards, Simon Riggs
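If the padding does turn out to help only on some ports, one way to make it
port specific is a per-architecture constant. A sketch, where the macro name
and the values (128 bytes per the Intel note cited upthread, 64 bytes per the
AMD figure above) are assumptions, not existing pg_config_manual.h settings:

        /*
         * Sketch: choose the padding granularity per architecture instead
         * of the single hardcoded ALIGNOF_BUFFER.  Name and values are
         * assumptions for discussion only.
         */
        #if defined(__i386__) || defined(_M_IX86)
        #define PG_CACHE_LINE_SIZE  128     /* Intel P4/Xeon, per Intel note */
        #elif defined(__x86_64__) || defined(__amd64__)
        #define PG_CACHE_LINE_SIZE  64      /* AMD Opteron/Athlon 64 */
        #else
        #define PG_CACHE_LINE_SIZE  64      /* conservative default */
        #endif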
"Simon Riggs" <simon@2ndquadrant.com> writes: >> From: Tom Lane [mailto:tgl@sss.pgh.pa.us] wrote >> I've looked at this before and I think it's a nonstarter; >> increasing the >> size of a spinlock to 128 bytes is just not reasonable. > Well, the performance is unreasonably poor, so its time to do something, > which might if it is unreasonable for the general case would need to be > port specific. Well, it might be worth allocating a full 128 bytes just for the fixed LWLocks (BufMgrLock and friends) and skimping on the per-buffer locks, which should be seeing far less contention than the fixed locks anyway. But first lets see some evidence that this actually helps? regards, tom lane
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> >> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> >> I've looked at this before and I think it's a nonstarter; increasing
> >> the size of a spinlock to 128 bytes is just not reasonable.
> >
> > Well, the performance is unreasonably poor, so it's time to do
> > something, which, if it is unreasonable for the general case, might
> > need to be port specific.
>
> But first let's see some evidence that this actually helps?

Yes, that's what we're planning. Currently we're just brainstorming ideas
for prototyping, rather than suggesting definitive things to go into the
codebase. There are some other suggestions there also coming from our Unisys
friends: some additional Intel tweaks to avoid processor stalls and the like.
Again... I'll show the performance figures first.

> Well, it might be worth allocating a full 128 bytes just for the fixed
> LWLocks (BufMgrLock and friends) and skimping on the per-buffer locks,
> which should be seeing far less contention than the fixed locks anyway.

Yes, that seems like a likely future way. Right now we're going to pad the
whole structure, to save prototyping time and because we can on a test box.
Working on this now.

My concern about the per-buffer locks is with the CLog and Subtrans buffer
pools. Because we have 8 buffers for each of those, it looks like they'll
all be in a single cache line. That means fine grained locking is
ineffective for those caches. With 1000s of shared_buffers, there's less of
a problem with cache line contention.

Possibly Ken's suggestion of pseudo-randomising the allocation of locks in
LWLockAcquire would reduce the effect on those smaller buffer pools.

Best Regards, Simon Riggs
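Rough numbers behind that concern, assuming an unpadded LWLock of about 32
bytes: the 8 consecutive locks of a clog or subtrans pool occupy only about
256 bytes, i.e. at most two 128-byte lines, and the buffer_locks[] array of 8
LWLockId values (8 x 4 bytes = 32 bytes) fits entirely within a single line.
So every "fine grained" acquisition in those pools touches essentially the
same couple of cache lines, whereas the thousands of shared_buffers locks are
naturally spread over many lines.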
On Thu, Feb 03, 2005 at 06:26:16AM -0800, Simon Riggs wrote:
> > From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> > "Simon Riggs" <simon@2ndquadrant.com> writes:
> > > It looks like padding out the LWLock struct would ensure that each of
> > > those were in separate cache lines?
> >
> > I've looked at this before and I think it's a nonstarter; increasing the
> > size of a spinlock to 128 bytes is just not reasonable. (Remember there
> > are two per buffer.)
>
> Well, the performance is unreasonably poor, so it's time to do something,
> which, if it is unreasonable for the general case, might need to be port
> specific.
>
> > Also, there's no evidence it would actually help anything, because the
> > contention we have been able to measure is on only one particular lock
> > (BufMgrLock) anyway. But feel free to try it to see if you can see a
> > difference.
>
> Well, the Weird Context Switching issue isn't normal contention, which I
> agree is stacked up against BufMgrLock.
>
> Overlapping cache lines don't cause measurable user space contention, just
> poor performance when the delay for cache-spoil and refetch is required
> many times more than the minimum (ideal).
>
> I'm thinking that the 128 byte cache line on Intel is sufficiently larger
> than the 64 byte cache line on AMD to tip us into different behaviour at
> runtime.

Would it be possible to "randomize" the placement of the LWLock structs to
reduce lock correlation in the buffers, i.e. ensure that buffer locks are
allocated at least one cache line apart? This might reduce cache coherency
traffic while still maintaining a reasonably compact data structure. Maybe a
simple mod X offset?

Ken
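A minimal sketch of the "mod X offset" idea, assuming the per-buffer locks
live in one contiguous, cache-line-aligned LWLock array as today; the
constants, the function name, and the ~32-byte unpadded LWLock size are
assumptions for illustration:

        /*
         * Sketch: instead of handing out consecutive slots of the LWLock
         * array to consecutive buffers, stride through the array so that
         * logically adjacent locks land in different cache lines.
         */
        #include <stddef.h>

        #define CACHE_LINE_SIZE   128
        #define LWLOCK_SIZE       32                          /* assumed unpadded size */
        #define LOCKS_PER_LINE    (CACHE_LINE_SIZE / LWLOCK_SIZE)   /* 4 */

        /*
         * Map the i-th buffer lock of an N-lock pool onto an array slot so
         * that locks i and i+1 fall in different cache lines (N should be a
         * multiple of LOCKS_PER_LINE for the mapping to be a permutation).
         */
        static size_t
        strided_lock_slot(size_t i, size_t nlocks)
        {
            size_t lines = nlocks / LOCKS_PER_LINE;   /* cache lines in the pool */

            return (i % lines) * LOCKS_PER_LINE + (i / lines);
        }

Note that with the 8-lock clog/subtrans pools and roughly 4 unpadded locks
per 128-byte line this only spreads the locks over two distinct lines, so the
small pools would probably still want padding; for the much larger
shared_buffers pool the spread is far better.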