Thread: LWLock cache line alignment

LWLock cache line alignment

From: "Simon Riggs"
Following some advice from Intel,
http://www.intel.com/cd/ids/developer/asmo-na/eng/technologies/threading/20469.htm?page=2
I'm looking at whether the LWLock data structures may be within the same
cache line.

Intel uses 128-byte cache lines on its high-end processors.

slru.c uses BUFFERALIGN, whose alignment is currently hardcoded in
pg_config_manual.h as
#define ALIGNOF_BUFFER    32
which seems too small for these Intel CPUs, and possibly others.
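
For reference, here is a minimal sketch of how that kind of power-of-two
round-up alignment works; the macro shapes below are an approximation of
what lives in pg_config_manual.h and c.h, not the exact source:

#include <stdint.h>
#include <stdio.h>

/* Illustrative approximation of the round-up alignment macros. */
#define ALIGNOF_BUFFER 32       /* would need to be 128 to match the larger Intel lines */
#define TYPEALIGN(alignval, len) \
    (((uintptr_t) (len) + ((alignval) - 1)) & ~((uintptr_t) ((alignval) - 1)))
#define BUFFERALIGN(len)  TYPEALIGN(ALIGNOF_BUFFER, (len))

int
main(void)
{
    /* With 32-byte alignment a 100-byte request rounds up to 128 bytes;
     * with 128-byte alignment a 130-byte request would round up to 256. */
    printf("BUFFERALIGN(100) = %lu\n", (unsigned long) BUFFERALIGN(100));
    return 0;
}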

In slru.c we have this code fragment:
    /* Release shared lock, grab per-buffer lock instead */
    LWLockRelease(shared->ControlLock);
    LWLockAcquire(shared->buffer_locks[slotno], LW_EXCLUSIVE);

The purpose of this is to reduce contention by holding finer-grained
locks. ISTM what this does is drop one lock, then take another lock by
accessing an array (buffer_locks) that will be in the same cache line
for all locks, and then access the LWLock data structure itself, which
again will be entirely within the same cache line. ISTM that we have
fine-grained LWLocks, but not fine-grained cache lines. That means all of
the Clog and Subtrans locks would be affected, since we have 8 of each.
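
To put rough numbers on that (the sizes are assumptions, since
sizeof(LWLock) varies by platform): with a 24-byte LWLock and 128-byte
cache lines, the eight consecutively allocated clog buffer locks span only
two lines, and six of them share the first one:

#include <stdio.h>

/* Hypothetical sizes, for illustration only. */
#define LWLOCK_SIZE      24     /* assumed size of an LWLock struct */
#define CACHE_LINE_SIZE  128    /* high-end Intel cache line        */
#define NUM_SLRU_BUFFERS 8      /* clog / subtrans buffer count     */

int
main(void)
{
    int     i;

    /* Show which cache line each consecutively allocated lock falls in. */
    for (i = 0; i < NUM_SLRU_BUFFERS; i++)
        printf("buffer lock %d -> byte offset %3d -> cache line %d\n",
               i, i * LWLOCK_SIZE, (i * LWLOCK_SIZE) / CACHE_LINE_SIZE);
    return 0;
}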

The same thing applies to the other global LWLocks, so BufMgrLock and many
other locks are effectively all the same from the cache's perspective.

...and BTW, what is MMCacheLock?? Is that an attempt at padding already?

It looks like padding out the LWLock struct would ensure that each of those
ends up in a separate cache line?
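
As a minimal sketch of what that padding might look like, assuming a
128-byte line and using a union so the padding does not disturb the field
layout (the inner fields only roughly mirror the real lwlock.c struct):

/* Assumed stand-ins for the real slock_t and PGPROC types. */
typedef unsigned char slock_t;
struct PGPROC;

#define CACHE_LINE_SIZE 128

/* Pad each lock out to a full cache line so adjacent locks never share one. */
typedef union PaddedLWLock
{
    struct
    {
        slock_t         mutex;      /* protects the fields below */
        char            exclusive;  /* # of exclusive holders (0 or 1) */
        int             shared;     /* # of shared holders */
        struct PGPROC  *head;       /* head of waiter queue */
        struct PGPROC  *tail;       /* tail of waiter queue */
    }               lock;
    char            pad[CACHE_LINE_SIZE];
} PaddedLWLock;

/* Compile-time check that each padded lock occupies exactly one line. */
typedef char padded_lwlock_size_check[sizeof(PaddedLWLock) == CACHE_LINE_SIZE ? 1 : -1];

The obvious cost is the one raised below: with two LWLocks per buffer,
padding every lock to 128 bytes multiplies the shared-memory footprint
considerably.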

Any views?

Best Regards, Simon Riggs



Re: LWLock cache line alignment

From: Tom Lane
"Simon Riggs" <simon@2ndquadrant.com> writes:
> It looks like padding out the LWLock struct would ensure that each of those
> ends up in a separate cache line?

I've looked at this before and I think it's a nonstarter; increasing the
size of a spinlock to 128 bytes is just not reasonable.  (Remember there
are two per buffer.)  Also, there's no evidence it would actually help
anything, because the contention we have been able to measure is on only
one particular lock (BufMgrLock) anyway.  But feel free to try it to see
if you can see a difference.
        regards, tom lane


Re: LWLock cache line alignment

From: Neil Conway
Simon Riggs wrote:
> ...and BTW, what is MMCacheLock?? Is that an attempt at padding already?

One would hope not, as it would be a totally braindead attempt :) It 
appears to have been formerly used by smgr/mm.c; when that was removed, 
the MMCacheLock should have been removed but was not. Barring any 
objections, I'll remove it from HEAD tomorrow.

> It looks like padding out the LWLock struct would ensure that each of those
> ends up in a separate cache line?

Sounds like it might be worth trying; since it would be trivial to 
implement for testing purposes, I'd be curious to see if this improves 
performance, and/or has any effect on the CS storm issue.

-Neil


Re: LWLock cache line alignment

From: "Simon Riggs"
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us] wrote
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> > It looks like padding out the LWLock struct would ensure that each of
> > those ends up in a separate cache line?
>
> I've looked at this before and I think it's a nonstarter; increasing the
> size of a spinlock to 128 bytes is just not reasonable.  (Remember there
> are two per buffer.)

Well, the performance is unreasonably poor, so it's time to do something;
if this is unreasonable for the general case, it might need to be
port-specific.

> Also, there's no evidence it would actually help anything, because the
> contention we have been able to measure is on only one particular lock
> (BufMgrLock) anyway.  But feel free to try it to see if you can see a
> difference.

Well, the Weird Context Switching issue isn't normal contention, which I
agree is stacked up against BufMgrLock.

Overlapping cache lines don't cause measurable user-space contention,
just poor performance when the cache-spoil-and-refetch delay is incurred
many more times than the minimum (ideal).

I'm thinking that the 128-byte cache line on Intel is sufficiently
larger than the 64-byte cache line on AMD to tip us into different
behaviour at runtime.

Best Regards, Simon Riggs
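
If hard-coding either value is unattractive, one option is to ask the
platform at run time; on Linux/glibc, for example, the L1 data cache line
size is exposed through sysconf (a non-portable sketch; other platforms
would need their own probe, or a configure-time default such as 128):

#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    /* _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, not POSIX. */
    long    line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

    if (line > 0)
        printf("L1 data cache line size: %ld bytes\n", line);
    else
        printf("cache line size unknown; fall back to a padded default\n");
    return 0;
}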



Re: LWLock cache line alignment

From: Tom Lane
"Simon Riggs" <simon@2ndquadrant.com> writes:
>> From: Tom Lane [mailto:tgl@sss.pgh.pa.us] wrote
>> I've looked at this before and I think it's a nonstarter; increasing the
>> size of a spinlock to 128 bytes is just not reasonable.

> Well, the performance is unreasonably poor, so it's time to do something;
> if this is unreasonable for the general case, it might need to be
> port-specific.

Well, it might be worth allocating a full 128 bytes just for the fixed
LWLocks (BufMgrLock and friends) and skimping on the per-buffer locks,
which should be seeing far less contention than the fixed locks anyway.
But first let's see some evidence that this actually helps?
        regards, tom lane
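
A rough sketch of that compromise, with hypothetical names and counts: give
each of the small set of fixed locks its own 128-byte line, while the
per-buffer locks stay densely packed:

#include <stdlib.h>
#include <string.h>

#define CACHE_LINE_SIZE   128
#define NUM_FIXED_LWLOCKS 16        /* hypothetical count: BufMgrLock and friends */

typedef struct LWLock
{
    int     dummy;                  /* stand-in body; the real struct lives in lwlock.c */
} LWLock;

static char    *fixed_lock_area;    /* one full cache line per fixed lock */
static LWLock  *buffer_locks;       /* per-buffer locks stay densely packed */

/* Return the i-th fixed lock; each one sits on its own cache line. */
LWLock *
GetFixedLWLock(int i)
{
    return (LWLock *) (fixed_lock_area + (size_t) i * CACHE_LINE_SIZE);
}

/* Carve out both areas; posix_memalign stands in for shared-memory setup. */
void
InitSketchLocks(int nbuffers)
{
    if (posix_memalign((void **) &fixed_lock_area, CACHE_LINE_SIZE,
                       (size_t) NUM_FIXED_LWLOCKS * CACHE_LINE_SIZE) != 0)
        abort();
    memset(fixed_lock_area, 0, (size_t) NUM_FIXED_LWLOCKS * CACHE_LINE_SIZE);

    buffer_locks = calloc((size_t) nbuffers, sizeof(LWLock));
    if (buffer_locks == NULL)
        abort();
}

int
main(void)
{
    InitSketchLocks(1024);          /* e.g. 1024 shared buffers */
    return GetFixedLWLock(3) != NULL ? 0 : 1;
}

This keeps the extra space proportional to the handful of fixed locks
rather than to shared_buffers.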


Re: LWLock cache line alignment

From: "Simon Riggs"
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us] wrote
> "Simon Riggs" <simon@2ndquadrant.com> writes:
> >> From: Tom Lane [mailto:tgl@sss.pgh.pa.us] wrote
> >> I've looked at this before and I think it's a nonstarter; increasing
> >> the size of a spinlock to 128 bytes is just not reasonable.
>
> > Well, the performance is unreasonably poor, so it's time to do
> > something; if this is unreasonable for the general case, it might need
> > to be port-specific.
>

> But first let's see some evidence that this actually helps?

Yes, that's what we're planning. Currently we're just brainstorming ideas
for prototyping, rather than suggesting definitive things to go into the
codebase.

There are also some other suggestions coming from our Unisys
friends: some additional Intel tweaks to avoid processor stalls and the
like. Again, I'll show the performance figures first.

> Well, it might be worth allocating a full 128 bytes just for the fixed
> LWLocks (BufMgrLock and friends) and skimping on the per-buffer locks,
> which should be seeing far less contention than the fixed locks anyway.

Yes, that seems like a likely way forward. Right now we're going to pad
the whole structure, to save prototyping time and because we can on a
test box. Working on this now.

My concern about the per-buffer locks is with the CLog and Subtrans
buffer pools. Because we have 8 buffers for each of those, it looks like
their locks will all be in a single cache line. That means fine-grained
locking is ineffective for those caches. With 1000s of shared_buffers,
there's less of a problem with cache line contention. Possibly Ken's
suggestion of pseudo-randomising the allocation of locks in LWLockAcquire
would reduce the effect on those smaller buffer pools.

Best Regards, Simon Riggs



Re: LWLock cache line alignment

From: Kenneth Marshall
On Thu, Feb 03, 2005 at 06:26:16AM -0800, Simon Riggs wrote:
> > From: Tom Lane [mailto:tgl@sss.pgh.pa.us] wrote
> > "Simon Riggs" <simon@2ndquadrant.com> writes:
> > > It looks like padding out the LWLock struct would ensure that each of
> > > those ends up in a separate cache line?
> >
> > I've looked at this before and I think it's a nonstarter; increasing the
> > size of a spinlock to 128 bytes is just not reasonable.  (Remember there
> > are two per buffer.)
> 
> Well, the performance is unreasonably poor, so it's time to do something;
> if this is unreasonable for the general case, it might need to be
> port-specific.
> 
> > Also, there's no evidence it would actually help anything, because the
> > contention we have been able to measure is on only one particular lock
> > (BufMgrLock) anyway.  But feel free to try it to see if you can see a
> > difference.
> 
> Well, the Weird Context Switching issue isn't normal contention, which I
> agree is stacked up against BufMgrLock.
> 
> Overlapping cache lines don't cause measurable user-space contention,
> just poor performance when the cache-spoil-and-refetch delay is incurred
> many more times than the minimum (ideal).
> 
> I'm thinking that the 128-byte cache line on Intel is sufficiently
> larger than the 64-byte cache line on AMD to tip us into different
> behaviour at runtime.
> 

Would it be possible to "randomize" the placement of the LWLock
structs to reduce lock correlation in the buffers? I.e., ensure that
buffer locks are allocated at least 1 cache line apart. This might reduce
the cache-coherency traffic while still maintaining a compact data
structure. Maybe a simple mod X offset?

Ken
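
One way to read that suggestion, as a hedged sketch with assumed names and
sizes: keep the global LWLock array packed, but hand a small hot group
(such as the eight clog buffer locks) slot numbers with a stride of at
least one cache line's worth of locks, so no two of them share a line:

#include <stdio.h>

#define CACHE_LINE_SIZE  128
#define LWLOCK_SIZE      24         /* assumed size of an LWLock struct */
#define LOCKS_PER_LINE   ((CACHE_LINE_SIZE + LWLOCK_SIZE - 1) / LWLOCK_SIZE)
#define TOTAL_LWLOCKS    1024       /* assumed size of the global lock array */

/*
 * Hand out the k-th lock of a small hot group (e.g. the eight clog buffer
 * locks) with a stride of at least one cache line, instead of using eight
 * consecutive slots: a simple "mod X offset".
 */
int
AssignSpreadLockId(int base, int k)
{
    return (base + k * LOCKS_PER_LINE) % TOTAL_LWLOCKS;
}

int
main(void)
{
    int     k;

    for (k = 0; k < 8; k++)
    {
        int     id = AssignSpreadLockId(100, k);    /* 100: arbitrary starting slot */

        printf("clog buffer lock %d -> LWLock id %d -> cache line %d\n",
               k, id, id * LWLOCK_SIZE / CACHE_LINE_SIZE);
    }
    return 0;
}

The global array stays compact; only the assignment of slot numbers to the
hot locks changes, at the cost of having to avoid handing the intervening
slots to other equally hot locks.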