Thread: Separating Buffer LWlocks

Separating Buffer LWlocks

From
Simon Riggs
Date:
Looks like the right time to post this patch.

It separates the Buffer LWLocks from the main LWLocks, allowing them to have different padding.

Tests showed a noticeable performance gain due to reduced false sharing on the main LWLocks, without wasting memory on the buffer LWLocks.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment

Re: Separating Buffer LWlocks

From
Andres Freund
Date:
Hi,

On 2015-09-06 14:10:24 +0100, Simon Riggs wrote:
> It separates the Buffer LWLocks from the main LWLocks, allowing them to
> have different padding.
> 
> Tests showed a noticeable performance gain due to reduced false sharing
> on the main LWLocks, without wasting memory on the buffer LWLocks.

Hm. I found that the buffer content lwlocks can actually also be a
significant source of contention - I'm not sure reducing padding for
those is going to be particularly nice. I think we should rather move
the *content* lock inline into the buffer descriptor. The io lock
doesn't matter and can be as small as possible.

Additionally I think we should increase the lwlock padding to 64 bytes
(i.e. by far the most common cacheline size). In the past I've seen
that to be rather beneficial.
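
For illustration, roughly the kind of arrangement meant here - just a
sketch, not an excerpt from any patch:

/*
 * Pad each LWLock in the main array out to a full cache line, so two
 * hot locks can never end up sharing one.
 */
#define LWLOCK_PADDED_SIZE	64	/* by far the most common cacheline size */

typedef union LWLockPadded
{
	LWLock		lock;
	char		pad[LWLOCK_PADDED_SIZE];
} LWLockPadded;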

Greetings,

Andres Freund



Re: Separating Buffer LWlocks

From
Andres Freund
Date:
On 2015-09-06 15:28:40 +0200, Andres Freund wrote:
> Hm. I found that the buffer content lwlocks can actually also be a
> significant source of contention - I'm not sure reducing padding for
> those is going to be particularly nice. I think we should rather move
> the *content* lock inline into the buffer descriptor. The io lock
> doesn't matter and can be as small as possible.

POC patch along those lines attached. This way the content locks have
full 64-byte alignment *without* any additional memory usage, because
buffer descriptors are already padded to 64 bytes.  I had to reorder the
BufferDesc contents a bit and reduce the width of usagecount to 8 bits
(which is fine given that 5 is our highest value) to make enough room.
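
To illustrate the idea (field order and widths here are only a sketch,
not what the attached POC actually does):

/*
 * Sketch: the content lock is embedded in the (already 64-byte padded)
 * buffer descriptor, the usage count is narrowed to 8 bits (its maximum
 * value is 5), and only the rarely CPU-contended IO lock stays external.
 */
typedef struct BufferDesc
{
	BufferTag	tag;			/* ID of page contained in buffer */
	BufFlags	flags;			/* see bit definitions above */
	uint8		usage_count;	/* usage counter for clock sweep code */
	unsigned	refcount;		/* # of backends holding pins on buffer */
	int			wait_backend_pid;	/* backend PID of pin-count waiter */
	slock_t		buf_hdr_lock;	/* protects the above fields */
	int			buf_id;			/* buffer's index number (from 0) */
	int			freeNext;		/* link in freelist chain */
	LWLock	   *io_in_progress_lock;	/* to wait for I/O to complete */
	LWLock		content_lock;	/* embedded: protects buffer contents */
} BufferDesc;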

I've experimented with reducing the padding of the IO locks to nothing, since
they're not that often contended at the CPU level. But even on my laptop
that led to a noticeable regression for a read-only pgbench workload
where the dataset fits into the OS page cache but not into s_b.

> Additionally I think we should increase the lwlock padding to 64 bytes
> (i.e. by far the most common cacheline size). In the past I've seen
> that to be rather beneficial.

You'd already done that...


Benchmarking this on my 4-core/8-thread laptop I see a very slight
performance increase - which is about what we'd expect, since this should
really only affect multi-socket machines.

Greetings,

Andres Freund

Attachment

Re: Separating Buffer LWlocks

From
Robert Haas
Date:
On Mon, Sep 7, 2015 at 1:59 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-09-06 15:28:40 +0200, Andres Freund wrote:
>> Hm. I found that the buffer content lwlocks can actually also be a
>> significant source of contention - I'm not sure reducing padding for
>> those is going to be particularly nice. I think we should rather move
>> the *content* lock inline into the buffer descriptor. The io lock
>> doesn't matter and can be as small as possible.
>
> POC patch along those lines attached. This way the content locks have
> full 64-byte alignment *without* any additional memory usage, because
> buffer descriptors are already padded to 64 bytes.  I had to reorder the
> BufferDesc contents a bit and reduce the width of usagecount to 8 bits
> (which is fine given that 5 is our highest value) to make enough room.
>
> I've experimented with reducing the padding of the IO locks to nothing, since
> they're not that often contended at the CPU level. But even on my laptop
> that led to a noticeable regression for a read-only pgbench workload
> where the dataset fits into the OS page cache but not into s_b.

I like this approach, though I think clearly it needs more performance testing.

The method of determining the tranche IDs is totally awful, though.  I
assume that's just a dirty hack for the POC and not something you'd
seriously consider doing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Separating Buffer LWlocks

From
Andres Freund
Date:
On 2015-09-08 13:29:28 -0400, Robert Haas wrote:
> I like this approach, though I think clearly it needs more performance testing.

Yea, obviously. I did run this on a slightly bigger machine yesterday
and it gave consistent ~8% performance improvements.

> The method of determining the tranche IDs is totally awful, though.  I
> assume that's just a dirty hack for the POC and not something you'd
> seriously consider doing.

If you're referring to assigning fixed ids in the guts of lwlock.c -
yea, that was really more of a quick hack. I think we should put an enum
into lwlock.h with fixed tranche ids, with the final member being
LWTRANCHE_FIRST_DYNAMIC or so.
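
Something along these lines, say (names are just a sketch of the
suggestion, not taken from a patch):

/* in lwlock.h: fixed ids for built-in tranches */
typedef enum BuiltinTrancheIds
{
	LWTRANCHE_MAIN,						/* the main, individually named LWLocks */
	LWTRANCHE_BUFFER_CONTENT,			/* per-buffer content locks */
	LWTRANCHE_BUFFER_IO_IN_PROGRESS,	/* per-buffer IO locks */
	LWTRANCHE_WAL_INSERT,				/* xlog.c's WAL insertion locks */
	LWTRANCHE_FIRST_DYNAMIC				/* first id handed out by LWLockNewTrancheId() */
} BuiltinTrancheIds;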

Greetings,

Andres Freund



Re: Separating Buffer LWlocks

From
Robert Haas
Date:
On Tue, Sep 8, 2015 at 1:54 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-09-08 13:29:28 -0400, Robert Haas wrote:
>> I like this approach, though I think clearly it needs more performance testing.
>
> Yea, obviously. I did run this on a slightly bigger machine yesterday
> and it gave consistent ~8% performance improvements.

Wow, nice.

>> The method of determining the tranche IDs is totally awful, though.  I
>> assume that's just a dirty hack for the POC and not something you'd
>> seriously consider doing.
>
> If you're referring to assigning fixed ids in the guts of lwlock.c -
> yea, that was really more of a quick hack. I think we should put an enum
> into lwlock.h with fixed tranche ids, with the final member being
> LWTRANCHE_FIRST_DYNAMIC or so.

We could do that, but I'm not sure just calling LWLockNewTrancheId()
for all of the tranches would be so bad either.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Separating Buffer LWlocks

From
Andres Freund
Date:
On 2015-09-08 14:15:32 -0400, Robert Haas wrote:
> We could do that, but I'm not sure just calling LWLockNewTrancheId()
> for all of the tranches would be so bad either.

To me that seems either fragile or annoying to use. If all backends call
LWLockNewTrancheId() we need to be sure the callbacks are always going
to be called in the same order. Otherwise everyone needs to store the
tranche in shared memory (like xlog.c now does) which I find to be a
rather annoying requirement.
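
For comparison, the xlog.c-style pattern looks roughly like this
(condensed, with hypothetical names, not a literal excerpt):

#define NUM_MY_LOCKS 16					/* hypothetical */

typedef struct MySharedState			/* hypothetical shared struct */
{
	int			trancheId;
	LWLockPadded locks[NUM_MY_LOCKS];
} MySharedState;

static MySharedState *MyShared;
static LWLockTranche MyTranche;

static void
MyShmemInit(void)
{
	bool		found;
	int			i;

	MyShared = ShmemInitStruct("my locks", sizeof(MySharedState), &found);
	if (!found)
	{
		/* first backend through: allocate the id and initialize the locks */
		MyShared->trancheId = LWLockNewTrancheId();
		for (i = 0; i < NUM_MY_LOCKS; i++)
			LWLockInitialize(&MyShared->locks[i].lock, MyShared->trancheId);
	}

	/* but *every* backend has to repeat the local tranche registration */
	MyTranche.name = "my locks";
	MyTranche.array_base = MyShared->locks;
	MyTranche.array_stride = sizeof(LWLockPadded);
	LWLockRegisterTranche(MyShared->trancheId, &MyTranche);
}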

Greetings,

Andres Freund



Re: Separating Buffer LWlocks

From
Robert Haas
Date:
On Tue, Sep 8, 2015 at 2:19 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-09-08 14:15:32 -0400, Robert Haas wrote:
>> We could do that, but I'm not sure just calling LWLockNewTrancheId()
>> for all of the tranches would be so bad either.
>
> To me that seems either fragile or annoying to use. If all backends call
> LWLockNewTrancheId() we need to be sure the callbacks are always going
> to be called in the same order.

How is that going to work?  The counter is in shared memory.

> Otherwise everyone needs to store the
> tranche in shared memory (like xlog.c now does) which I find to be a
> rather annoying requirement.

Yes, everyone would need to do that.  If that's too annoying to live
with, then we can adopt your suggestion.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Separating Buffer LWlocks

From
Andres Freund
Date:
On 2015-09-08 14:31:53 -0400, Robert Haas wrote:
> On Tue, Sep 8, 2015 at 2:19 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2015-09-08 14:15:32 -0400, Robert Haas wrote:
> >> We could do that, but I'm not sure just calling LWLockNewTrancheId()
> >> for all of the tranches would be so bad either.
> >
> > To me that seems either fragile or annoying to use. If all backends call
> > LWLockNewTrancheId() we need to be sure the callbacks are always going
> > to be called in the same order.
> 
> How is that going to work?  The counter is in shared memory.

Oh, right. It's just the tranche definitions themselves that live in
backend private memory.

> > Otherwise everyone needs to store the
> > tranche in shared memory (like xlog.c now does) which I find to be a
> > rather annoying requirement.
> 
> Yes, everyone would need to do that.  If that's too annoying to live
> with, then we can adopt your suggestion.

I think having to add a separate ShmemInitStruct() just to be able to
store the tranche id falls under too annoying.

Greetings,

Andres Freund