Thread: slow startup due to LWLockAssign() spinlock

slow startup due to LWLockAssign() spinlock

From: Andres Freund

Hi,

On larger, multi-socket machines, startup takes a fair bit of time. As
I was profiling anyway I looked into it and noticed that just about all
of it is spent in LWLockAssign() called by InitBufferPool(). Starting
with shared_buffers=48GB on the server Nate Boley provided takes about
12 seconds, nearly all of it spent taking the ShmemLock spinlock.
Simply modifying LWLockAssign() to not take the spinlock when
!IsUnderPostmaster speeds it up to 2 seconds. While certainly not making
LWLockAssign() prettier, it seems enough of a speedup to be worthwhile
nonetheless.
Since this code is also hit when we do an emergency restart, I'd say it has
practical relevance...
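
A minimal sketch of the shortcut described above, not the attached patch. The counter, the spinlock helpers, and every `toy_`/lowercase name are hypothetical stand-ins for ShmemLock, IsUnderPostmaster, and the real lock counter; only the shape of the branch is taken from the description.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_MAX_LOCKS (1 << 20)

/* Hypothetical stand-ins; none of these are PostgreSQL's real definitions. */
static atomic_flag shmem_lock = ATOMIC_FLAG_INIT;  /* plays the role of ShmemLock */
static bool is_under_postmaster = false;           /* plays the role of IsUnderPostmaster */
static int  next_lock_id = 0;                      /* shared allocation counter */

static void spin_acquire(atomic_flag *lock)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
        ;                       /* busy-wait until the flag was clear */
}

static void spin_release(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Hand out the next lock id, taking the spinlock only in child processes. */
static int toy_lwlock_assign(void)
{
    int id;

    if (!is_under_postmaster)
    {
        /*
         * We are the postmaster itself (or a standalone backend), so this
         * sketch assumes nobody else can touch the counter concurrently.
         */
        assert(next_lock_id < TOY_MAX_LOCKS);
        return next_lock_id++;
    }

    spin_acquire(&shmem_lock);
    assert(next_lock_id < TOY_MAX_LOCKS);
    id = next_lock_id++;
    spin_release(&shmem_lock);
    return id;
}
```

The unlocked branch is only safe as long as the assumption in its comment holds, which is exactly the property the shortcut relies on.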

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: slow startup due to LWLockAssign() spinlock

From: Tom Lane

Andres Freund <andres@2ndquadrant.com> writes:
> On larger, multi-socket, machines, startup takes a fair bit of time. As
> I was profiling anyway I looked into it and noticed that just about all
> of it is spent in LWLockAssign() called by InitBufferPool(). Starting
> with shared_buffers=48GB on the server Nate Boley provided, takes about
> 12 seconds. Nearly all of it spent taking the ShmemLock spinlock.
> Simply modifying LWLockAssign() to not take the spinlock when
> !IsUnderPostmaster speeds it up to 2 seconds. While certainly not making
> LWLockAssign() prettier it seems enough of a speedup to be worthwhile
> nonetheless.

Hm.  This patch only works if the postmaster itself never assigns any
LWLocks except during startup.  That's *probably* all right, but it
seems a bit scary.  Is there any cheap way to make the logic actually
be what your comment claims, namely "Interlocking is not necessary during
postmaster startup"?  I guess we could invent a ShmemInitInProgress global
flag ...
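
A rough sketch of what such a flag could look like, under stated assumptions: `shmem_init_in_progress`, `toy_create_shared_memory()`, and the counting "lock" are hypothetical, chosen only to show that the flag ties the unlocked path to the initialization window rather than to which process is running.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag, set only while shared memory is being initialized. */
static bool shmem_init_in_progress = false;

static int lock_acquisitions = 0;   /* counts how often the "spinlock" was taken */
static int next_lock_id = 0;

static void toy_lock(void)   { lock_acquisitions++; }   /* stand-in for SpinLockAcquire */
static void toy_unlock(void) { }                         /* stand-in for SpinLockRelease */

static int toy_lwlock_assign(void)
{
    int id;

    /* Interlocking is skipped only inside the initialization window. */
    if (shmem_init_in_progress)
        return next_lock_id++;

    toy_lock();
    id = next_lock_id++;
    toy_unlock();
    return id;
}

/* Stand-in for the startup code that creates all shared structures. */
static void toy_create_shared_memory(int nlocks)
{
    shmem_init_in_progress = true;
    for (int i = 0; i < nlocks; i++)
        (void) toy_lwlock_assign();
    shmem_init_in_progress = false;
}
```

Any assignment made after `toy_create_shared_memory()` returns goes through the locked path again, whichever process makes it.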
        regards, tom lane



Re: slow startup due to LWLockAssign() spinlock

From: Andres Freund

On 2014-02-03 11:22:45 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On larger, multi-socket, machines, startup takes a fair bit of time. As
> > I was profiling anyway I looked into it and noticed that just about all
> > of it is spent in LWLockAssign() called by InitBufferPool(). Starting
> > with shared_buffers=48GB on the server Nate Boley provided, takes about
> > 12 seconds. Nearly all of it spent taking the ShmemLock spinlock.
> > Simply modifying LWLockAssign() to not take the spinlock when
> > !IsUnderPostmaster speeds it up to 2 seconds. While certainly not making
> > LWLockAssign() prettier it seems enough of a speedup to be worthwhile
> > nonetheless.
> 
> Hm.  This patch only works if the postmaster itself never assigns any
> LWLocks except during startup.  That's *probably* all right, but it
> seems a bit scary.  Is there any cheap way to make the logic actually
> be what your comment claims, namely "Interlocking is not necessary during
> postmaster startup"?  I guess we could invent a ShmemInitInProgress global
> flag ...

I'd be fine with inventing such a flag. I couldn't find an existing one
and decided that this alone didn't merit it, since it seems really
unlikely that we will start to allocate such resources after startup in
the postmaster. Unless we're talking about single user mode obviously,
but the spinlock isn't necessary there anyway.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: slow startup due to LWLockAssign() spinlock

From: Andres Freund

On 2014-02-03 11:22:45 -0500, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On larger, multi-socket, machines, startup takes a fair bit of time. As
> > I was profiling anyway I looked into it and noticed that just about all
> > of it is spent in LWLockAssign() called by InitBufferPool(). Starting
> > with shared_buffers=48GB on the server Nate Boley provided, takes about
> > 12 seconds. Nearly all of it spent taking the ShmemLock spinlock.
> > Simply modifying LWLockAssign() to not take the spinlock when
> > !IsUnderPostmaster speeds it up to 2 seconds. While certainly not making
> > LWLockAssign() prettier it seems enough of a speedup to be worthwhile
> > nonetheless.
>
> Hm.  This patch only works if the postmaster itself never assigns any
> LWLocks except during startup.  That's *probably* all right, but it
> seems a bit scary.  Is there any cheap way to make the logic actually
> be what your comment claims, namely "Interlocking is not necessary during
> postmaster startup"?  I guess we could invent a ShmemInitInProgress global
> flag ...

So, here's a patch implementing things with that flag. I kept your name,
as it's more in line with ipci.c's naming, but it looks kinda odd
beside proc_exit_inprogress.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: slow startup due to LWLockAssign() spinlock

From: Bruce Momjian

On Tue, Feb  4, 2014 at 12:58:49AM +0100, Andres Freund wrote:
> On 2014-02-03 11:22:45 -0500, Tom Lane wrote:
> > Andres Freund <andres@2ndquadrant.com> writes:
> > > On larger, multi-socket, machines, startup takes a fair bit of time. As
> > > I was profiling anyway I looked into it and noticed that just about all
> > > of it is spent in LWLockAssign() called by InitBufferPool(). Starting
> > > with shared_buffers=48GB on the server Nate Boley provided, takes about
> > > 12 seconds. Nearly all of it spent taking the ShmemLock spinlock.
> > > Simply modifying LWLockAssign() to not take the spinlock when
> > > !IsUnderPostmaster speeds it up to 2 seconds. While certainly not making
> > > LWLockAssign() prettier it seems enough of a speedup to be worthwhile
> > > nonetheless.
> > 
> > Hm.  This patch only works if the postmaster itself never assigns any
> > LWLocks except during startup.  That's *probably* all right, but it
> > seems a bit scary.  Is there any cheap way to make the logic actually
> > be what your comment claims, namely "Interlocking is not necessary during
> > postmaster startup"?  I guess we could invent a ShmemInitInProgress global
> > flag ...
> 
> So, here's a flag implementing things with that flag. I kept your name,
> as it's more in line with ipci.c's naming, but it looks kinda odd
> besides proc_exit_inprogress.

Uh, where are we on this?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + Everyone has their own god. +



Re: slow startup due to LWLockAssign() spinlock

From: Andres Freund

On 2014-04-16 19:33:52 -0400, Bruce Momjian wrote:
> On Tue, Feb  4, 2014 at 12:58:49AM +0100, Andres Freund wrote:
> > On 2014-02-03 11:22:45 -0500, Tom Lane wrote:
> > > Andres Freund <andres@2ndquadrant.com> writes:
> > > > On larger, multi-socket, machines, startup takes a fair bit of time. As
> > > > I was profiling anyway I looked into it and noticed that just about all
> > > > of it is spent in LWLockAssign() called by InitBufferPool(). Starting
> > > > with shared_buffers=48GB on the server Nate Boley provided, takes about
> > > > 12 seconds. Nearly all of it spent taking the ShmemLock spinlock.
> > > > Simply modifying LWLockAssign() to not take the spinlock when
> > > > !IsUnderPostmaster speeds it up to 2 seconds. While certainly not making
> > > > LWLockAssign() prettier it seems enough of a speedup to be worthwhile
> > > > nonetheless.
> > > 
> > > Hm.  This patch only works if the postmaster itself never assigns any
> > > LWLocks except during startup.  That's *probably* all right, but it
> > > seems a bit scary.  Is there any cheap way to make the logic actually
> > > be what your comment claims, namely "Interlocking is not necessary during
> > > postmaster startup"?  I guess we could invent a ShmemInitInProgress global
> > > flag ...
> > 
> > So, here's a flag implementing things with that flag. I kept your name,
> > as it's more in line with ipci.c's naming, but it looks kinda odd
> > besides proc_exit_inprogress.
> 
> Uh, where are we on this?

I guess it's waiting for the next CF :(.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: slow startup due to LWLockAssign() spinlock

From: Heikki Linnakangas

On 04/17/2014 12:06 PM, Andres Freund wrote:
> On 2014-04-16 19:33:52 -0400, Bruce Momjian wrote:
>> On Tue, Feb  4, 2014 at 12:58:49AM +0100, Andres Freund wrote:
>>> On 2014-02-03 11:22:45 -0500, Tom Lane wrote:
>>>> Andres Freund <andres@2ndquadrant.com> writes:
>>>>> On larger, multi-socket, machines, startup takes a fair bit of time. As
>>>>> I was profiling anyway I looked into it and noticed that just about all
>>>>> of it is spent in LWLockAssign() called by InitBufferPool(). Starting
>>>>> with shared_buffers=48GB on the server Nate Boley provided, takes about
>>>>> 12 seconds. Nearly all of it spent taking the ShmemLock spinlock.
>>>>> Simply modifying LWLockAssign() to not take the spinlock when
>>>>> !IsUnderPostmaster speeds it up to 2 seconds. While certainly not making
>>>>> LWLockAssign() prettier it seems enough of a speedup to be worthwhile
>>>>> nonetheless.
>>>>
>>>> Hm.  This patch only works if the postmaster itself never assigns any
>>>> LWLocks except during startup.  That's *probably* all right, but it
>>>> seems a bit scary.  Is there any cheap way to make the logic actually
>>>> be what your comment claims, namely "Interlocking is not necessary during
>>>> postmaster startup"?  I guess we could invent a ShmemInitInProgress global
>>>> flag ...
>>>
>>> So, here's a flag implementing things with that flag. I kept your name,
>>> as it's more in line with ipci.c's naming, but it looks kinda odd
>>> besides proc_exit_inprogress.
>>
>> Uh, where are we on this?
>
> I guess it's waiting for the next CF :(.

Now that we have LWLock tranches in 9.4, it might be cleanest to have 
the buffer manager allocate a separate tranche for the buffer locks. We 
could also save some memory if we got rid of the LWLock pointers in 
BufferDesc altogether, and just used the buffer id as an index into the 
LWLock array (we could do that without tranches too, but would have to 
assume that the lock ids returned by LWLockAssign() are a contiguous range).

Another idea is to add an LWLockAssignBatch(int) function that assigns a 
range of locks in one call. That would be very simple, and I think it 
would be less likely to break things than a new global flag. I would be 
OK with sneaking that into 9.4 still.
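
To make the batch idea concrete, here is a small sketch assuming lock ids are plain integers handed out from one counter. `toy_lwlock_assign_batch()` and the counting "lock" are hypothetical; the proposed LWLockAssignBatch(int) is only described in prose, so its real signature and return type are not known from this thread.

```c
#include <assert.h>

static int lock_acquisitions = 0;   /* counts how often the "spinlock" was taken */
static int next_lock_id = 0;

static void toy_lock(void)   { lock_acquisitions++; }   /* stand-in for SpinLockAcquire */
static void toy_unlock(void) { }                         /* stand-in for SpinLockRelease */

/*
 * Reserve `count` consecutive lock ids under a single lock acquisition and
 * return the first one; the caller owns ids first .. first + count - 1.
 */
static int toy_lwlock_assign_batch(int count)
{
    int first;

    assert(count > 0);
    toy_lock();
    first = next_lock_id;
    next_lock_id += count;
    toy_unlock();
    return first;
}
```

The function stays context-free: it always takes the lock, and the saving comes from taking it once per batch rather than once per lock.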

- Heikki



Re: slow startup due to LWLockAssign() spinlock

From: Andres Freund

On 2014-04-24 15:56:45 +0300, Heikki Linnakangas wrote:
> On 04/17/2014 12:06 PM, Andres Freund wrote:
> >On 2014-04-16 19:33:52 -0400, Bruce Momjian wrote:
> >>On Tue, Feb  4, 2014 at 12:58:49AM +0100, Andres Freund wrote:
> >>>On 2014-02-03 11:22:45 -0500, Tom Lane wrote:
> >>>>Andres Freund <andres@2ndquadrant.com> writes:
> >>>>>On larger, multi-socket, machines, startup takes a fair bit of time. As
> >>>>>I was profiling anyway I looked into it and noticed that just about all
> >>>>>of it is spent in LWLockAssign() called by InitBufferPool(). Starting
> >>>>>with shared_buffers=48GB on the server Nate Boley provided, takes about
> >>>>>12 seconds. Nearly all of it spent taking the ShmemLock spinlock.
> >>>>>Simply modifying LWLockAssign() to not take the spinlock when
> >>>>>!IsUnderPostmaster speeds it up to 2 seconds. While certainly not making
> >>>>>LWLockAssign() prettier it seems enough of a speedup to be worthwhile
> >>>>>nonetheless.
> >>>>
> >>>>Hm.  This patch only works if the postmaster itself never assigns any
> >>>>LWLocks except during startup.  That's *probably* all right, but it
> >>>>seems a bit scary.  Is there any cheap way to make the logic actually
> >>>>be what your comment claims, namely "Interlocking is not necessary during
> >>>>postmaster startup"?  I guess we could invent a ShmemInitInProgress global
> >>>>flag ...
> >>>
> >>>So, here's a flag implementing things with that flag. I kept your name,
> >>>as it's more in line with ipci.c's naming, but it looks kinda odd
> >>>besides proc_exit_inprogress.
> >>
> >>Uh, where are we on this?
> >
> >I guess it's waiting for the next CF :(.
> 
> Now that we have LWLock tranches in 9.4, it might be cleanest to have the
> buffer manager allocate a separate tranche for the buffer locks. We could
> also save some memory if we got rid of the LWLock pointers in BufferDesc
> altogether, and just used the buffer id as an index into the LWLock array
> (we could do that without tranches too, but would have to assume that the
> lock ids returned by LWLockAssign() are a contiguous range).

I tried that, and it's nontrivial from a performance POV because it
influences how a buffer descriptor fits into cacheline(s). I think this
needs significant experimentation.
My experimentation hinted that it'd be a good idea to put the content
lwlock inline, but not the io one, since it's accessed much less
frequently. IIRC I could fit the remainder of the buffer descriptor into
one cacheline after putting the io locks into a separate array. I wonder
if we can't somehow get rid of the io locks entirely...
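
The layout trade-off can be illustrated with toy structures. All sizes and field names below are invented for the example (a 24-byte lock, a 64-byte cache line) and do not match the real BufferDesc or LWLock; the point is only how moving the io lock into a separate array, indexed by buffer id, changes the descriptor's footprint.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHELINE 64    /* assumed cache line size */
#define TOY_NBUFFERS  4

/* Hypothetical 24-byte lock; the real LWLock layout differs. */
typedef struct ToyLock { uint64_t words[3]; } ToyLock;

/* Layout A: both locks embedded in the descriptor (80 bytes here). */
typedef struct ToyDescBothInline
{
    uint64_t tag[3];        /* which page the buffer holds */
    uint32_t flags;
    uint32_t refcount;
    ToyLock  content_lock;
    ToyLock  io_lock;
} ToyDescBothInline;

/* Layout B: content lock inline, io lock moved out (56 bytes here). */
typedef struct ToyDescContentInline
{
    uint64_t tag[3];
    uint32_t flags;
    uint32_t refcount;
    ToyLock  content_lock;
} ToyDescContentInline;

/* The rarely used io locks live in their own array, indexed by buffer id. */
static ToyLock toy_io_locks[TOY_NBUFFERS];

static ToyLock *toy_io_lock(int buf_id)
{
    assert(buf_id >= 0 && buf_id < TOY_NBUFFERS);
    return &toy_io_locks[buf_id];
}

/* Cache lines touched by an object of `size` bytes that starts on a line boundary. */
static size_t toy_cachelines_needed(size_t size)
{
    return (size + TOY_CACHELINE - 1) / TOY_CACHELINE;
}

_Static_assert(sizeof(ToyDescContentInline) <= TOY_CACHELINE,
               "layout B is meant to fit in one cache line");
```

Whether the real descriptor fits depends on the actual field sizes and alignment, which is why this needs measurement rather than arithmetic alone.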

> Another idea is to add an LWLockAssignBatch(int) function that assigns a
> range of locks in one call. That would be very simple, and I think it would
> be less likely to break things than a new global flag. I would be OK with
> sneaking that into 9.4 still.

I don't really see the advantage tbh. Assuming we always can avoid the
spinlock initially seems simple enough - and I have significant doubts
that anything but buffer locks will need enough locks that it matters
for other users.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: slow startup due to LWLockAssign() spinlock

From: Tom Lane

Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-04-24 15:56:45 +0300, Heikki Linnakangas wrote:
>> Another idea is to add an LWLockAssignBatch(int) function that assigns a
>> range of locks in one call. That would be very simple, and I think it would
>> be less likely to break things than a new global flag. I would be OK with
>> sneaking that into 9.4 still.

> I don't really see the advantage tbh. Assuming we always can avoid the
> spinlock initially seems simple enough - and I have significant doubts
> that anything but buffer locks will need enough locks that it matters
> for other users.

FWIW, I like the LWLockAssignBatch idea a lot better than the currently
proposed patch.  LWLockAssign is a low-level function that has no business
making risky assumptions about the context it's invoked in.

The other ideas are 9.5 material at this point, since they involve
research --- but I agree with Heikki that LWLockAssignBatch could be
snuck into 9.4 still.
        regards, tom lane



Re: slow startup due to LWLockAssign() spinlock

From: Andres Freund

On 2014-04-24 11:02:44 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-04-24 15:56:45 +0300, Heikki Linnakangas wrote:
> >> Another idea is to add an LWLockAssignBatch(int) function that assigns a
> >> range of locks in one call. That would be very simple, and I think it would
> >> be less likely to break things than a new global flag. I would be OK with
> >> sneaking that into 9.4 still.
> 
> > I don't really see the advantage tbh. Assuming we always can avoid the
> > spinlock initially seems simple enough - and I have significant doubts
> > that anything but buffer locks will need enough locks that it matters
> > for other users.
> 
> FWIW, I like the LWLockAssignBatch idea a lot better than the currently
> proposed patch.  LWLockAssign is a low-level function that has no business
> making risky assumptions about the context it's invoked in.

I don't think LWLockAssignBatch() is that easy without introducing
layering violations. It can't just return a pointer out of the main
lwlock array that then can be ++ed clientside because MainLWLockArray's
stride isn't sizeof(LWLock).
We could just add a LWLockAssignStartup(), that'd be pretty
trivial. Whoever uses it later gets to keep the pieces...

I guess if it's not that, the whole thing is 9.5 material. Once those
locks are in a separate tranche the whole thing is moot anyway.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: slow startup due to LWLockAssign() spinlock

From: Heikki Linnakangas

On 04/24/2014 07:24 PM, Andres Freund wrote:
> On 2014-04-24 11:02:44 -0400, Tom Lane wrote:
>> Andres Freund <andres@2ndquadrant.com> writes:
>>> On 2014-04-24 15:56:45 +0300, Heikki Linnakangas wrote:
>>>> Another idea is to add an LWLockAssignBatch(int) function that assigns a
>>>> range of locks in one call. That would be very simple, and I think it would
>>>> be less likely to break things than a new global flag. I would be OK with
>>>> sneaking that into 9.4 still.
>>
>>> I don't really see the advantage tbh. Assuming we always can avoid the
>>> spinlock initially seems simple enough - and I have significant doubts
>>> that anything but buffer locks will need enough locks that it matters
>>> for other users.
>>
>> FWIW, I like the LWLockAssignBatch idea a lot better than the currently
>> proposed patch.  LWLockAssign is a low-level function that has no business
>> making risky assumptions about the context it's invoked in.
>
> I don't think LWLockAssignBatch() is that easy without introducing
> layering violations. It can't just return a pointer out of the main
> lwlock array that then can be ++ed clientside because MainLWLockArray's
> stride isn't sizeof(LWLock).

Well, it could copy the pointers to an array of pointers that the caller 
provides. Or palloc an array and return that. Allocating a large enough 
array to hold NBuffers locks might not be nice, but if you do it in 
batches of, say, 1k locks, that would make it fast enough. Makes the 
caller a bit more complicated, but still might be worth it.
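
A sketch of the caller-provided-array variant, with batches of 1000. Everything named `toy_` or `TOY_` is hypothetical, including the 64-byte padded union standing in for the main lock array and the one-field descriptor; the sketch only shows that the caller never needs to know the array stride.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NLOCKS   4096
#define TOY_NBUFFERS 2500
#define TOY_BATCH    1000

typedef struct ToyLock { int state; } ToyLock;

/* Padded slot: the array stride is sizeof(ToyLockPadded), not sizeof(ToyLock). */
typedef union ToyLockPadded { ToyLock lock; char pad[64]; } ToyLockPadded;

typedef struct ToyBufferDesc { ToyLock *content_lock; } ToyBufferDesc;

static ToyLockPadded toy_main_array[TOY_NLOCKS];
static ToyBufferDesc toy_buffers[TOY_NBUFFERS];
static int next_lock = 0;
static int lock_acquisitions = 0;

/* Fill out[0..count-1] with pointers to freshly assigned locks. */
static void toy_assign_batch(ToyLock **out, int count)
{
    assert(count > 0 && next_lock + count <= TOY_NLOCKS);
    lock_acquisitions++;            /* stands in for taking the spinlock once */
    for (int i = 0; i < count; i++)
        out[i] = &toy_main_array[next_lock++].lock;
}

/* Caller side: give every buffer its lock, TOY_BATCH pointers at a time. */
static void toy_init_buffers(void)
{
    ToyLock *scratch[TOY_BATCH];

    for (int done = 0; done < TOY_NBUFFERS; )
    {
        int n = TOY_NBUFFERS - done;

        if (n > TOY_BATCH)
            n = TOY_BATCH;
        toy_assign_batch(scratch, n);
        for (int i = 0; i < n; i++)
            toy_buffers[done + i].content_lock = scratch[i];
        done += n;
    }
}
```

With 2500 buffers the "spinlock" is taken three times instead of 2500, at the cost of a small scratch array and an extra loop in the caller.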

- Heikki



Re: slow startup due to LWLockAssign() spinlock

From: Tom Lane

Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-04-24 11:02:44 -0400, Tom Lane wrote:
>> FWIW, I like the LWLockAssignBatch idea a lot better than the currently
>> proposed patch.  LWLockAssign is a low-level function that has no business
>> making risky assumptions about the context it's invoked in.

> I don't think LWLockAssignBatch() is that easy without introducing
> layering violations. It can't just return a pointer out of the main
> lwlock array that then can be ++ed clientside because MainLWLockArray's
> stride isn't sizeof(LWLock).

Meh.  I knew this business of using pointers instead of indexes would
have some downsides.

We could return the array stride ... kinda ugly, but since there's
probably only one consumer for this API, it's not *that* bad.  Could
wrap the stride-increment in a macro, perhaps.
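
One way the stride-returning variant might look, as a sketch only. `ToyLockRange`, `TOY_NEXT_LOCK`, and the 64-byte padded union are hypothetical; the real MainLWLockArray element type and padding are not spelled out in this thread.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NLOCKS 16

typedef struct ToyLock { int state; } ToyLock;

/* Padded slot, so consecutive locks are sizeof(ToyLockPadded) bytes apart. */
typedef union ToyLockPadded { ToyLock lock; char pad[64]; } ToyLockPadded;

/* What a batch call could hand back: first lock, byte stride, and count. */
typedef struct ToyLockRange
{
    ToyLock *first;
    size_t   stride;
    int      count;
} ToyLockRange;

static ToyLockPadded toy_main_array[TOY_NLOCKS];
static int next_lock = 0;

/* Step to the next lock by the array's byte stride, not by sizeof(ToyLock). */
#define TOY_NEXT_LOCK(ptr, stride) ((ToyLock *) ((char *) (ptr) + (stride)))

static ToyLockRange toy_assign_batch(int count)
{
    ToyLockRange range;

    assert(count > 0 && next_lock + count <= TOY_NLOCKS);
    range.first = &toy_main_array[next_lock].lock;
    range.stride = sizeof(ToyLockPadded);
    range.count = count;
    next_lock += count;
    return range;
}

/* Walk to the n-th lock of a range using the stride macro. */
static ToyLock *toy_lock_at(ToyLockRange range, int n)
{
    ToyLock *p = range.first;

    assert(n >= 0 && n < range.count);
    for (int i = 0; i < n; i++)
        p = TOY_NEXT_LOCK(p, range.stride);
    return p;
}
```

A plain `p++` on the returned pointer would advance by sizeof(ToyLock) and land inside the padding of the same slot, which is the problem the macro hides.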
        regards, tom lane



Re: slow startup due to LWLockAssign() spinlock

From: Andres Freund

On 2014-04-24 12:43:13 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-04-24 11:02:44 -0400, Tom Lane wrote:
> >> FWIW, I like the LWLockAssignBatch idea a lot better than the currently
> >> proposed patch.  LWLockAssign is a low-level function that has no business
> >> making risky assumptions about the context it's invoked in.
> 
> > I don't think LWLockAssignBatch() is that easy without introducing
> > layering violations. It can't just return a pointer out of the main
> > lwlock array that then can be ++ed clientside because MainLWLockArray's
> > stride isn't sizeof(LWLock).
> 
> Meh.  I knew this business of using pointers instead of indexes would
> have some downsides.
> 
> We could return the array stride ... kinda ugly, but since there's
> probably only one consumer for this API, it's not *that* bad.  Could
> wrap the stride-increment in a macro, perhaps.

I think I am just going to wait for 9.5 where I sure hope we can
allocate the buffer lwlocks outside the main array...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: slow startup due to LWLockAssign() spinlock

From: Andres Freund

On 2014-04-24 23:28:14 +0200, Andres Freund wrote:
> On 2014-04-24 12:43:13 -0400, Tom Lane wrote:
> > Andres Freund <andres@2ndquadrant.com> writes:
> > > On 2014-04-24 11:02:44 -0400, Tom Lane wrote:
> > >> FWIW, I like the LWLockAssignBatch idea a lot better than the currently
> > >> proposed patch.  LWLockAssign is a low-level function that has no business
> > >> making risky assumptions about the context it's invoked in.
> >
> > > I don't think LWLockAssignBatch() is that easy without introducing
> > > layering violations. It can't just return a pointer out of the main
> > > lwlock array that then can be ++ed clientside because MainLWLockArray's
> > > stride isn't sizeof(LWLock).
> >
> > Meh.  I knew this business of using pointers instead of indexes would
> > have some downsides.
> >
> > We could return the array stride ... kinda ugly, but since there's
> > probably only one consumer for this API, it's not *that* bad.  Could
> > wrap the stride-increment in a macro, perhaps.
>
> I think I am just going to wait for 9.5 where I sure hope we can
> allocate the buffer lwlocks outside the main array...

For reference (and backup), here's my current patch for that.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment