Thread: StrategyGetBuffer optimization, take 2

StrategyGetBuffer optimization, take 2

From
Merlin Moncure
Date:
My $company recently acquired another postgres based $company and
migrated all their server operations into our datacenter.  Upon
completing the move, the newly migrated database server started
experiencing huge load spikes.


*) Environment description:
Postgres 9.2.4
RHEL 6
32 cores
virtualized (ESX) but with a dedicated host
256GB ram
shared_buffers: 32G
96 application servers configured to max 5 connections each
very fast i/o
database size: ~ 200GB
HS/SR: 3 slaves

*) Problem description:
The server normally hums along nicely with load < 1.0 and no iowait --
in fact the server is massively over-provisioned.  However, on
semi-random basis (once every 1-2 days) load absolutely goes through
the roof to 600+, no iowait, 90-100%  (70%+ sys) cpu.  It hangs around
like that for 5-20 minutes then resolves as suddenly as it started.
There is nothing interesting going on on the application side (except
that the application servers are all piling on), but pg_locks is
recording lots of contention on relation extension locks.  One
interesting point is that the slaves are also affected, but the onset
of the high load happens some seconds after the master.

*) Initial steps taken:
RhodiumToad (aka Andrew G) has seen this in the wild several times and
suggested that dropping shared_buffers significantly might resolve the
situation short term.  That was done on Friday night, and so far the
problem has not recurred.

*) What I think is happening:
I think we are again getting burned by getting de-scheduled while
holding the free list lock. I've been chasing this problem for a long
time now (for example, see:
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
but now I've got a reproducible case.  What is happening is this (the
stock loop is sketched just below the steps):

1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
2. call ReadBufferBI.  this goes down the chain until StrategyGetBuffer()
3. Lock free list, go into clock sweep loop
4. while holding clock sweep, hit 'hot' buffer, spin on it
5. get de-scheduled
6. now enter the 'hot buffer spin lock lottery'
7. more and more backends pile on, linux scheduler goes berserk, reducing
chances of winning #6
8. finally win the lottery. lock released. everything back to normal.
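
For reference, here is roughly what the stock loop looks like (a heavy
paraphrase of the 9.2-era StrategyGetBuffer() clock sweep in freelist.c,
with the freelist path and some error handling elided).  The spin in
step 4 happens at LockBufHdr() while BufFreelistLock is still held:

LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);   /* step 3 */

trycounter = NBuffers;
for (;;)
{
    buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
    if (++StrategyControl->nextVictimBuffer >= NBuffers)
    {
        StrategyControl->nextVictimBuffer = 0;
        StrategyControl->completePasses++;
    }

    LockBufHdr(buf);            /* step 4: can spin on a hot buffer's
                                 * header while BufFreelistLock is held */
    if (buf->refcount == 0)
    {
        if (buf->usage_count > 0)
        {
            buf->usage_count--;     /* decrement requires the header lock */
            trycounter = NBuffers;
        }
        else
            return buf;             /* victim found; header lock still held */
    }
    else if (--trycounter == 0)
        elog(ERROR, "no unpinned buffers available");
    UnlockBufHdr(buf);
}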

*) what I would like to do to fix it:
see attached patch.  This builds on the work of Jeff Janes to remove
the free list lock and has some extra optimizations in the clock sweep
loop:

optimization 1: usage count is advisory. it is not updated behind the
buffer header lock. in the event there is a large sequence of buffers with
>0 usage_count, this avoids hammering each buffer's header lock (and its
cache line); you decrement and hope for the best

optimization 2: refcount is examined during buffer allocation without
a lock.  if it's > 0, buffer is assumed pinned (even though it may not
in fact be) and sweep continues

optimization 3: sweep does not wait on buf header lock.  instead, it
does 'try lock' and bails if the buffer is determined pinned.  I
believe this to be one of the two critical optimizations

optimization 4: remove free list lock (via Jeff Janes).  This is the
other optimization: one backend will no longer be able to shut down
buffer allocation
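
Putting optimizations 1-3 together, the sweep body comes out looking
roughly like this (a sketch of the idea only, not the literal patch;
ConditionalLockBufHdr() is a stand-in name for whatever the header
trylock primitive ends up being):

/* inside the for(;;) sweep loop; trycounter handling elided */
buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
/* ... advance nextVictimBuffer as before ... */

if (buf->refcount > 0)          /* opt 2: unlocked peek, may be stale */
    continue;                   /* assume pinned and keep sweeping */

if (buf->usage_count > 0)       /* opt 1: advisory, no header lock taken */
{
    buf->usage_count--;         /* racy; a lost update just means this
                                 * buffer survives one extra sweep */
    continue;
}

if (!ConditionalLockBufHdr(buf))    /* opt 3: never spin on the header */
    continue;                       /* busy header == treat as pinned */

if (buf->refcount == 0 && buf->usage_count == 0)
    return buf;                 /* victim found; header lock held */

UnlockBufHdr(buf);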

*) what I'm asking for

Are the analysis and the patch to fix the perceived problem plausible
without breaking other stuff?  If so, I'm inclined to go further with
this.   This is not the only solution on the table for high buffer
contention, but IMNSHO it should get a lot of points for being very
localized.  Maybe a reduced version could be tried retaining the
freelist lock but keeping the 'trylock' on the buf header.

*) further reading:

http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html

http://www.postgresql.org/message-id/CAHyXU0x47D4n6EdPyNyadShXQQXKoheLV2cbRgr_2NGrC8KRRQ@mail.gmail.com

http://postgresql.1045698.n5.nabble.com/Page-replacement-algorithm-in-buffer-cache-td5749236.html


merlin

Attachment

Re: StrategyGetBuffer optimization, take 2

From
Andres Freund
Date:
On 2013-08-05 10:49:08 -0500, Merlin Moncure wrote:
> optimization 4: remove free list lock (via Jeff Janes).  This is the
> other optimization: one backend will no longer be able to shut down
> buffer allocation

I think splitting off the actual freelist checking into a spinlock makes
quite a bit of sense. And I think a separate one should be used for the
actual search for the clock sweep.
I don't think the unlocked increment of nextVictimBuffer is a good idea
though. nextVictimBuffer jumping over NBuffers under concurrency seems
like a recipe for disaster to me. At the very, very least it will need a
good wad of comments explaining what it means and how you're allowed to
use it. The current way will lead to at least bgwriter accessing a
nonexistent/out-of-bounds buffer via StrategySyncStart().
Possibly it won't even save that much, it might just increase the
contention on the buffer header spinlock's cacheline.
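
To make the failure mode concrete (an illustration only, not code from
the patch): nothing stops another backend -- or the bgwriter via
StrategySyncStart() -- from reading the counter in the window between
the unlocked increment and the wrap-back:

/* backend A, sweeping with the unlocked increment */
buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
StrategyControl->nextVictimBuffer++;        /* momentarily == NBuffers */
/*
 * <-- another backend, or StrategySyncStart(), can read nextVictimBuffer
 *     right here and index BufferDescriptors[NBuffers], one past the end
 *     of the array; with many concurrent backends it can be pushed even
 *     further out of bounds
 */
if (StrategyControl->nextVictimBuffer >= NBuffers)
    StrategyControl->nextVictimBuffer = 0;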

Greetings,

Andres Freund



Re: StrategyGetBuffer optimization, take 2

From
Atri Sharma
Date:
> optimization 2: refcount is examined during buffer allocation without
> a lock.  if it's > 0, buffer is assumed pinned (even though it may not
> in fact be) and sweep continues

+1.

I think this should not lead to many problems, since a lost update
cannot, IMO, lead to a disastrous result. At most, a buffer page can
survive for an extra clock sweep.


> optimization 3: sweep does not wait on buf header lock.  instead, it
> does 'try lock' and bails if the buffer is determined pinned.  I
> believe this to be one of the two critical optimizations

+1 again. I believe the try lock will be a sort of filter before the
actual check, hence reducing the contention.


Regards,

Atri



-- 
Regards,

Atri
l'apprenant



Re: StrategyGetBuffer optimization, take 2

From
Merlin Moncure
Date:
On Mon, Aug 5, 2013 at 11:40 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2013-08-05 10:49:08 -0500, Merlin Moncure wrote:
>> optimization 4: remove free list lock (via Jeff Janes).  This is the
>> other optimization: one backend will no longer be able to shut down
>> buffer allocation
>
> I think splitting off the actual freelist checking into a spinlock makes
> quite a bit of sense. And I think a separate one should be used for the
> actual search for the clock sweep.


> I don't think the unlocked increment of nextVictimBuffer is a good idea
> though. nextVictimBuffer jumping over NBuffers under concurrency seems
> like a recipe for disaster to me. At the very, very least it will need a
> good wad of comments explaining what it means and how you're allowed to
> use it. The current way will lead to at least bgwriter accessing a
> nonexistant/out of bounds buffer via StrategySyncStart().
> Possibly it won't even save that much, it might just increase the
> contention on the buffer header spinlock's cacheline.

I agree; at least then it's not unambiguously better. if you (in
effect) swap all contention on allocation from a lwlock to a spinlock
it's not clear if you're improving things; it would have to be proven
and I'm trying to keep things simple.

Attached is a scaled down version of the patch that keeps the freelist
lock but still removes the spinlock during the clock sweep.  This
still hits the major objectives of reducing the chance of scheduling
out while holding the BufFreelistLock and mitigating the worst case
impact of doing so if it does happen.  An even more scaled down
version would keep the current logic exactly as is except for
replacing buffer lock in the clock sweep with a trylock (which is
IMNSHO a no-brainer).
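
For concreteness, the 'one liner' variant amounts to something like
this (sketch only: TryLockBufHdr is a name I'm making up here, built on
the existing TAS() primitive, which returns zero when the lock is
acquired):

/* hypothetical non-blocking flavor of the stock LockBufHdr() macro */
#define TryLockBufHdr(bufHdr)  (TAS(&(bufHdr)->buf_hdr_lock) == 0)

/* ...and in the sweep, instead of an unconditional LockBufHdr(buf): */
if (!TryLockBufHdr(buf))
    continue;       /* header lock busy: treat the buffer as pinned */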

merlin

Attachment

Re: StrategyGetBuffer optimization, take 2

From
Andres Freund
Date:
On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:
> > I don't think the unlocked increment of nextVictimBuffer is a good idea
> > though. nextVictimBuffer jumping over NBuffers under concurrency seems
> > like a recipe for disaster to me. At the very, very least it will need a
> > good wad of comments explaining what it means and how you're allowed to
> > use it. The current way will lead to at least bgwriter accessing a
> > nonexistant/out of bounds buffer via StrategySyncStart().
> > Possibly it won't even save that much, it might just increase the
> > contention on the buffer header spinlock's cacheline.
> 
> I agree; at least then it's not unambiguously better. if you (in
> effect) swap all contention on allocation from a lwlock to a spinlock
> it's not clear if you're improving things; it would have to be proven
> and I'm trying to keep things simple.

I think converting it to a spinlock actually is a good idea, you just
need to expand the scope a bit.

> Attached is a scaled down version of the patch that keeps the freelist
> lock but still removes the spinlock during the clock sweep.  This
> still hits the major objectives of reducing the chance of scheduling
> out while holding the BufFreelistLock and mitigating the worst case
> impact of doing so if it does happen.

FWIW, I am not convinced this is the trigger for the problems you're
seeing. It's a good idea nonetheless though.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: StrategyGetBuffer optimization, take 2

From
Merlin Moncure
Date:
On Wed, Aug 7, 2013 at 12:07 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:
>> > I don't think the unlocked increment of nextVictimBuffer is a good idea
>> > though. nextVictimBuffer jumping over NBuffers under concurrency seems
>> > like a recipe for disaster to me. At the very, very least it will need a
>> > good wad of comments explaining what it means and how you're allowed to
>> > use it. The current way will lead to at least bgwriter accessing a
>> > nonexistant/out of bounds buffer via StrategySyncStart().
>> > Possibly it won't even save that much, it might just increase the
>> > contention on the buffer header spinlock's cacheline.
>>
>> I agree; at least then it's not unambiguously better. if you (in
>> effect) swap all contention on allocation from a lwlock to a spinlock
>> it's not clear if you're improving things; it would have to be proven
>> and I'm trying to keep things simple.
>
> I think converting it to a spinlock actually is a good idea, you just
> need to expand the scope a bit.

all right: well, I'll work up another version doing full spinlock and
maybe see things shake out in performance.

> FWIW, I am not convinced this is the trigger for the problems you're
> seing. It's a good idea nonetheless though.

I have some very strong evidence that the problem is coming out of the
buffer allocator.  Exhibit A is that vlad's presentation of the
problem was on a read only load (if not allocator lock, then what?).
Exhibit B is that lowering shared buffers to 2gb seems to have (so
far, 5 days in) fixed the issue.  This problem shows up on fast
machines with fast storage and lots of cores.  So what I think is
happening is that usage_count starts creeping up faster than it gets
cleared by the sweep with very large buffer settings which in turn
causes the 'problem' buffers to be analyzed for eviction more often.
What is not as clear is whether the proposed optimizations will fix the
problem -- I'd have to get approval to test and confirm them in
production, which seems unlikely at this juncture; that's why I'm
trying to keep things 'win-win' so that they don't have to be accepted
on that basis.

merlin



Re: StrategyGetBuffer optimization, take 2

From
Atri Sharma
Date:
On Wed, Aug 7, 2013 at 10:37 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:
>> > I don't think the unlocked increment of nextVictimBuffer is a good idea
>> > though. nextVictimBuffer jumping over NBuffers under concurrency seems
>> > like a recipe for disaster to me. At the very, very least it will need a
>> > good wad of comments explaining what it means and how you're allowed to
>> > use it. The current way will lead to at least bgwriter accessing a
>> > nonexistant/out of bounds buffer via StrategySyncStart().
>> > Possibly it won't even save that much, it might just increase the
>> > contention on the buffer header spinlock's cacheline.
>>
>> I agree; at least then it's not unambiguously better. if you (in
>> effect) swap all contention on allocation from a lwlock to a spinlock
>> it's not clear if you're improving things; it would have to be proven
>> and I'm trying to keep things simple.
>
> I think converting it to a spinlock actually is a good idea, you just
> need to expand the scope a bit.

Umm, sorry if I am being naive, but don't spinlocks perform badly when
a lot of contention is present?

I feel that we may hit that case here. A preliminary check before the
actual spinlock may be good, though, since spinlocks are cheap as long
as contention remains low.

Regards,

Atri

-- 
Regards,

Atri
l'apprenant



Re: StrategyGetBuffer optimization, take 2

From
Amit Kapila
Date:

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Merlin Moncure
> Sent: Thursday, August 08, 2013 12:09 AM
> To: Andres Freund
> Cc: PostgreSQL-development; Jeff Janes
> Subject: Re: [HACKERS] StrategyGetBuffer optimization, take 2
> 
> On Wed, Aug 7, 2013 at 12:07 PM, Andres Freund <andres@2ndquadrant.com>
> wrote:
> > On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:
> >> > I don't think the unlocked increment of nextVictimBuffer is a good idea
> >> > though. nextVictimBuffer jumping over NBuffers under concurrency seems
> >> > like a recipe for disaster to me. At the very, very least it will need a
> >> > good wad of comments explaining what it means and how you're allowed to
> >> > use it. The current way will lead to at least bgwriter accessing a
> >> > nonexistant/out of bounds buffer via StrategySyncStart().
> >> > Possibly it won't even save that much, it might just increase the
> >> > contention on the buffer header spinlock's cacheline.
> >>
> >> I agree; at least then it's not unambiguously better. if you (in
> >> effect) swap all contention on allocation from a lwlock to a spinlock
> >> it's not clear if you're improving things; it would have to be proven
> >> and I'm trying to keep things simple.
> >
> > I think converting it to a spinlock actually is a good idea, you just
> > need to expand the scope a bit.
> 
> all right: well, I'll work up another version doing full spinlock and
> maybe see things shake out in performance.
> 
> > FWIW, I am not convinced this is the trigger for the problems you're
> > seing. It's a good idea nonetheless though.
> 
> I have some very strong evidence that the problem is coming out of the
> buffer allocator.  Exhibit A is that vlad's presentation of the
> problem was on a read only load (if not allocator lock, then what?).
> Exhibit B is that lowering shared buffers to 2gb seems to have (so
> far, 5 days in) fixed the issue.  This problem shows up on fast
> machines with fast storage and lots of cores.  So what I think is
> happening is that usage_count starts creeping up faster than it gets
> cleared by the sweep with very large buffer settings which in turn
> causes the 'problem' buffers to be analyzed for eviction more often.
 Yes, one idea which was discussed previously is to not increase the
usage count every time a buffer is pinned.  I am also working on some
optimizations in a similar area, which you can refer to here:
http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kapila@huawei.com


> What is not as clear is if the proposed optimizations will fix the
> problem -- I'd have to get approval to test and confirm them in
> production which seems unlikely at this juncture; that's why I'm
> trying to keep things 'win-win' so as to not have to have them be
> accepted on that basis.

With Regards,
Amit Kapila.




Re: StrategyGetBuffer optimization, take 2

From
Merlin Moncure
Date:
On Wed, Aug 7, 2013 at 11:52 PM, Amit Kapila <amit.kapila@huawei.com> wrote:
>> -----Original Message-----
>> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
>> owner@postgresql.org] On Behalf Of Merlin Moncure
>> Sent: Thursday, August 08, 2013 12:09 AM
>> To: Andres Freund
>> Cc: PostgreSQL-development; Jeff Janes
>> Subject: Re: [HACKERS] StrategyGetBuffer optimization, take 2
>>
>> On Wed, Aug 7, 2013 at 12:07 PM, Andres Freund <andres@2ndquadrant.com>
>> wrote:
>> > On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:
>> >> > I don't think the unlocked increment of nextVictimBuffer is a good idea
>> >> > though. nextVictimBuffer jumping over NBuffers under concurrency seems
>> >> > like a recipe for disaster to me. At the very, very least it will need a
>> >> > good wad of comments explaining what it means and how you're allowed to
>> >> > use it. The current way will lead to at least bgwriter accessing a
>> >> > nonexistant/out of bounds buffer via StrategySyncStart().
>> >> > Possibly it won't even save that much, it might just increase the
>> >> > contention on the buffer header spinlock's cacheline.
>> >>
>> >> I agree; at least then it's not unambiguously better. if you (in
>> >> effect) swap all contention on allocation from a lwlock to a spinlock
>> >> it's not clear if you're improving things; it would have to be proven
>> >> and I'm trying to keep things simple.
>> >
>> > I think converting it to a spinlock actually is a good idea, you just
>> > need to expand the scope a bit.
>>
>> all right: well, I'll work up another version doing full spinlock and
>> maybe see things shake out in performance.
>>
>> > FWIW, I am not convinced this is the trigger for the problems you're
>> > seing. It's a good idea nonetheless though.
>>
>> I have some very strong evidence that the problem is coming out of the
>> buffer allocator.  Exhibit A is that vlad's presentation of the
>> problem was on a read only load (if not allocator lock, then what?).
>> Exhibit B is that lowering shared buffers to 2gb seems to have (so
>> far, 5 days in) fixed the issue.  This problem shows up on fast
>> machines with fast storage and lots of cores.  So what I think is
>> happening is that usage_count starts creeping up faster than it gets
>> cleared by the sweep with very large buffer settings which in turn
>> causes the 'problem' buffers to be analyzed for eviction more often.
>
>   Yes one idea which was discussed previously is to not increase usage
> count, every time buffer is pinned.
>   I am also working on some of the optimizations on similar area, which you
> can refer here:
>
> http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kapila@
> huawei.com

yup -- just took a quick look at your proposed patch.  You're
attacking the 'freelist' side of buffer allocation where my stripped
down patch addresses issues with the clocksweep.  I think this is a
good idea but more than I wanted to get into personally.

Good news is that both patches should essentially bolt on together
AFAICT.  I propose we do a bit of consolidation of performance testing
efforts and run tests with patch A, B, and AB in various scenarios.  I
have a 16 core vm (4gb ram) that I can test with and want to start
with say 2gb database 1gb shared_buffers high concurrency test and see
how it burns in.  What do you think?   Are you at a point where we can
run some tests?

merlin



Re: StrategyGetBuffer optimization, take 2

From
Amit Kapila
Date:
Merlin Moncure wrote:
On Wed, Aug 7, 2013 at 11:52 PM, Amit Kapila
<amit(dot)kapila(at)huawei(dot)com> wrote:
>>> -----Original Message-----
>>> From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-
>>> owner(at)postgresql(dot)org] On Behalf Of Merlin Moncure
>>> Sent: Thursday, August 08, 2013 12:09 AM
>>> To: Andres Freund
>>> Cc: PostgreSQL-development; Jeff Janes
>>> Subject: Re: [HACKERS] StrategyGetBuffer optimization, take 2
>>>
>>> On Wed, Aug 7, 2013 at 12:07 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
>>> wrote:
>>> > On 2013-08-07 09:40:24 -0500, Merlin Moncure wrote:

>>> I have some very strong evidence that the problem is coming out of the
>>> buffer allocator.  Exhibit A is that vlad's presentation of the
>>> problem was on a read only load (if not allocator lock, then what?).
>>> Exhibit B is that lowering shared buffers to 2gb seems to have (so
>>> far, 5 days in) fixed the issue.  This problem shows up on fast
>>> machines with fast storage and lots of cores.  So what I think is
>>> happening is that usage_count starts creeping up faster than it gets
>>> cleared by the sweep with very large buffer settings which in turn
>>> causes the 'problem' buffers to be analyzed for eviction more often.
>
>>   Yes one idea which was discussed previously is to not increase usage
>> count, every time buffer is pinned.
>>   I am also working on some of the optimizations on similar area, which you
>> can refer here:
>>
>> http://www.postgresql.org/message-id/006e01ce926c$c7768680$56639380$@kapila@
>> huawei.com

> yup -- just took a quick look at your proposed patch.  You're
> attacking the 'freelist' side of buffer allocation where my stripped
> down patch addresses issues with the clocksweep.  I think this is a
> good idea but more than I wanted to get into personally.

> Good news is that both patches should essentially bolt on together
> AFAICT.

True, I also think so, as both are trying to reduce contention in the same area.

>  I propose we do a bit of consolidation of performance testing
> efforts and run tests with patch A, B, and AB in various scenarios.  I
> have a 16 core vm (4gb ram) that I can test with and want to start
> with say 2gb database 1gb shared_buffers high concurrency test and see
> how it burns in.  What do you think?

I think this can mainly benefit with large data and shared buffers (>
10G); last year I also ran a few tests with similar ideas but didn't
see much benefit with smaller shared buffers.

>  Are you at a point where we can
> run some tests?

Not now, but I will try to run before/during next CF.


With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: StrategyGetBuffer optimization, take 2

From
Merlin Moncure
Date:
Performance testing this patch is a real bugaboo for me; the VMs I have
to work with are too unstable to give useful results :-(.  Need to
scrounge up a donor box somewhere...

On Tue, Aug 13, 2013 at 12:26 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> [...]
>
>>  Are you at a point where we can
>> run some tests?
>
> Not now, but I will try to run before/during next CF.

Re: StrategyGetBuffer optimization, take 2

From
Jim Nasby
Date:
On 8/14/13 8:30 AM, Merlin Moncure wrote:
> Performance testing this patch is a real bugaboo for me; the VMs I have to work with are too unstable to give useful
> results :-(.  Need to scrounge up a donor box somewhere...
 

I offered a server or two to the community a while ago but I don't think anything was ever resolved. It wouldn't have a
massive number of cores (only 24 IIRC), but it would have a lot of memory (definitely over 192G; maybe more).
 

The community just needs to decide whether it wants it and where it gets shipped to...
-- 
Jim Nasby, Lead Data Architect
(512) 569-9461 (primary) (512) 579-9024 (backup)



Re: StrategyGetBuffer optimization, take 2

From
Amit Kapila
Date:
On Wed, Aug 14, 2013 at 7:00 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Performance testing this patch is a real bugaboo for me; the VMs I have to
> work with are too unstable to give useful results :-(.  Need to scrounge up
> a donor box somewhere...
  While doing performance tests in this area, I always had a feeling
that the OS layer (the scheduler that flushes OS buffers) had a role to
play here, so I used to reboot the m/c between tests; of course, taking
performance data in this fashion is tedious.  I think it is good to run
tests on a m/c with a configuration similar to what Jim Nasby mentioned
in his mail below.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: StrategyGetBuffer optimization, take 2

From
Robert Haas
Date:
On Mon, Aug 5, 2013 at 11:49 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> *) What I think is happening:
> I think we are again getting burned by getting de-scheduled while
> holding the free list lock. I've been chasing this problem for a long
> time now (for example, see:
> http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
> but not I've got a reproducible case.  What is happening this:
>
> 1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
> 2. call ReadBufferBI.  this goes down the chain until StrategyGetBuffer()
> 3. Lock free list, go into clock sweep loop
> 4. while holding clock sweep, hit 'hot' buffer, spin on it
> 5. get de-scheduled
> 6. now enter the 'hot buffer spin lock lottery'
> 7. more/more backends pile on, linux scheduler goes bezerk, reducing
> chances of winning #6
> 8. finally win the lottery. lock released. everything back to normal.

This is an interesting theory, but where's the evidence?  I've seen
spinlock contention come from enough different places to be wary of
arguments that start with "it must be happening because...".

IMHO, the thing to do here is run perf record -g during one of the
trouble periods.  The performance impact is quite low.  You could
probably even set up a script that runs perf for five minute intervals
at a time and saves all of the perf.data files.  When one of these
spikes happens, grab the one that's relevant.

If you see that s_lock is where all the time is going, then you've
proved it's a PostgreSQL spinlock rather than something in the kernel
or a shared library.  If you can further see what's calling s_lock
(which should hopefully be possible with perf -g), then you've got it
nailed dead to rights.

...Robert



Re: StrategyGetBuffer optimization, take 2

From
Merlin Moncure
Date:
On Sat, Aug 17, 2013 at 10:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Aug 5, 2013 at 11:49 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>> *) What I think is happening:
>> I think we are again getting burned by getting de-scheduled while
>> holding the free list lock. I've been chasing this problem for a long
>> time now (for example, see:
>> http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
>> but not I've got a reproducible case.  What is happening this:
>>
>> 1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
>> 2. call ReadBufferBI.  this goes down the chain until StrategyGetBuffer()
>> 3. Lock free list, go into clock sweep loop
>> 4. while holding clock sweep, hit 'hot' buffer, spin on it
>> 5. get de-scheduled
>> 6. now enter the 'hot buffer spin lock lottery'
>> 7. more/more backends pile on, linux scheduler goes bezerk, reducing
>> chances of winning #6
>> 8. finally win the lottery. lock released. everything back to normal.
>
> This is an interesting theory, but where's the evidence?  I've seen
> spinlock contention come from enough different places to be wary of
> arguments that start with "it must be happening because...".

Absolutely.  My evidence is circumstantial at best -- let's call it a
hunch.  I also do not think we are facing pure spinlock contention,
but something more complex which is a combination of spinlocks, the
free list lwlock, and the linux scheduler.  This problem showed up
going from RHEL 5->6 which brought a lot of scheduler changes.  A lot
of other things changed too, but the high sys cpu really suggests we
are getting some feedback from the scheduler.

> IMHO, the thing to do here is run perf record -g during one of the
> trouble periods.  The performance impact is quite low.  You could
> probably even set up a script that runs perf for five minute intervals
> at a time and saves all of the perf.data files.  When one of these
> spikes happens, grab the one that's relevant.

Unfortunately -- that's not on the table.  Dropping shared buffers to
2GB (thanks RhodiumToad) seems to have fixed the issue and there is
zero chance I will get approval to revert that setting in order to
force this to re-appear.  So far, I have not been able to reproduce in
testing.   By the way, this problem has popped up in other places too;
and the typical remedies are applied until it goes away :(.

> If you see that s_lock is where all the time is going, then you've
> proved it's a PostgreSQL spinlock rather than something in the kernel
> or a shared library.  If you can further see what's calling s_lock
> (which should hopefully be possible with perf -g), then you've got it
> nailed dead to rights.

Well, I don't think it's that simple.  So my plan of action is this:
1) Improvise a patch that removes one *possible* trigger for the
problem, or at least makes it much less likely to occur.  Also, in
real world cases where usage_count is examined N times before
returning a candidate buffer, the amount of overall spinlocking from
buffer allocation is reduced by approximately (N-1)/N (e.g., if the
sweep typically examines 20 buffers per allocation, that's ~95% fewer
header lock acquisitions).  Even though spin locking is cheap, it's
hard to argue with that...

2) Exhaustively performance test patch #1.  I think this is win-win
since SGB clock sweep loop is quite frankly relatively un-optimized. I
don't see how reducing the amount of locking could hurt performance
but I've been, uh, wrong about these types of things before.

3) If a general benefit without downside is shown from #2, I'll simply
advance the patch for the next CF and see how things shake out. If and
when I feel like there's a decent shot at getting accepted, I may go
through the motions of setting up a patched server in production and
attempting a larger shared_buffers setting again.  But that's a long way off.

merlin



Re: StrategyGetBuffer optimization, take 2

From
Jeff Janes
Date:
On Mon, Aug 5, 2013 at 8:49 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
> *) What I think is happening:
> I think we are again getting burned by getting de-scheduled while
> holding the free list lock. I've been chasing this problem for a long
> time now (for example, see:
> http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
> but not I've got a reproducible case.  What is happening this:
>
> 1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
> 2. call ReadBufferBI.  this goes down the chain until StrategyGetBuffer()
> 3. Lock free list, go into clock sweep loop
> 4. while holding clock sweep, hit 'hot' buffer, spin on it
> 5. get de-scheduled

It seems more likely to me that it got descheduled immediately after
obtaining the hot buffer header spinlock but before releasing it,
rather than while still spinning for it.


> 6. now enter the 'hot buffer spin lock lottery'
> 7. more/more backends pile on, linux scheduler goes bezerk, reducing
> chances of winning #6

Lots and lots of processes (which want to use the buffer, not to evict
it) are spinning for the lock held by the descheduled process on
the buffer header, and the scheduler goes berserk.  A bunch of other
processes are waiting on the relation extension lock, but are doing so
non-berserkly on the semaphore.

Each spinning process puts itself to sleep without having consumed its
slice once it hits spins_per_delay, and so the scheduler keeps
rescheduling them; rather than scheduling the one that is holding the
lock, which consumed its entire slice and so is on the low priority to
reschedule list.

That is my theory, of course.  I'm not sure if it leads to a different
course of action than the theory it has not yet obtained the header
lock.

Any idea what spins_per_delay has converged to?  There seems to be no
way to obtain this info, other than firing up the debugger.  I had a
patch somewhere that has every process elog(LOG,...) the value which
it fetches from shared memory immediately upon start-up.  It is a pain
to hunt down the patch, apply it, compile, and restart the server
every time I want this value.  Maybe something like this could be put
in core with a suitable condition, but I don't know what that
condition would be.  Or would creating a C function with a
pg_spins_per_delay() wrapper function which returns this value on
demand be the way to go?
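
Something like this minimal sketch is what I have in mind, assuming the
shared estimate kept in ProcGlobal (the value each backend copies at
startup) is the interesting one to expose; pg_spins_per_delay() is the
hypothetical wrapper, not an existing function:

#include "postgres.h"
#include "fmgr.h"
#include "storage/proc.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(pg_spins_per_delay);

Datum
pg_spins_per_delay(PG_FUNCTION_ARGS)
{
    /* current shared estimate; the per-backend copy in s_lock.c is static */
    PG_RETURN_INT32(ProcGlobal->spins_per_delay);
}

plus the obvious CREATE FUNCTION ... RETURNS int4 ... LANGUAGE C
declaration on the SQL side.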



> Is the analysis and the patch to fix the perceived problem plausible
> without breaking other stuff..  If so, I'm inclined to go further with
> this.   This is not the only solution on the table for high buffer
> contention, but IMNSHO it should get a lot of points for being very
> localized.  Maybe a reduced version could be tried retaining the
> freelist lock but keeping the 'trylock' on the buf header.

My concern is how we can ever move this forward.  If we can't recreate
it on a test system, and you probably won't be allowed to push
experimental patches to the production system....what's left?

Also, if the kernel is introducing new scheduling bottlenecks, are we
just playing whack-a-mole by trying to deal with it in PG code?

Stepping back a bit, have you considered a connection pooler to
restrict the number of simultaneously active connections?  It wouldn't
be the first time that that has alleviated the symptoms of these
high-RAM kernel bottlenecks.  (If we are going to play whack-a-mole,
might as well use a hammer that already exists and is well tested)


Cheers,

Jeff



Re: StrategyGetBuffer optimization, take 2

From
Jeff Janes
Date:
On Wed, Aug 7, 2013 at 7:40 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

> I agree; at least then it's not unambiguously better. if you (in
> effect) swap all contention on allocation from a lwlock to a spinlock
> it's not clear if you're improving things; it would have to be proven
> and I'm trying to keep things simple.
>
> Attached is a scaled down version of the patch that keeps the freelist
> lock but still removes the spinlock during the clock sweep.  This
> still hits the major objectives of reducing the chance of scheduling
> out while holding the BufFreelistLock and mitigating the worst case
> impact of doing so if it does happen.  An even more scaled down
> version would keep the current logic exactly as is except for
> replacing buffer lock in the clock sweep with a trylock (which is
> IMNSHO a no-brainer).

Since usage_count is unsigned, are you sure that changing the tests
from "buf->usage_count == 0" to "buf->usage_count <= 0" accomplishes
what you need it to?  If usage_count gets decremented when it already
zero, it will wrap around to 65,535, at least on some compilers some
of the time, won't it?

It seems safer just not to decrement if we can't get the lock.
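
As a standalone toy (not PostgreSQL code) to spell out why the "<= 0"
test buys nothing for an unsigned counter, and what the wrap looks like:

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint16_t usage_count = 0;          /* same width as the buffer header field */

    /* for an unsigned type, "<= 0" can only ever mean "== 0" */
    printf("%d\n", usage_count <= 0);  /* prints 1 */

    usage_count--;                     /* the lost-update race decrementing zero */
    printf("%u\n", usage_count);       /* prints 65535: the buffer now looks hot */
    return 0;
}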

Cheers,

Jeff



Re: StrategyGetBuffer optimization, take 2

From
Merlin Moncure
Date:
On Mon, Aug 19, 2013 at 5:02 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>
> My concern is how we can ever move this forward.  If we can't recreate
> it on a test system, and you probably won't be allowed to push
> experimental patches to the production system....what's left?
>
> Also, if the kernel is introducing new scheduling bottlenecks, are we
> just playing whack-a-mole by trying to deal with it in PG code?

I think I can rest on the basis that there is no good reason to sit
and spin on a buffer during clock sweep -- even if I have to whittle
the patch down to the '1 liner' variant that trylocks the buffer
header.  I also like the idea of getting rid of the freelist lock
completely but I think those ideas can be tested in isolation.  Really
though, the acid-test is going to be exhaustive performance testing.

> Stepping back a bit, have you considered a connection pooler to
> restrict the number of simultaneously active connections?  It wouldn't
> be the first time that that has alleviated the symptoms of these
> high-RAM kernel bottlenecks.  (If we are going to play whack-a-mole,
> might as well use a hammer that already exists and is well tested)

Certainly. It was the very first thing I suggested upon hearing about
this problem.  Unfortunately, when these kinds of things happen in
high-scale, high-load databases on serious hardware, improvising a
pgbouncer deployment is simply not possible without first going through
extensive performance and application compatibility testing.

merlin



Re: StrategyGetBuffer optimization, take 2

From
Merlin Moncure
Date:
On Mon, Aug 19, 2013 at 5:17 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Wed, Aug 7, 2013 at 7:40 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>
>> I agree; at least then it's not unambiguously better. if you (in
>> effect) swap all contention on allocation from a lwlock to a spinlock
>> it's not clear if you're improving things; it would have to be proven
>> and I'm trying to keep things simple.
>>
>> Attached is a scaled down version of the patch that keeps the freelist
>> lock but still removes the spinlock during the clock sweep.  This
>> still hits the major objectives of reducing the chance of scheduling
>> out while holding the BufFreelistLock and mitigating the worst case
>> impact of doing so if it does happen.  An even more scaled down
>> version would keep the current logic exactly as is except for
>> replacing buffer lock in the clock sweep with a trylock (which is
>> IMNSHO a no-brainer).
>
> Since usage_count is unsigned, are you sure that changing the tests
> from "buf->usage_count == 0" to "buf->usage_count <= 0" accomplishes
> what you need it to?  If usage_count gets decremented when it already
> zero, it will wrap around to 65,535, at least on some compilers some
> of the time, won't it?
>
> It seem safer just not to decrement if we can't get the lock.

Hurk -- well, maybe it should be changed to signed in this
implementation (adjustment w/o lock).

Safer maybe, but you lose a part of the optimization: not having to
spam cache line locks as you constantly spinlock your way around the
clock.  Maybe that doesn't matter much -- I'm inclined to test it both
ways and see -- plus maybe a 3rd variant that manages the freelist
itself with a spinlock as well.

variant A: buffer spinlock -> trylock (1 liner!)
variant B: as above, but usage_count manipulation occurs outside of lock
variant C: as above, but dispense with lwlock wholly or in part (plus
possibly stuff the freelist etc).

merlin



Re: StrategyGetBuffer optimization, take 2

From
Andres Freund
Date:
On 2013-08-19 15:17:44 -0700, Jeff Janes wrote:
> On Wed, Aug 7, 2013 at 7:40 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> 
> > I agree; at least then it's not unambiguously better. if you (in
> > effect) swap all contention on allocation from a lwlock to a spinlock
> > it's not clear if you're improving things; it would have to be proven
> > and I'm trying to keep things simple.
> >
> > Attached is a scaled down version of the patch that keeps the freelist
> > lock but still removes the spinlock during the clock sweep.  This
> > still hits the major objectives of reducing the chance of scheduling
> > out while holding the BufFreelistLock and mitigating the worst case
> > impact of doing so if it does happen.  An even more scaled down
> > version would keep the current logic exactly as is except for
> > replacing buffer lock in the clock sweep with a trylock (which is
> > IMNSHO a no-brainer).
> 
> Since usage_count is unsigned, are you sure that changing the tests
> from "buf->usage_count == 0" to "buf->usage_count <= 0" accomplishes
> what you need it to?  If usage_count gets decremented when it already
> zero, it will wrap around to 65,535, at least on some compilers some
> of the time, won't it?

Overflow of *unsigned* variables is actually defined and will always
wrap around. It's signed variables which don't have such a clear
behaviour.

Greetings,

Andres Freund

--
Andres Freund                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: StrategyGetBuffer optimization, take 2

From
Merlin Moncure
Date:
On Tue, Aug 20, 2013 at 1:57 AM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-08-19 15:17:44 -0700, Jeff Janes wrote:
>> On Wed, Aug 7, 2013 at 7:40 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
>>
>> > I agree; at least then it's not unambiguously better. if you (in
>> > effect) swap all contention on allocation from a lwlock to a spinlock
>> > it's not clear if you're improving things; it would have to be proven
>> > and I'm trying to keep things simple.
>> >
>> > Attached is a scaled down version of the patch that keeps the freelist
>> > lock but still removes the spinlock during the clock sweep.  This
>> > still hits the major objectives of reducing the chance of scheduling
>> > out while holding the BufFreelistLock and mitigating the worst case
>> > impact of doing so if it does happen.  An even more scaled down
>> > version would keep the current logic exactly as is except for
>> > replacing buffer lock in the clock sweep with a trylock (which is
>> > IMNSHO a no-brainer).
>>
>> Since usage_count is unsigned, are you sure that changing the tests
>> from "buf->usage_count == 0" to "buf->usage_count <= 0" accomplishes
>> what you need it to?  If usage_count gets decremented when it already
>> zero, it will wrap around to 65,535, at least on some compilers some
>> of the time, won't it?
>
> Overflow of *unsigned* variables is actually defined and will always
> wrap around. It's signed variables which don't have such a clear
> behaviour.


Hm, well, even better would be to leave things as they are and try to
guarantee that usage_count is updated via assignment vs increment;
that way it would be impossible to wander out of bounds.  I bet
changing:
i--; to i=(i-1);

isn't going to do much against modern compilers.  But what about
assignment from a volatile temporary?

volatile uint16 v = usage_count;   /* usage_count is a uint16 in the buffer header */
if (v > 0) v--;
usage_count = v;

something like that.  Or maybe declaring usage_count as volatile might
be enough?

merlin