Re: High SYS CPU - need advise

From Jeff Janes
Subject Re: High SYS CPU - need advise
Date
Msg-id CAMkU=1zvjECKSa18rHUQ1RJ98zRwk_xeAtTuwO8Ak0YaeB1B=Q@mail.gmail.com
In response to Re: High SYS CPU - need advise  (Merlin Moncure <mmoncure@gmail.com>)
Responses Re: High SYS CPU - need advise  (Merlin Moncure <mmoncure@gmail.com>)
List pgsql-general
On Tue, Nov 20, 2012 at 12:00 PM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Tue, Nov 20, 2012 at 12:16 PM, Jeff Janes <jeff.janes@gmail.com> wrote:
>>
>> The freelist should never loop.  It is written as a loop, but I think
>> there is currently no code path which ends up with valid buffers being
>> on the freelist, so that loop will never, or at least rarely, execute
>> more than once.
>>
>>> Both of those operations are
>>> dependent on the number of buffers being managed and so it's
>>> reasonable to expect some workloads to increase contention with more
>>> buffers.
>>
>> The clock sweep can depend on the number of buffers being managed in a
>> worst-case sense, but I've never seen any evidence (nor analysis) that
>> this worst case can be obtained in reality on an ongoing basis.  By
>> constructing two pathological workloads which get switched between, I
>> can get the worst-case to happen, but when it does happen the
>> consequences are mild compared to the amount of time needed to set up
>> the necessary transition.  In other words, the worst-case can't be
>> triggered often enough to make a meaningful difference.
>
> Yeah, good points; but (getting off topic here): there have been
> several documented cases of lowering shared buffers improving
> performance under contention... the 'worst case' might be happening
> more than expected.

The ones that I am aware of (mostly Greg Smith's case studies) have
been for write-intensive workloads and are related to writes/fsyncs
getting gummed up.

Shaun Thomas reports one that is (I assume) not write intensive, but
his diagnosis is that this is a kernel bug where a larger
shared_buffers, for no good reason, causes the kernel to kill off its
page cache.  From the kernel's perspective, the freelist lock doesn't
look any different from any other lwlock, so I doubt that issue is
related to the freelist lock.

> In particular, what happens when a substantial
> percentage of the buffer pool is set with a non-zero usage count?

The current clock sweep algorithm is an extraordinary usage-count
decrementing machine.  From what I know, the only way to get much more
than half of the buffers to have a non-zero usage count is for the
clock sweep to rarely run (in which case it can hardly be the
bottleneck), or for most of the buffer cache to be
pinned simultaneously.
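
To make that concrete, here is a toy standalone simulation of the
sweep (my own sketch, not PostgreSQL code; NBUF, HITS_PER_MISS, and
OPS are made-up parameters, and the usage cap of 5 matches
PostgreSQL's BM_MAX_USAGE_COUNT).  It runs a uniform random workload
and prints the steady-state fraction of buffers left with a non-zero
usage count:

    #include <stdio.h>
    #include <stdlib.h>

    #define NBUF          100000   /* buffers in the pool (made up) */
    #define MAXUSE        5        /* usage_count cap, as in PostgreSQL */
    #define HITS_PER_MISS 9        /* assume a 90% buffer hit rate */
    #define OPS           5000000  /* accesses to simulate */

    static int usage[NBUF];
    static int hand;

    /* Advance the hand to the next usage_count of 0, decrementing
     * every non-zero count it passes: the "decrementing machine". */
    static int sweep(void)
    {
        for (;;)
        {
            int b = hand;
            hand = (hand + 1) % NBUF;
            if (usage[b] == 0)
                return b;
            usage[b]--;
        }
    }

    int main(void)
    {
        long i;
        int  nonzero = 0;

        for (i = 0; i < OPS; i++)
        {
            if (i % (HITS_PER_MISS + 1) == 0)
                usage[sweep()] = 1;     /* miss: evict, new page at 1 */
            else
            {
                int b = rand() % NBUF;  /* hit: bump the usage count */
                if (usage[b] < MAXUSE)
                    usage[b]++;
            }
        }
        for (i = 0; i < NBUF; i++)
            nonzero += (usage[i] > 0);
        printf("non-zero usage_count: %.1f%% of buffers\n",
               100.0 * nonzero / NBUF);
        return 0;
    }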

> This seems unlikely, but possible?  Take note:
>
>                 if (buf->refcount == 0)
>                 {
>                         if (buf->usage_count > 0)
>                         {
>                                 buf->usage_count--;
>                                 trycounter = NBuffers;  /* emphasis */
>                         }
>
> ISTM time spent here isn't bounded except that as more time is spent
> sweeping (more backends are thus waiting and not marking pages) the
> usage counts decrease faster until you hit steady state.

But that is a one-time thing.  Once you hit the steady state, how do
you get away from it again, such that a large amount of work is needed
again?

> Smaller
> buffer pool naturally would help in that scenario as your usage counts
> would drop faster.

They would drop at the same rate in absolute numbers, barring the
smaller buffer cache fitting entirely in the on-board CPU cache.

They would drop faster in percentage terms, but they would also
increase faster in percentage terms once a candidate is found and a
new page read into it.
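
To put made-up numbers on it: a sweep that decrements 10,000 usage
counts per second takes only 1% per second off a pool of 1,000,000
buffers but 10% per second off a pool of 100,000; the smaller pool's
counts drop faster only as a fraction of the pool, and each new page
read in pushes that fraction back up correspondingly faster.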

> It strikes me as cavalier to be resetting
> trycounter while sitting under the #1 known contention point for read
> only workloads.

The only use for the trycounter is to know when to ERROR out with "no
unpinned buffers available", so not resetting that seems entirely
wrong.
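
For anyone reading along without the source handy, here is a
compilable toy model of that victim-search loop, paraphrased from my
memory of src/backend/storage/buffer/freelist.c (so treat it as a
sketch rather than the real thing).  trycounter exists only to notice
a full revolution in which every buffer was pinned:

    #include <stdio.h>
    #include <stdlib.h>

    #define NBuffers 16

    typedef struct { int refcount; int usage_count; } BufferDesc;

    static BufferDesc buffers[NBuffers];
    static int nextVictimBuffer;

    static BufferDesc *get_victim(void)
    {
        int trycounter = NBuffers;

        for (;;)
        {
            BufferDesc *buf = &buffers[nextVictimBuffer];

            nextVictimBuffer = (nextVictimBuffer + 1) % NBuffers;

            if (buf->refcount == 0)
            {
                if (buf->usage_count > 0)
                {
                    buf->usage_count--;
                    trycounter = NBuffers;  /* the reset in question */
                }
                else
                    return buf;    /* unpinned, usage 0: our victim */
            }
            else if (--trycounter == 0)
            {
                /* a full revolution saw nothing but pinned buffers */
                fprintf(stderr, "no unpinned buffers available\n");
                exit(1);
            }
        }
    }

    int main(void)
    {
        buffers[0].usage_count = 3;   /* pretend buffer 0 is popular */
        printf("victim: buffer %d\n", (int) (get_victim() - buffers));
        return 0;
    }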

I would contest "the #1 known contention point" claim.  We know that
the freelist lock is a point of contention under certain conditions,
but we (or at least I) also know that it is the mere acquisition of
this lock, and not the work done while it is held, that is important.

If I add a spurious
"LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE); LWLockRelease(BufFreelistLock);"
to each execution of StrategyGetBuffer, then contention kicks in twice as fast.

But when I instead hacked the clock sweep to run twice as far (ignore
the first eligible buffer it finds, and go find another one), all under
the cover of a single BufFreelistLock acquisition, there was no
meaningful increase in contention.
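
Applied to the toy loop above, only the usage_count == 0 branch
changes; skip_first is a name I am making up (a local set to 1 at the
top of each call), since the real hack is long gone:

                else if (skip_first)
                    skip_first = 0;  /* ignore the first eligible buffer... */
                else
                    return buf;      /* ...and take the next one we find */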

This was all on a 4-socket, 2-core/socket Opteron machine which I no
longer have access to.  On a more modern 8-core, single-socket machine,
I can't get it to collapse on BufFreelistLock at all, presumably
because the cache coherence mechanisms are so much faster.

>  Shouldn't SBF() work on an advisory basis and try to
> force a buffer after N failed usage count attempts?

I believe Simon tried that a couple commit-fests ago, and no one could
show that it made a difference.

Cheers,

Jeff

