On Fri, Sep 13, 2013 at 4:04 PM, Kevin Grittner <kgrittn@ymail.com> wrote:
> Andres Freund <andres@2ndquadrant.com> wrote:
>
>> Absolutely not claiming the contrary. I think it sucks that we
>> couldn't fully figure out what's happening in detail. I'd love to
>> get my hand on a setup where it can be reliably reproduced.
>
> I have seen two completely different causes for symptoms like this,
> and I suspect that these aren't the only two.
>
> (1) The dirty page avalanche: PostgreSQL hangs on to a large
> number of dirty buffers and then dumps a lot of them at once. The
> OS does the same. When PostgreSQL dumps its buffers to the OS it
> pushes the OS over a "tipping point" where it is writing dirty
> buffers too fast for the controller's BBU cache to absorb them.
> Everything freezes until the controller writes and accepts OS
> writes for a lot of data. This can take several minutes, during
> which time the database seems "frozen". Cure is some combination
> of these: reduce shared_buffers, make the background writer more
> aggressive, checkpoint more often, make the OS dirty page writing
> more aggressive, add more BBU RAM to the controller.
Yeah -- I've seen this too, and it's a well understood problem.
Getting o/s to spin dirty pages out faster is the name of the game I
think. Storage is getting so fast that it's (mostly) moot anyways.
Also, this is under the umbrella of 'high i/o' -- the stuff I've been
seeing is low- or no- I/o.
> (2) Transparent huge page support goes haywire on its defrag work.
> Clues on this include very high "system" CPU time during an
> episode, and `perf top` shows more time in kernel spinlock
> functions than anywhere else. The database doesn't completely lock
> up like with the dirty page avalanche, but it is slow enough that
> users often describe it that way. So far I have only seen this
> cured by disabling THP support (in spite of some people urging that
> just the defrag be disabled). It does make me wonder whether there
> is something we could do in PostgreSQL to interact better with
> THPs.
Ah, that's a useful tip; need to research that, thanks. Maybe Josh
might be able to give it a whirl...
merlin