Re: Two Necessary Kernel Tweaks for Linux Systems - Mailing list pgsql-performance

From Shaun Thomas
Subject Re: Two Necessary Kernel Tweaks for Linux Systems
Msg-id 50EC743E.1040209@optionshouse.com
In response to Re: Two Necessary Kernel Tweaks for Linux Systems  (Scott Marlowe <scott.marlowe@gmail.com>)
List pgsql-performance
On 01/08/2013 01:04 PM, Scott Marlowe wrote:

> Assembly language on the brain.  of course I meant NOOP.

Ok, in that case, these are completely separate things. For IO
scheduling, there's Completely Fair Queuing (CFQ), NOOP, Deadline, and
so on.
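
To see which IO scheduler a block device is actually using, Linux
exposes a per-device sysfs file where the active scheduler is shown in
square brackets. The following is only a minimal sketch: the device
name "sda" and the helper function names are hypothetical, and the list
of available schedulers differs between kernels.

```python
from pathlib import Path
from typing import Optional


def parse_active_scheduler(line: str) -> Optional[str]:
    """Return the bracketed (active) entry from a sysfs scheduler line.

    Example line: "noop deadline [cfq]".
    Returns None if no entry is bracketed.
    """
    for token in line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def active_scheduler(device: str, sysfs_root: str = "/sys/block") -> Optional[str]:
    """Read /sys/block/<device>/queue/scheduler and report the active one."""
    path = Path(sysfs_root) / device / "queue" / "scheduler"
    return parse_active_scheduler(path.read_text())


if __name__ == "__main__":
    # "sda" is a hypothetical device name; substitute your own.
    try:
        print(active_scheduler("sda"))
    except OSError as exc:
        print(f"could not read scheduler: {exc}")
```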

For process scheduling, at least recently, there's the Completely Fair
Scheduler (CFS) or nothing. So far as I can tell, there is no
alternative process scheduler. Just as I can't find an alternative
memory manager that I can tell to stop flushing my freaking active file
cache due to phantom memory pressure. ;)

The tweaks I was discussing in this thread effectively do two things:

1. Stop process grouping by TTY.

On servers, this grouping really is a net performance loss, especially
for heavily forked apps like PG. With grouping enabled, system % is
about 5% lower since the scheduler is doing less work, but at the cost
of less spreading across available CPUs. Our systems see a 30%
performance hit with grouping enabled; others may see more or less.

2. Less aggressive process scheduling.

The O(log N) scheduler heuristics collapse at high process counts for
some reason, causing the scheduler to spend more and more time planning
CPU assignments until it spirals completely out of control. I've seen
this behavior on kernels from 3.0 straight through 3.5, so it looks like
an inherent weakness of CFS. By increasing migration cost, we make the
scheduler do less work less often, so that weird 70+% system CPU spike
vanishes.

My guess is the increased migration cost basically offsets the point at
which the scheduler would freak out. I've tested up to 2000 connections,
and it responds fine, whereas before we were seeing flaky results as
early as 700 connections.
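
For anyone wanting to check where their own box stands, here is a rough
sketch that reads the current values of the two scheduler knobs. The
sysctl names are an assumption on my part (this message doesn't spell
them out): kernel.sched_autogroup_enabled for the TTY grouping, and
kernel.sched_migration_cost for the migration cost, which some later
kernels expose as sched_migration_cost_ns instead. A knob missing on
your kernel is reported as None.

```python
from pathlib import Path
from typing import Dict, Optional

# Assumed knob names; they are not given in this message and vary by
# kernel version.
KNOBS = (
    "sched_autogroup_enabled",
    "sched_migration_cost",
    "sched_migration_cost_ns",
)


def read_sched_knobs(root: str = "/proc/sys/kernel") -> Dict[str, Optional[int]]:
    """Return the current integer value of each knob, or None if unreadable."""
    values: Dict[str, Optional[int]] = {}
    for name in KNOBS:
        try:
            values[name] = int((Path(root) / name).read_text().strip())
        except (OSError, ValueError):
            # Knob is absent on this kernel, or not a plain integer.
            values[name] = None
    return values


if __name__ == "__main__":
    for name, value in read_sched_knobs().items():
        print(f"kernel.{name} = {value}")
```

This only reads; changing the values is a job for sysctl or
/etc/sysctl.conf, and should be tested against your own workload first.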

My guess as to why this is? I think it's due to VSZ as perceived by the
scheduler. To swap processes, it also has to preload L2 and L3 cache for
the assigned process. As the number of PG connections increases, all
with their own VSZ/RSS allocations, the scheduler has more thinking to
do. At the point where the sum of VSZ/RSS eclipses the amount of
available RAM, the scheduler loses nearly all decision-making ability
and craps its pants.
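
The "sum of VSZ/RSS" figure can be approximated by walking /proc and
adding up VmSize and VmRSS for every backend. This is a sketch under
assumptions: the process name "postgres" is a guess at what your
backends are called, and the helper names are made up for illustration.
Keep in mind that each backend's RSS includes the shared memory pages it
has touched, so the RSS total counts shared buffers many times over.

```python
import os
from typing import Dict, Tuple


def parse_status(text: str) -> Dict[str, str]:
    """Turn the contents of /proc/<pid>/status into a field dictionary."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def kb(fields: Dict[str, str], key: str) -> int:
    """Return a 'NNN kB' field as an int; 0 if absent (e.g. kernel threads)."""
    return int(fields.get(key, "0 kB").split()[0])


def total_vsz_rss_kb(proc_name: str = "postgres",
                     proc_root: str = "/proc") -> Tuple[int, int]:
    """Sum VmSize and VmRSS (in kB) over all processes named proc_name."""
    vsz = rss = 0
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                fields = parse_status(fh.read())
        except OSError:
            continue  # process exited while we were scanning
        if fields.get("Name") == proc_name:
            vsz += kb(fields, "VmSize")
            rss += kb(fields, "VmRSS")
    return vsz, rss


if __name__ == "__main__":
    try:
        vsz, rss = total_vsz_rss_kb()
        print(f"total VSZ: {vsz} kB, total RSS: {rss} kB")
    except OSError as exc:
        print(f"could not scan /proc: {exc}")
```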

This would also explain why I'm seeing something similar with memory.
At high connection counts, %used is fine and we have over 40GB free for
caching, but VSZ/RSS are both way bigger than available cache, so
memory pressure causes kswapd to continuously purge the active cache
pool into inactive, and inactive into free, all while the device
attempts to fill the active pool. It's an IO feedback loop, and it
starts around the same number of connections that used to make the
process scheduler die. Too much of a coincidence, in my opinion.
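
The active/inactive/free churn described above can be watched from
/proc/meminfo. The sketch below just pulls out the file-cache counters
so they can be sampled over time; the function names and the choice of
fields are mine, not anything official, and it only observes the
behavior rather than fixing it.

```python
from typing import Dict

# Fields of interest, all reported by /proc/meminfo in kB.
FIELDS = ("MemFree", "Active(file)", "Inactive(file)")


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo contents into {field name: integer value}."""
    info: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            info[key.strip()] = int(parts[0])
    return info


def cache_snapshot(path: str = "/proc/meminfo") -> Dict[str, int]:
    """Return only the free and file-cache counters from meminfo."""
    with open(path) as fh:
        info = parse_meminfo(fh.read())
    return {name: info[name] for name in FIELDS if name in info}


if __name__ == "__main__":
    try:
        for name, value in cache_snapshot().items():
            print(f"{name}: {value} kB")
    except OSError as exc:
        print(f"could not read meminfo: {exc}")
```

Sampling this every few seconds while ramping up connections should
show Active(file) draining into Inactive(file) and MemFree if the
feedback loop is happening.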

But unlike the process scheduler, there are no good knobs to turn that
will fix the memory manager's behavior. At least, not in 3.0, 3.2, or
3.4 kernels.

But I freely admit I'm just speculating based on observed behavior. I
know neither jack, nor squat about internal kernel mechanics. Anyone who
actually *isn't* talking out of his ass is free to interject. :)

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com


