Thread: Two Necessary Kernel Tweaks for Linux Systems

Two Necessary Kernel Tweaks for Linux Systems

From
Shaun Thomas
Date:
Hey everyone!

After much testing and hair-pulling, we've confirmed two kernel settings
that should always be modified in production Linux systems. Especially
new ones with the completely fair scheduler (CFS) as opposed to the O(1)
scheduler.

If you want to follow along, these are:

/proc/sys/kernel/sched_migration_cost
/proc/sys/kernel/sched_autogroup_enabled

Which correspond to sysctl settings:

kernel.sched_migration_cost
kernel.sched_autogroup_enabled
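
If you want to check your current values first, something like this works (just a sketch; on some newer kernels the migration-cost knob may be named sched_migration_cost_ns instead):

# Read the current values straight from /proc
cat /proc/sys/kernel/sched_migration_cost
cat /proc/sys/kernel/sched_autogroup_enabled

# Or the same thing via sysctl
sysctl kernel.sched_migration_cost kernel.sched_autogroup_enabled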

What do these settings do?
--------------------------

* sched_migration_cost

The migration cost is the total time the scheduler will consider a
migrated process "cache hot" and thus less likely to be re-migrated. By
default, this is 0.5ms (500000 ns); as the size of the process table
increases, this eventually causes the scheduler to break down. On our
systems, after a smooth degradation with increasing connection count,
system CPU spiked from 20% to 70% sustained and TPS was cut by 5-10x once
we crossed some invisible connection-count threshold. For us, that was a
pgbench with 900 or more clients.

The migration cost should be increased, almost universally on server
systems with many processes. This means systems like PostgreSQL or
Apache would benefit from having higher migration costs. We've had good
luck with a setting of 5ms (5000000 ns) instead.
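
For example, to apply that at runtime (a sketch; needs root, and the change does not survive a reboot):

# Raise the migration cost from the 0.5ms default to 5ms (values in ns)
echo 5000000 > /proc/sys/kernel/sched_migration_cost
# or equivalently:
sysctl -w kernel.sched_migration_cost=5000000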

When the breakdown occurs, system CPU (as obtained from sar) increases
from 20% on a heavy pgbench (scale 3500 on a 72GB system) to over 70%,
and %nice/%user is cut by half or more. A higher migration cost
essentially eliminates this artificial throttle.

* sched_autogroup_enabled

This is a relatively new patch which Linus lauded back in late 2010. It
basically groups tasks by TTY so perceived responsiveness is improved.
But on server systems, large daemons like PostgreSQL are going to be
launched from the same pseudo-TTY, and be effectively choked out of CPU
cycles in favor of less important tasks.

The default setting is 1 (enabled) on some platforms. By setting this to
0 (disabled), we saw an outright 30% performance boost on the same
pgbench test. A fully cached scale 3500 database on a 72GB system went
from 67k TPS to 82k TPS with 900 client connections.
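
To turn autogrouping off and make both settings stick across reboots, something like this should do it (a sketch; assumes the usual /etc/sysctl.conf mechanism on your distro):

# Disable TTY-based autogrouping at runtime
sysctl -w kernel.sched_autogroup_enabled=0

# Persist both settings
cat >> /etc/sysctl.conf <<'EOF'
kernel.sched_migration_cost = 5000000
kernel.sched_autogroup_enabled = 0
EOF
sysctl -p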

Total Benefit
-------------

At higher connection counts, such as on systems that can't use pooling or
that make extensive use of prepared queries, these settings can massively
affect performance. At 900 connections, our test systems were at 17k TPS
unaltered, but 85k TPS after these two modifications. Even with this
performance boost, we still had 40% CPU free instead of 0%. In effect,
the performance of the new O(log N) scheduler is returned to normal
under large process tables.

Some systems will have a higher "cracking" point than others. The effect
is amplified when a system is under high memory pressure, hence a lot of
expensive queries on a high number of concurrent connections is the
easiest way to replicate these results.

Admins migrating from older systems (RHEL 5.x) may find this especially
shocking, because the old O(1) scheduler was too "stupid" to have these
advanced features, hence it was impossible to cause this kind of behavior.

There's probably still a little room for improvement here, since 30-40%
CPU is still unclaimed in our larger tests. I'd like to see the total
performance drop (from an ideal 175k TPS at 24 connections) decreased. But these
kernel tweaks are rarely discussed anywhere, it seems. There doesn't
seem to be any consensus on how these (and other) scheduler settings
should be modified under different usage scenarios.

I just figured I'd share, since we found this info so beneficial.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com



Re: Two Necessary Kernel Tweaks for Linux Systems

From
Richard Neill
Date:
Dear Shaun,

Thanks for that - it's really interesting to know.

On 02/01/13 21:46, Shaun Thomas wrote:
> Hey everyone!
>
> After much testing and hair-pulling, we've confirmed two kernel
> settings that should always be modified in production Linux systems.
> Especially new ones with the completely fair scheduler (CFS) as
> opposed to the O(1) scheduler.

Does it apply to all types of production system, or just to certain
workloads?

For example, what happens when there are only one or two concurrent
processes?  (i.e. there are always several more CPU cores than there are
actual connections).


> * sched_autogroup_enabled
>
> This is a relatively new patch which Linus lauded back in late 2010.
> It basically groups tasks by TTY so perceived responsiveness is
> improved. But on server systems, large daemons like PostgreSQL are
> going to be launched from the same pseudo-TTY, and be effectively
> choked out of CPU cycles in favor of less important tasks.


I've got several production servers using Postgres, and I'd like to
squeeze a bit more performance out of them. In all cases, one (sometimes
two) CPU cores are occasionally maxed out, but there are always several
cores permanently idling. So does this apply here?

Thanks for your advice,

Richard


Re: Two Necessary Kernel Tweaks for Linux Systems

From
Merlin Moncure
Date:
On Wed, Jan 2, 2013 at 3:46 PM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> Hey everyone!
>
> After much testing and hair-pulling, we've confirmed two kernel settings
> that should always be modified in production Linux systems. Especially new
> ones with the completely fair scheduler (CFS) as opposed to the O(1)
> scheduler.

[cut]

> I just figured I'd share, since we found this info so beneficial.

This is fantastic info.

Vlad, you might want to check this out and see if it has any impact in
your high cpu case...via:
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html

merlin


Re: Two Necessary Kernel Tweaks for Linux Systems

From
Andrea Suisani
Date:
On 01/02/2013 10:46 PM, Shaun Thomas wrote:
> Hey everyone!
>
> After much testing and hair-pulling, we've confirmed two kernel settings that
> should always be modified in production Linux systems. Especially new ones with
> the completely fair scheduler (CFS) as opposed to the O(1) scheduler.

[cut]

> I just figured I'd share, since we found this info so beneficial.

I just want to confirm that on our relatively small
test server these tweaks give us a 25% performance boost!

Really appreciated Shaun.

thanks
Andrea




Re: Two Necessary Kernel Tweaks for Linux Systems

From
Andrea Suisani
Date:
On 01/08/2013 09:29 AM, Andrea Suisani wrote:
> On 01/02/2013 10:46 PM, Shaun Thomas wrote:
>> Hey everyone!
>>
>> After much testing and hair-pulling, we've confirmed two kernel settings that
>> should always be modified in production Linux systems. Especially new ones with
>> the completely fair scheduler (CFS) as opposed to the O(1) scheduler.
>
> [cut]
>
>> I just figured I'd share, since we found this info so beneficial.
>
> I just want to confirm that on our relatively small
> test server that tweaks give us a 25% performance boost!

12.5% sorry for the typo...


> Really appreciated Shaun.
>
> thanks
> Andrea
>
>
>
>



Re: Two Necessary Kernel Tweaks for Linux Systems

From
"Midge Brown"
Date:
The kernel on our Linux system doesn't appear to have these two settings according to the list provided by sysctl -a. Please pardon my ignorance, but should I add them?
 
We have Postgresql 9.0 on Linux 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
 
Thanks,
Midge
 
----- Original Message -----
Sent: Wednesday, January 02, 2013 1:46 PM
Subject: [PERFORM] Two Necessary Kernel Tweaks for Linux Systems

Hey everyone!

After much testing and hair-pulling, we've confirmed two kernel settings
that should always be modified in production Linux systems.

[cut]

I just figured I'd share, since we found this info so beneficial.

Re: Two Necessary Kernel Tweaks for Linux Systems

From
Shaun Thomas
Date:
On 01/08/2013 12:25 PM, Midge Brown wrote:

> The kernel on our Linux system doesn't appear to have these two
> settings according to the list provided by sysctl -a. Please pardon
> my ignorance, but should I add them?

Sorry if I wasn't more clear. These only apply to Linux systems with the
Completely Fair Scheduler, as opposed to the O(1) scheduler. For all
intents and purposes, this means 3.0 kernels and above.

With a 2.6 kernel, you're fine.

Effectively these changes fix what is basically a performance regression
compared to older kernels.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com



Re: Two Necessary Kernel Tweaks for Linux Systems

From
Scott Marlowe
Date:
On Tue, Jan 8, 2013 at 11:28 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 01/08/2013 12:25 PM, Midge Brown wrote:
>
>> The kernel on our Linux system doesn't appear to have these two
>> settings according to the list provided by sysctl -a. Please pardon
>> my ignorance, but should I add them?
>
>
> Sorry if I wasn't more clear. These only apply to Linux systems with the
> Completely Fair Scheduler, as opposed to the O(1) scheduler. For all intents
> and purposes, this means 3.0 kernels and above.
>
> With a 2.6 kernel, you're fine.
>
> Effectively these changes fix what is basically a performance regression
> compared to older kernels.

What's the comparison of these settings versus say going to the NOP scheduler?


Re: Two Necessary Kernel Tweaks for Linux Systems

From
Shaun Thomas
Date:
On 01/08/2013 12:31 PM, Scott Marlowe wrote:

> What's the comparison of these settings versus say going to the NOP
> scheduler?

Assuming you actually meant NOP and not the NOOP I/O scheduler, I don't
know. These CPU scheduler tweaks are all I could dig up, and googling
for NOP by itself or combined with Linux terms is tremendously unhelpful.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com



Re: Two Necessary Kernel Tweaks for Linux Systems

From
Scott Marlowe
Date:
On Tue, Jan 8, 2013 at 11:36 AM, Shaun Thomas <sthomas@optionshouse.com> wrote:
> On 01/08/2013 12:31 PM, Scott Marlowe wrote:
>
>> What's the comparison of these settings versus say going to the NOP
>> scheduler?
>
>
> Assuming you actually meant NOP and not the NOOP I/O scheduler, I don't
> know. These CPU scheduler tweaks are all I could dig up, and googling for
> NOP by itself or combined with Linux terms is tremendously unhelpful.

Assembly language on the brain.  Of course I meant NOOP.


Re: Two Necessary Kernel Tweaks for Linux Systems

From
Shaun Thomas
Date:
On 01/08/2013 01:04 PM, Scott Marlowe wrote:

> Assembly language on the brain.  of course I meant NOOP.

Ok, in that case, these are completely separate things. For IO
scheduling, there's Completely Fair Queuing (CFQ), NOOP, Deadline, and
so on.

For process scheduling, at least recently, there's Completely Fair
Scheduler or nothing. So far as I can tell, there is no alternative
process scheduler. Just as I can't find an alternative memory manager
that I can tell to stop flushing my freaking active file cache due to
phantom memory pressure. ;)

The tweaks I was discussing in this thread effectively do two things:

1. Stop process grouping by TTY.

On servers, this really is a net performance loss. Especially on heavily
forked apps like PG. System % is about 5% lower since the scheduler is
doing less work, but at the cost of less spreading across available
CPUs. Our systems see a 30% performance hit with grouping enabled,
others may see more or less.

2. Less aggressive process scheduling.

The O(log N) scheduler heuristics collapse at high process counts for
some reason, causing the scheduler to spend more and more time planning
CPU assignments until it spirals completely out of control. I've seen
this behavior on 3.0 kernels straight to 3.5, so it looks like an
inherent weakness of CFS. By increasing migration cost, we make the
scheduler do less work less often, so that weird 70+% system CPU spike
vanishes.

My guess is the increased migration cost basically offsets the point at
which the scheduler would freak out. I've tested up to 2000 connections,
and it responds fine, whereas before we were seeing flaky results as
early as 700 connections.

My guess as to why this is? I think it's due to VSZ as perceived by the
scheduler. To swap processes, it also has to preload L2 and L3 cache for
the assigned process. As the number of PG connections increase, all with
their own VSZ/RSS allocations, the scheduler has more thinking to do. At
a point when the sum of VSZ/RSS eclipses the amount of available RAM,
the scheduler loses nearly all decision-making ability and craps its pants.

This would also explain why I'm seeing something similar with memory. At
high connection counts, even though %used is fine and we have over 40GB
free for caching, VSZ/RSS are both way bigger than the available cache,
so memory pressure causes kswapd to continuously purge the active cache
pool into inactive, and inactive into free, all while the device
attempts to refill the active pool. It's an IO feedback loop, and it
kicks in at around the same number of connections that used to make the
process scheduler die. Too much of a coincidence, in my opinion.

But unlike the process scheduler, there are no good knobs to turn that
will fix the memory manager's behavior. At least, not in 3.0, 3.2, or
3.4 kernels.

But I freely admit I'm just speculating based on observed behavior. I
know neither jack, nor squat about internal kernel mechanics. Anyone who
actually *isn't* talking out of his ass is free to interject. :)

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com



Re: Two Necessary Kernel Tweaks for Linux Systems

From
AJ Weber
Date:
When I checked, both of these settings exist on my CentOS 6.x host
(2.6.32-279.5.1.el6.x86_64).

However, autogroup_enabled was already set to 0.  (The migration_cost
was set to the 0.5ms default noted in the OP.)  So I don't know if this
is strictly limited to kernel 3.0.

Is there an "easy" way to tell what scheduler my OS is using?

-AJ


On 1/8/2013 2:32 PM, Shaun Thomas wrote:
> On 01/08/2013 01:04 PM, Scott Marlowe wrote:
>
>> Assembly language on the brain.  of course I meant NOOP.
>
> Ok, in that case, these are completely separate things.

[cut]


Re: Two Necessary Kernel Tweaks for Linux Systems

From
Shaun Thomas
Date:
On 01/08/2013 02:05 PM, AJ Weber wrote:

> Is there an "easy" way to tell what scheduler my OS is using?

Unfortunately not. I looked again, and it seems that CFS was merged into
2.6.23. Anything before that is probably safe, but the vendor may have
backported it. If you don't see the settings I described, you probably
don't have it.
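
A quick way to check is to look at the kernel version and whether the knobs exist at all (a sketch):

# CFS was merged in 2.6.23, so anything newer likely has it
uname -r

# If these files exist, the tunables discussed in this thread are available
ls /proc/sys/kernel/sched_migration_cost /proc/sys/kernel/sched_autogroup_enabled 2>/dev/null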

So I guess Midge had 2.6.18, which predates the merge in 2.6.23.

I honestly don't understand the Linux kernel sometimes. A process
scheduler swap is a *gigantic* functional change, and it's in a dot
release. I vastly prefer PostgreSQL's approach...

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com



Re: Two Necessary Kernel Tweaks for Linux Systems

From
Alan Hodgson
Date:
On Tuesday, January 08, 2013 03:48:38 PM Shaun Thomas wrote:
> On 01/08/2013 02:05 PM, AJ Weber wrote:
> > Is there an "easy" way to tell what scheduler my OS is using?
>
> Unfortunately not. I looked again, and it seems that CFS was merged into
> 2.6.23. Anything before that is probably safe, but the vendor may have
> backported it. If you don't see the settings I described, you probably
> don't have it.
>
> So I guess Midge had 2.6.18, which predates the merge in 2.6.23.
>
> I honestly don't understand the Linux kernel sometimes. A process
> scheduler swap is a *gigantic* functional change, and it's in a dot
> release. I vastly prefer PostgreSQL's approach...

Red Hat also selectively backports major functionality into their enterprise
kernels. If you're running RHEL or a clone like CentOS, the reported kernel
version has little bearing on what may or may not be in your kernel.

They're very well tested and stable, so there's nothing wrong with them per
se, but you can't just say, "oh, you have version xxx, so you don't have this
functionality."




Re: Two Necessary Kernel Tweaks for Linux Systems

From
Henri Philipps
Date:
Hi,

we also hit this performance barrier a while ago, when migrating a
database on a big server (48-core Opteron, 512GB RAM) from kernel
2.6.32 to 3.2 (both kernels from Debian packages). The system load was
getting very high, as you also observed (I don't know the exact numbers
right now).

After some investigation I found out that the reason for the high
system load was that the postgresql processes were migrating from core
to core at very high rates. So the behaviour of the CFS scheduler must
have changed in this regard between the 2.6.32 and 3.2 kernels.

You can easily see this if you look at how much time the migration
kernel threads spend on the CPU (ps ax | grep migration). A look into
/proc/sched_debug can also give you some more insight into the
scheduler's behaviour.
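
For example (a quick sketch), to see how busy the per-CPU migration kernel threads are and to dump the scheduler's internal state:

# Accumulated CPU time of the migration kernel threads (one per core)
ps ax | grep '\[migration' | grep -v grep

# Per-CPU runqueue and scheduling statistics
less /proc/sched_debug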

On NUMA systems the scheduler tries to migrate processes to the nodes
on which they have the best memory locality. But on a big database, one
process is typically reading randomly from a dataset that is spread
across all nodes. On newer kernels the CFS scheduler seems to try more
aggressively to migrate processes to other cores; I don't know if it
is for better load balancing or for better memory locality. But
process migrations consume a lot of resources.

I had to change sched_migration_cost from 500000 (0.5ms) to 100000000
(100ms). This means the scheduler only considers a task for migration
if it has been running for at least 100ms instead of 0.5ms. This solved
the problem for us - the migration kernel threads didn't have to do
much work anymore, and the system load went down again.

A general problem is that the CFS scheduler changes a lot between
kernel versions, so it is really hard to predict which regressions you
will hit when going to another kernel version. Scheduling on NUMA
systems is also very complex.

An interesting dissertation showing the inconsistent behaviour of the
CFS scheduler:
http://research.cs.wisc.edu/adsl/Publications/meehean-thesis11.pdf

Some parameters that could also be considered for systematic benchmarking are:

sched_latency_ns
sched_min_granularity_ns

I would guess that higher values could also improve performance on
systems with many cores and many connections.
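
If anyone wants to try them, reading and raising them is straightforward (a sketch; the right values would have to come out of systematic benchmarking):

# Current values, in nanoseconds
sysctl kernel.sched_latency_ns kernel.sched_min_granularity_ns

# Illustrative experiment: raise both 10x for a benchmark run
cur=$(sysctl -n kernel.sched_latency_ns)
sysctl -w kernel.sched_latency_ns=$((cur * 10))
cur=$(sysctl -n kernel.sched_min_granularity_ns)
sysctl -w kernel.sched_min_granularity_ns=$((cur * 10))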

Thanks for starting this interesting thread!

Henri


Re: Two Necessary Kernel Tweaks for Linux Systems

From
Shaun Thomas
Date:
On 01/10/2013 02:51 AM, Henri Philipps wrote:

> http://research.cs.wisc.edu/adsl/Publications/meehean-thesis11.pdf

Wow, that was pretty interesting. It looks like for servers, the O(1)
scheduler is much better even with the assignment bug he identified, and
BFS responds better to varying load than CFS.

It's too bad the paper is so old and only considers the 2.6 kernel. I'd
love to see this type of research applied to the latest.

> sched_latency_ns
> sched_min_granularity_ns
>
> I guess that higher numbers could improve performance too on systems
> with many cores and many connections.

I messed around with these a bit. Values 10x smaller and 10x larger
didn't do anything appreciable that I noticed. Performance metrics were
within the variance of my earlier tests. Only autogrouping and migration
cost had any appreciable effect.

I'm glad we weren't the only ones who ran into this, too. You settled on
a much higher setting than we did, but the end result was the same. I
wonder how prevalent this will become as more servers are switched over
to newer kernels in the next couple of years. Hopefully more people
start complaining so they fix it. :)

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-676-8870
sthomas@optionshouse.com



Re: Two Necessary Kernel Tweaks for Linux Systems

From
Boszormenyi Zoltan
Date:
On 2013-01-08 22:48, Shaun Thomas wrote:
> On 01/08/2013 02:05 PM, AJ Weber wrote:
>
>> Is there an "easy" way to tell what scheduler my OS is using?
>
> Unfortunately not. I looked again, and it seems that CFS was merged into 2.6.23.
> Anything before that is probably safe, but the vendor may have backported it. If you
> don't see the settings I described, you probably don't have it.
>
> So I guess Midge had 2.6.18, which predates the merge in 2.6.23.
>
> I honestly don't understand the Linux kernel sometimes. A process scheduler swap is a
> *gigantic* functional change, and it's in a dot release. I vastly prefer PostgreSQL's
> approach...

The kernel version numbering is different.
A point release in the 2.6 series is 2.6.x.y.
This changed with 3.x, where a point release is 3.x.y.

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt, Austria
Web: http://www.postgresql-support.de
      http://www.postgresql.at/



autovacuum fringe case?

From
AJ Weber
Date:
I have a server that is IO-bound right now (it's 4 cores, and top
indicates the use rarely hits 25%, but the Wait spikes above 25-40%
regularly).  The server is running postgresql 9.0 and tomcat 6.  As I
have mentioned in a previous thread, I can't alter the hardware to add
disks unfortunately, so I'm going to try and move postgresql off this
application server to its own host, but this is a production
environment, so in the meantime...

Is it possible that some spikes in IO could be attributable to the
autovacuum process?  Is there a way to check this theory?

Would it be advisable (or even permissible to try/test) to disable
autovacuum, and schedule a manual vacuumdb in the middle of the night,
when this server is mostly-idle?

Thanks for any tips.  I'm in a bit of a jam with my limited hardware.

-AJ



Re: autovacuum fringe case?

From
Evgeniy Shishkin
Date:



On 23.01.2013, at 20:53, AJ Weber <aweber@comcast.net> wrote:

> I have a server that is IO-bound right now (it's 4 cores, and top indicates the use rarely hits 25%, but the Wait
> spikes above 25-40% regularly).  The server is running postgresql 9.0 and tomcat 6.  As I have mentioned in a previous
> thread, I can't alter the hardware to add disks unfortunately, so I'm going to try and move postgresql off this
> application server to its own host, but this is a production environment, so in the meantime...
>
> Is it possible that some spikes in IO could be attributable to the autovacuum process?  Is there a way to check this
theory?
>

Try iotop
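
For example (a sketch; iotop needs root), to watch only processes that are actually doing I/O and see whether autovacuum shows up:

# -o: only show active I/O, -b: batch output, -n 5: take five samples
iotop -o -b -n 5 | grep -i postgres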

> Would it be advisable (or even permissible to try/test) to disable autovacuum, and schedule a manual vacuumdb in the
> middle of the night, when this server is mostly-idle?
>
> Thanks for any tips.  I'm in a bit of a jam with my limited hardware.
>
> -AJ


Re: autovacuum fringe case?

From
Jeff Janes
Date:
On Wed, Jan 23, 2013 at 8:53 AM, AJ Weber <aweber@comcast.net> wrote:
> I have a server that is IO-bound right now (it's 4 cores, and top indicates
> the use rarely hits 25%, but the Wait spikes above 25-40% regularly).

How long do the spikes last?

> The
> server is running postgresql 9.0 and tomcat 6.  As I have mentioned in a
> previous thread, I can't alter the hardware to add disks unfortunately, so
> I'm going to try and move postgresql off this application server to its own
> host, but this is a production environment, so in the meantime...
>
> Is it possible that some spikes in IO could be attributable to the
> autovacuum process?  Is there a way to check this theory?

set log_autovacuum_min_duration to 0 or some positive number, and see
if the vacuums correlate with periods of io stress (from sar or
vmstat, for example--the problem is that sar only takes snapshots
every 10 minutes, which is too coarse if the spikes are short).
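
As a concrete sketch of that (the data directory path is just a placeholder; this setting only needs a reload, not a restart):

# Log every autovacuum run with its duration; add to postgresql.conf
echo "log_autovacuum_min_duration = 0" >> /path/to/data/postgresql.conf
# Reload the config so it takes effect
pg_ctl reload -D /path/to/data    # or: psql -c "SELECT pg_reload_conf();"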

> Would it be advisable (or even permissible to try/test) to disable
> autovacuum, and schedule a manual vacuumdb in the middle of the night, when
> this server is mostly-idle?

Scheduling a manual vacuum should be fine (but keep in mind that
vacuum has very different default cost_delay settings than autovacuum
does.  If the server is completely idle that shouldn't matter, but if
it is only mostly idle, you might want to throttle the IO a bit).  But
I certainly would not disable autovacuum without further evidence.  If
a table only needs to be vacuumed once a day and you preemptively do
it at 3a.m., then autovac won't bother to do it itself during the day.
 So there is no point, but much risk, in also turning autovac off.
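
A rough sketch of that nightly job (values illustrative; the per-session cost delay throttles vacuum's I/O a bit):

# crontab entry: vacuum and analyze all databases at 3 a.m.
0 3 * * *  PGOPTIONS='-c vacuum_cost_delay=20' vacuumdb --all --analyze --quiet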

Cheers,

Jeff


Re: autovacuum fringe case?

From
AJ Weber
Date:

On 1/23/2013 2:13 PM, Jeff Janes wrote:
> On Wed, Jan 23, 2013 at 8:53 AM, AJ Weber<aweber@comcast.net>  wrote:
>> I have a server that is IO-bound right now (it's 4 cores, and top indicates
>> the use rarely hits 25%, but the Wait spikes above 25-40% regularly).
> How long do the spikes last?
From what I can gather, a few seconds to a few minutes.
>
>> The
>> server is running postgresql 9.0 and tomcat 6.  As I have mentioned in a
>> previous thread, I can't alter the hardware to add disks unfortunately, so
>> I'm going to try and move postgresql off this application server to its own
>> host, but this is a production environment, so in the meantime...
>>
>> Is it possible that some spikes in IO could be attributable to the
>> autovacuum process?  Is there a way to check this theory?
> set log_autovacuum_min_duration to 0 or some positive number, and see
> if the vacuums correlate with periods of io stress (from sar or
> vmstat, for example--the problem is that sar only takes snapshots
> every 10 minutes, which is too coarse if the spikes are short).
I used iotop last time it was going crazy, and there were 5 postgres
procs at the top of the list (and virtually nothing else) all doing a
SELECT.  So I'm also going to restart the DB this weekend with
log-min-duration enabled.  Could also be some misbehaving queries...

Is there a skinny set of instructions on loading pg_stat_statements?  Or
should I just log them and review them from there?

>
>> Would it be advisable (or even permissible to try/test) to disable
>> autovacuum, and schedule a manual vacuumdb in the middle of the night, when
>> this server is mostly-idle?
> Scheduling a manual vacuum should be fine (but keep in mind that
> vacuum has very different default cost_delay settings than autovacuum
> does.  If the server is completely idle that shouldn't matter, but if
> it is only mostly idle, you might want to throttle the IO a bit).  But
> I certainly would not disable autovacuum without further evidence.  If
> a table only needs to be vacuumed once a day and you preemptively do
> it at 3a.m., then autovac won't bother to do it itself during the day.
>   So there is no point, but much risk, in also turning autovac off.
If I set autovacuum_max_workers = 1, will that effectively single-thread
it so I don't have two running at once?  Maybe that'll mitigate disk
contention a little at least?
>
> Cheers,
>
> Jeff


Re: autovacuum fringe case?

From
Alvaro Herrera
Date:
AJ Weber wrote:

> On 1/23/2013 2:13 PM, Jeff Janes wrote:

> >Scheduling a manual vacuum should be fine (but keep in mind that
> >vacuum has very different default cost_delay settings than autovacuum
> >does.  If the server is completely idle that shouldn't matter, but if
> >it is only mostly idle, you might want to throttle the IO a bit).  But
> >I certainly would not disable autovacuum without further evidence.  If
> >a table only needs to be vacuumed once a day and you preemptively do
> >it at 3a.m., then autovac won't bother to do it itself during the day.
> >  So there is no point, but much risk, in also turning autovac off.
> If I set autovacuum_max_workers = 1, will that effectively
> single-thread it so I don't have two running at once?  Maybe that'll
> mitigate disk contention a little at least?

If you have a single worker, it will go three times as fast, because the
vacuum cost limit is split among the workers that are running.  If you want
to make the whole thing go slower (i.e. cause less impact on your I/O
system when running), crank up autovacuum_vacuum_cost_delay.
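
In postgresql.conf terms, that would look something like this (a sketch with illustrative values; note autovacuum_max_workers only changes at a server restart, while the cost settings take effect on reload):

# Append to postgresql.conf (path is a placeholder)
cat >> /path/to/data/postgresql.conf <<'EOF'
autovacuum_max_workers = 1             # takes effect at the next restart
autovacuum_vacuum_cost_delay = 20ms    # higher = slower vacuums, gentler on I/O
EOF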

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: autovacuum fringe case?

From
Jeff Janes
Date:
On Wednesday, January 23, 2013, AJ Weber wrote:


Is there a skinny set of instructions on loading pg_stat_statements?  Or should I just log them and review them from there?

Make sure you have installed contrib.  (How you do that depends on how you installed PostgreSQL in the first place. If you installed from source, then just follow "sudo make install" with "cd contrib; sudo make install")

 
Then, just change postgresql.conf so that

shared_preload_libraries = 'pg_stat_statements'

And restart the server.

Then in psql run

create extension pg_stat_statements ;
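
Pulling those steps together as one sketch (paths are placeholders; note CREATE EXTENSION needs 9.1+, so on a 9.0 server you would run the pg_stat_statements contrib SQL script instead):

# 1. Install contrib (source build; from the top of the source tree)
cd contrib && sudo make install

# 2. Preload the module (the last setting in postgresql.conf wins)
echo "shared_preload_libraries = 'pg_stat_statements'" >> /path/to/data/postgresql.conf

# 3. Restart, then create the extension
pg_ctl restart -D /path/to/data
psql -c "CREATE EXTENSION pg_stat_statements;"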

Cheers,

Jeff