Re: Two Necessary Kernel Tweaks for Linux Systems - Mailing list pgsql-performance

From Midge Brown
Subject Re: Two Necessary Kernel Tweaks for Linux Systems
Date
Msg-id B994348460014EC7BA65F310E9D30EBD@BERNICE
In response to Two Necessary Kernel Tweaks for Linux Systems  (Shaun Thomas <sthomas@optionshouse.com>)
Responses Re: Two Necessary Kernel Tweaks for Linux Systems  (Shaun Thomas <sthomas@optionshouse.com>)
List pgsql-performance
The kernel on our Linux system doesn't appear to have these two settings according to the list provided by sysctl -a. Please pardon my ignorance, but should I add them?
 
We have PostgreSQL 9.0 on Linux 2.6.18-164.el5 #1 SMP Thu Sep 3 03:28:30 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux
 
Thanks,
Midge
 
----- Original Message -----
Sent: Wednesday, January 02, 2013 1:46 PM
Subject: [PERFORM] Two Necessary Kernel Tweaks for Linux Systems

Hey everyone!

After much testing and hair-pulling, we've confirmed two kernel
settings that should always be modified on production Linux systems,
especially newer ones running the Completely Fair Scheduler (CFS)
rather than the old O(1) scheduler.

If you want to follow along, these are:

/proc/sys/kernel/sched_migration_cost
/proc/sys/kernel/sched_autogroup_enabled

Which correspond to sysctl settings:

kernel.sched_migration_cost
kernel.sched_autogroup_enabled
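
As a quick check (a sketch, and assuming a kernel new enough to expose
these tunables; older kernels won't list them in sysctl -a at all), the
current values can be read through either interface:

  sysctl kernel.sched_migration_cost kernel.sched_autogroup_enabled
  cat /proc/sys/kernel/sched_migration_cost
  cat /proc/sys/kernel/sched_autogroup_enabled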

What do these settings do?
--------------------------

* sched_migration_cost

The migration cost is the total time the scheduler will consider a
migrated process "cache hot" and thus less likely to be re-migrated.
By default this is 0.5ms (500000 ns), and as the size of the process
table increases, it eventually causes the scheduler to break down. On
our systems, after a smooth degradation with increasing connection
count, system CPU spiked from 20% to 70% sustained and TPS was cut by
5-10x once we crossed some invisible connection-count threshold. For
us, that was a pgbench with 900 or more clients.

The migration cost should be increased almost universally on server
systems with many processes. This means systems running PostgreSQL or
Apache would benefit from a higher migration cost. We've had good
luck with a setting of 5ms (5000000 ns) instead.
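
As a sketch, that 5ms value can be applied like so (the runtime change
takes effect immediately; persisting it in /etc/sysctl.conf is an
assumption about how sysctls are managed on your distribution):

  # Apply at runtime (value is in nanoseconds):
  sysctl -w kernel.sched_migration_cost=5000000

  # Persist across reboots by adding this line to /etc/sysctl.conf,
  # then reloading with "sysctl -p":
  # kernel.sched_migration_cost = 5000000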

When the breakdown occurs, system CPU (as obtained from sar) increases
from 20% on a heavy pgbench (scale 3500 on a 72GB system) to over 70%,
and %nice/%user is cut by half or more. A higher migration cost
essentially eliminates this artificial throttle.

* sched_autogroup_enabled

This is a relatively new patch that Linus lauded back in late 2010.
It basically groups tasks by TTY so that perceived responsiveness is
improved. But on server systems, large daemons like PostgreSQL are
launched from the same pseudo-TTY and are effectively choked out of
CPU cycles in favor of less important tasks.

The default setting is 1 (enabled) on some platforms. By setting this to
0 (disabled), we saw an outright 30% performance boost on the same
pgbench test. A fully cached scale 3500 database on a 72GB system went
from 67k TPS to 82k TPS with 900 client connections.
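
Disabling it follows the same pattern; a minimal sketch:

  # Apply at runtime:
  sysctl -w kernel.sched_autogroup_enabled=0

  # Persist across reboots via /etc/sysctl.conf:
  # kernel.sched_autogroup_enabled = 0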

Total Benefit
-------------

At higher connection counts, such as on systems that can't use
pooling or that make extensive use of prepared queries, these
settings can massively affect performance. At 900 connections, our
test systems ran at 17k TPS unaltered, but 85k TPS after these two
modifications. Even with this performance boost, we still had 40% CPU
free instead of 0%. In effect, the new scheduler's degraded behavior
under large process tables is returned to normal.

Some systems will have a higher "cracking" point than others. The
effect is amplified when a system is under high memory pressure, so
running a lot of expensive queries across a high number of concurrent
connections is the easiest way to reproduce these results.
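
For anyone who wants to try reproducing this, a pgbench run along
these lines should work; the scale factor and client count come from
the numbers above, while the thread count, duration, and database
name below are just assumptions:

  # Build a scale-3500 dataset (fits in cache on a 72GB machine):
  pgbench -i -s 3500 pgbench

  # Run with 900 client connections
  # (assumes max_connections is set high enough):
  pgbench -c 900 -j 30 -T 300 pgbench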

Admins migrating from older systems (RHEL 5.x) may find this
especially shocking, because the old O(1) scheduler was too "stupid"
to have these advanced features, so this kind of behavior simply
couldn't occur.

There's probably still a little room for improvement here, since
30-40% CPU is still unclaimed in our larger tests. I'd like to see
the drop from the ideal throughput (175k TPS at 24 connections)
reduced further. But these kernel tweaks are rarely discussed
anywhere, it seems, and there doesn't appear to be any consensus on
how these (and other) scheduler settings should be modified under
different usage scenarios.

I just figured I'd share, since we found this info so beneficial.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 500 | Chicago IL, 60604
312-444-8534
sthomas@optionshouse.com



