Re: Two Necessary Kernel Tweaks for Linux Systems - Mailing list pgsql-performance

From Henri Philipps
Subject Re: Two Necessary Kernel Tweaks for Linux Systems
Date
Msg-id CABvEAQs7cRUH9PLHyTo-F+0t8=iKgX8N3ZSfCuzeSWXBJ9hM3w@mail.gmail.com
In response to Re: Two Necessary Kernel Tweaks for Linux Systems  (Shaun Thomas <sthomas@optionshouse.com>)
Responses Re: Two Necessary Kernel Tweaks for Linux Systems  (Shaun Thomas <sthomas@optionshouse.com>)
autovacuum fringe case?  (AJ Weber <aweber@comcast.net>)
List pgsql-performance
Hi,

we also hit this performance barrier a while ago, when migrating a
database on a big server (48-core Opteron, 512GB RAM) from kernel
2.6.32 to 3.2 (both kernels from Debian packages). The system load
got very high, as you also observed (I don't have the exact numbers
at hand right now).

After some investigation I found that the reason for the high
system load was that the PostgreSQL processes were migrating from core
to core at very high rates. So the behaviour of the CFS scheduler must
have changed in this regard between the 2.6.32 and 3.2 kernels.

You can easily see this by looking at how much CPU time the
migration kernel threads consume (ps ax | grep migration). A look
into /proc/sched_debug can also give you more insight into the
scheduler's behaviour.
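
For reference, these are the commands I mean. The perf invocation is
an extra suggestion I haven't described above (it assumes perf is
installed, and <pid> is a placeholder for a busy postgres backend):

    # CPU time accumulated by the per-core migration kernel threads
    ps ax | grep migration

    # Scheduler statistics, including per-CPU load balancing counters
    cat /proc/sched_debug

    # Count how often a given backend is moved between CPUs
    perf stat -e cpu-migrations -p <pid> sleep 10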

On NUMA systems the scheduler tries to migrate processes to the nodes
on which they have the best memory locality. But with a big database,
one process typically reads randomly from a dataset that is spread
across all nodes. On newer kernels the CFS scheduler seems to try more
aggressively to migrate processes to other cores; I don't know whether
this is for better load balancing or for better memory locality, but
process migrations consume a lot of resources.
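
If you want to check how memory is actually spread across the nodes
on your system (assuming the numactl package is installed; <pid> is
again a placeholder):

    # NUMA nodes with their CPUs, total and free memory
    numactl --hardware

    # Per-node placement of one backend's memory mappings
    cat /proc/<pid>/numa_maps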

I had to change sched_migration_cost from 500000 (0.5ms) to 100000000
(100ms). This means the scheduler only considers a task for migration
if it has been running for at least 100ms instead of 0.5ms. This
solved the problem for us: the migration kernel threads didn't have
to do much work anymore, and the system load went down again.
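
For completeness, this is how to set it. On 2.6.32/3.2 the sysctl is
called kernel.sched_migration_cost; I believe newer kernels renamed it
to sched_migration_cost_ns, so check which one exists on your system:

    # Current value in nanoseconds
    cat /proc/sys/kernel/sched_migration_cost

    # Raise it from 0.5ms to 100ms (as root)
    sysctl -w kernel.sched_migration_cost=100000000

    # To make it persistent, add to /etc/sysctl.conf:
    # kernel.sched_migration_cost = 100000000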

A general problem is that the CFS scheduler changes a lot between
kernel versions, so it is really hard to predict which regressions
you may hit when moving to another kernel version. Scheduling on
NUMA systems is also very complex.

An interesting dissertation showing the inconsistent behaviour of the
CFS scheduler:
http://research.cs.wisc.edu/adsl/Publications/meehean-thesis11.pdf

Some parameters which could also be considered for systematic benchmarking are:

sched_latency_ns
sched_min_granularity_ns

I suspect that higher values could also improve performance on systems
with many cores and many connections (see the example below).
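
If you want to experiment with these (again as root; the doubled
values below are only illustrative starting points I haven't tested,
not recommendations):

    # Current values in nanoseconds
    cat /proc/sys/kernel/sched_latency_ns
    cat /proc/sys/kernel/sched_min_granularity_ns

    # Example: double the values so each task gets longer timeslices
    sysctl -w kernel.sched_latency_ns=48000000
    sysctl -w kernel.sched_min_granularity_ns=8000000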

Thanks for starting this interesting thread!

Henri

