Thread: Strange behavior: pgbench and new Linux kernels

Strange behavior: pgbench and new Linux kernels

From: Greg Smith
This week I've finished building and installing OSes on some new hardware
at home.  I have a pretty standard validation routine I go through to make
sure PostgreSQL performance is good on any new system I work with.  Found
a really strange behavior this time around that seems related to changes
in Linux.  Don't expect any help here, but if someone wanted to replicate
my tests I'd be curious to see if that can be done.  I tell the story
mostly because I think it's an interesting tale in hardware and software
validation paranoia, but there's a serious warning here as well for Linux
PostgreSQL users.

The motherboard is fairly new, and I couldn't get CentOS 5.1, which ships
with kernel 2.6.18, to install with the default settings.  I had to drop
back to "legacy IDE" mode to install.  But it was running everything in
old-school IDE mode, no DMA or anything.  "hdparm -Tt" showed a whopping
3MB/s on reads.

I pulled down the latest (at the time--only a few hours and I'm already
behind) Linux kernel, 2.6.24-4, and compiled that with the right modules
included.  Now I'm getting 70MB/s on simple reads.  Everything looked fine
from there until I got to the pgbench select-only tests running PG 8.2.7
(I do 8.2 then 8.3 separately because the checkpoint behavior on
write-heavy stuff is so different and I want to see both results).

Here's the regular thing I do to see how fast pgbench executes against
things in memory (but bigger than the CPU's cache):

-Set shared_buffers=256MB, start the server
-dropdb pgbench (if it's already there)
-createdb pgbench
-pgbench -i -s 10 pgbench    (makes about a 160MB database)
-pgbench -S -c <2*cores> -t 10000 pgbench

Since the database was just written out, the whole thing will still be in
the shared_buffers cache, so this should execute really fast.  This was an
Intel quad-core system, I used -c 8, and that got me around 25K
transactions/second.  Curious to see how high I could push this, I started
stepping up the number of clients.
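
For anyone who wants to replicate this, here's a quick sketch of that
routine as a script--the database name and scale are the ones above, the
client list is the range I stepped through, and everything else is
whatever your install already uses:

#!/bin/sh
# Sketch of the validation routine described above.  Assumes the server is
# already running with shared_buffers=256MB and that dropdb/createdb/pgbench
# are on the PATH for a user allowed to create databases.
DB=pgbench
dropdb $DB 2>/dev/null          # ignore the error if it isn't there yet
createdb $DB
pgbench -i -s 10 $DB            # roughly a 160MB database at -s 10

# Step up the client count and keep just the interesting TPS line
for c in 4 8 9 10 11 12 16 32; do
    echo "clients=$c"
    pgbench -S -c $c -t 10000 $DB | grep excluding
done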

That's where the weird thing happened.  Just by going to 12 clients
instead of 8, I dropped to 8.5K TPS, about 1/3 of what I get from 8
clients.  It was like that on every test run.  When I use 10 clients, it's
about 50/50; sometimes I get 25K, sometimes 8.5K.  The only thing it
seemed to correlate with is that vmstat on the 25K runs showed ~60K
context switches/second, while the 8.5K ones had ~44K.
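
(The context switch numbers are just from watching vmstat in another
terminal while the test runs, along these lines--"cs" is the 12th column
on this vmstat, but check the header since the layout varies a bit
between procps versions:)

# sample the context switch rate once a second during a test run
vmstat 1 | awk 'NR > 2 { print "cs/sec:", $12 }'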

Since I've never seen this before, I went back to my old benchmark system
with a dual-core AMD processor.  That started with CentOS 4 and kernel
2.6.9, but I happened to install kernel 2.6.24-3 on there to get better
support for my Areca card (it goes bonkers regularly on x64 2.6.9).
Never did a thorough performance test of the new kernel though.  Sure
enough, the same behavior was there, except without a flip-flop point,
just a sharp decline.  Check this out:

-bash-3.00$ pgbench -S -c 8 -t 10000 pgbench | grep excluding
tps = 15787.684067 (excluding connections establishing)
tps = 15551.963484 (excluding connections establishing)
tps = 14904.218043 (excluding connections establishing)
tps = 15330.519289 (excluding connections establishing)
tps = 15606.683484 (excluding connections establishing)

-bash-3.00$ pgbench -S -c 12 -t 10000 pgbench | grep excluding
tps = 7593.572749 (excluding connections establishing)
tps = 7870.053868 (excluding connections establishing)
tps = 7714.047956 (excluding connections establishing)

Results are consistent, right?  Summarizing that and extending out, here's
what the median TPS numbers look like with 3 tests at each client load:

-c4:  16621    (increased -t to 20000 here)
-c8:  15551    (all these with t=10000)
-c9:  13269
-c10:  10832
-c11:  8993
-c12:  7714
-c16:  7311
-c32:  7141    (cut -t to 5000 here)

Now, somewhere around here I start thinking about CPU cache coherency, I
play with forcing tasks to particular CPUs, I try the deadline scheduler
instead of the default CFQ, but nothing makes a difference.
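
For anybody trying to replicate that part, these are the sort of knobs I
mean; the CPU numbers, the sda device name, and $PGDATA are just examples
from my setup:

# Pin the postmaster to two CPUs; backends started afterwards inherit
# the affinity ($PGDATA points at the database cluster directory)
taskset -pc 0,1 $(head -1 $PGDATA/postmaster.pid)

# Run the pgbench client itself bound to another CPU
taskset -c 3 pgbench -S -c 8 -t 10000 pgbench | grep excluding

# Swap the default CFQ I/O scheduler for deadline on the test drive
echo deadline > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler      # active one shows up in brackets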

Wanna guess what did?  An earlier kernel.  These results are the same test
as above, same hardware, only difference is I used the standard CentOS 4
2.6.9-67.0.4 kernel instead of 2.6.24-3.

-c4:  18388
-c8:  15760
-c9:  15814    (one result of 12623)
-c12: 14339     (one result of 11105)
-c16:  14148
-c32:  13647    (one result of 10062)

We get the usual bit of pgbench flakiness, but using the earlier kernel is
faster in every case, only degrades slowly as clients increase, and is
almost twice as fast here in a typical high-client load case.

So in the case of this simple benchmark, I see an enormous performance
regression from the newest Linux kernel compared to a much older one.  I
need to do some version bisection to nail it down for sure, but my guess
is it's the change to the Completely Fair Scheduler in 2.6.23 that's to
blame.  The recent FreeBSD 7.0 PostgreSQL benchmarks at
http://people.freebsd.org/~kris/scaling/7.0%20and%20beyond.pdf showed an
equally brutal performance drop going from 2.6.22 to 2.6.23 (see page 16)
in around the same client load on a read-only test.  My initial guess is
that I'm getting nailed by a similar issue here.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Strange behavior: pgbench and new Linux kernels

From: Matthew
On Thu, 17 Apr 2008, Greg Smith wrote:
> So in the case of this simple benchmark, I see an enormous performance
> regression from the newest Linux kernel compared to a much older one.  I need
> to do some version bisection to nail it down for sure, but my guess is it's
> the change to the Completely Fair Scheduler in 2.6.23 that's to blame.

That's a bit sad. From Documentation/sched-design-CFS.txt (2.6.23):

>                                                  There is only one
>   central tunable (you have to switch on CONFIG_SCHED_DEBUG):
>
>         /proc/sys/kernel/sched_granularity_ns
>
>   which can be used to tune the scheduler from 'desktop' (low
>   latencies) to 'server' (good batching) workloads. It defaults to a
>   setting suitable for desktop workloads. SCHED_BATCH is handled by the
>   CFS scheduler module too.

So it'd be worth compiling a kernel with CONFIG_SCHED_DEBUG switched on,
increasing that value, and seeing if that fixes the problem.
Alternatively, use sched_setscheduler() to set SCHED_BATCH (a Linux-only
option), which should increase the timeslice.
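
Untested, but roughly what I have in mind--the granularity value is an
arbitrary example, the tunable got renamed in later kernels
(sched_min_granularity_ns), and chrt needs a util-linux recent enough to
know about SCHED_BATCH:

# bump the CFS granularity (kernel built with CONFIG_SCHED_DEBUG;
# 100ms here is just an arbitrarily large test value)
echo 100000000 > /proc/sys/kernel/sched_granularity_ns

# or start the server under SCHED_BATCH, which does the
# sched_setscheduler() call for you; the priority must be 0
chrt -b 0 pg_ctl -D $PGDATA start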

Matthew

--
Psychotics are consistently inconsistent. The essence of sanity is to
be inconsistently inconsistent.

Re: Strange behavior: pgbench and new Linux kernels

From: "Jeffrey Baker"
On Thu, Apr 17, 2008 at 12:58 AM, Greg Smith <gsmith@gregsmith.com> wrote:
>  So in the case of this simple benchmark, I see an enormous performance
> regression from the newest Linux kernel compared to a much older one.

This has been discussed recently on linux-kernel.  It's definitely a
regression.  Instead of getting a nice, flat overload behavior when
the # of busy threads exceeds the number of CPUs, you get the
declining performance you noted.

Poor PostgreSQL scaling on Linux 2.6.25-rc5 (vs 2.6.22)
http://marc.info/?l=linux-kernel&m=120521826111587&w=2

-jwb

Re: Strange behavior: pgbench and new Linux kernels

From: Matthew
On Thu, 17 Apr 2008, Jeffrey Baker wrote:
> On Thu, Apr 17, 2008 at 12:58 AM, Greg Smith <gsmith@gregsmith.com> wrote:
>>  So in the case of this simple benchmark, I see an enormous performance
>> regression from the newest Linux kernel compared to a much older one.
>
> This has been discussed recently on linux-kernel.  It's definitely a
> regression.  Instead of getting a nice, flat overload behavior when
> the # of busy threads exceeds the number of CPUs, you get the
> declining performance you noted.
>
> Poor PostgreSQL scaling on Linux 2.6.25-rc5 (vs 2.6.22)
> http://marc.info/?l=linux-kernel&m=120521826111587&w=2

The last message in the thread says that 2.6.25-rc6 has the problem
nailed. That was a month ago. So I guess, upgrade to 2.6.25, which was
released today.

Matthew

--
"Prove to thyself that all circuits that radiateth and upon which thou worketh
 are grounded, lest they lift thee to high-frequency potential and cause thee
 to radiate also. "             -- The Ten Commandments of Electronics

Re: Strange behavior: pgbench and new Linux kernels

From: Greg Smith
On Thu, 17 Apr 2008, Jeffrey Baker wrote:

> This has been discussed recently on linux-kernel.

Excellent pointer, here's direct to the interesting link there:
http://marc.info/?l=linux-kernel&m=120574906013029&w=2

Ingo's test system has 16 cores and dives hard at >32 clients; my 4-core
system has trouble with >8 clients; looks like the same behavior.  And it
seems to be fixed in 2.6.25, which just "shipped" literally in the middle
of my testing last night.  Had I waited until today to grab a kernel I
probably would have missed the whole thing.

I'll have to re-run to be sure (I just love running a kernel with the
paint still wet) but it looks like the conclusion here is "don't run
PostgreSQL on kernels 2.6.23 or 2.6.24".  Good thing I already hated FC8.

If all these kernel developers are using sysbench, we really should get
that thing cleaned up so it runs well with PostgreSQL.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Strange behavior: pgbench and new Linux kernels

From: Greg Smith
On Thu, 17 Apr 2008, Matthew wrote:

> The last message in the thread says that 2.6.25-rc6 has the problem nailed.
> That was a month ago. So I guess, upgrade to 2.6.25, which was released
> today.

Ah, even more support for me to distrust everything I read.  The change
has flattened out things, so now the pgbench results are awful everywhere.
On this benchmark 2.6.25 is the worst kernel yet:

-bash-3.00$ pgbench -S -c 4 -t 10000 pgbench | grep excluding
tps = 8619.710649 (excluding connections establishing)
tps = 8664.321235 (excluding connections establishing)
tps = 8671.973915 (excluding connections establishing)
(was 18388 in 2.6.9 and 16621 in 2.6.24-3)

-bash-3.00$ pgbench -S -c 8 -t 10000 pgbench | grep excluding
tps = 9011.728765 (excluding connections establishing)
tps = 9039.441796 (excluding connections establishing)
tps = 9206.574000 (excluding connections establishing)
(was 15760 in 2.6.9 and 15551 in 2.6.24-3)

-bash-3.00$ pgbench -S -c 16 -t 10000 pgbench | grep excluding
tps = 7063.710786 (excluding connections establishing)
tps = 6956.266777 (excluding connections establishing)
tps = 7120.971600 (excluding connections establishing)
(was 14148 in 2.6.9 and 7311 in 2.6.24-3)

-bash-3.00$ pgbench -S -c 32 -t 10000 pgbench | grep excluding
tps = 7006.311636 (excluding connections establishing)
tps = 6971.305909 (excluding connections establishing)
tps = 7002.820583 (excluding connections establishing)
(was 13647 in 2.6.9 and 7141 in 2.6.24-3)

This is what happens when the kernel developers are using results from a
MySQL tool to optimize things I guess.  It seems I have a lot of work
ahead of me here to nail down and report what's going on here.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Strange behavior: pgbench and new Linux kernels

From: david@lang.hm
On Thu, 17 Apr 2008, Greg Smith wrote:

> This is what happens when the kernel developers are using results from a
> MySQL tool to optimize things I guess.  It seems I have a lot of work ahead
> of me here to nail down and report what's going on here.

report this to the kernel list so that they know, and be ready to test
fixes. the kernel developers base success or failure on the results of
tests. if the only people providing test results are MySQL people, how
would they know there is a problem?

David Lang

Re: Strange behavior: pgbench and new Linux kernels

From: Greg Smith
On Thu, 17 Apr 2008, david@lang.hm wrote:

> report this to the kernel list so that they know, and be ready to test fixes.

Don't worry, I'm on that.  I'm already having enough problems with
database performance under Linux; if they start killing results on the
easy benchmarks I'll really be in trouble.

> if the only people providing test results are MySQL people, how would
> they know there is a problem?

The thing I was alluding to is that both FreeBSD and Linux kernel
developers are now doing all their PostgreSQL tests with sysbench, a MySQL
tool with rudimentary PostgreSQL support bolted on (badly).  I think
rather than complain about it someone (and I fear this will be me) needs
to just fix that so it works well.  It's really handy for people to have
something they can get familiar with that runs against both databases in a
way they can be compared fairly.  Right now PG beats MySQL on scalability
in read tests despite all that; the big problems with PG+sysbench show up
when you try to write with it.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Strange behavior: pgbench and new Linux kernels

From: Tom Lane
Greg Smith <gsmith@gregsmith.com> writes:
> This is what happens when the kernel developers are using results from a
> MySQL tool to optimize things I guess.  It seems I have a lot of work
> ahead of me here to nail down and report what's going on here.

Yeah, it's starting to be obvious that we'd better not ignore sysbench
as "not our problem".  Do you have any roadmap on what needs to be done
to it?

            regards, tom lane

Re: Strange behavior: pgbench and new Linux kernels

From: Greg Smith
On Fri, 18 Apr 2008, Tom Lane wrote:

> Yeah, it's starting to be obvious that we'd better not ignore sysbench
> as "not our problem".  Do you have any roadmap on what needs to be done
> to it?

Just dug into this code again for a minute and it goes something like
this:

1) Wrap the write statements into transactions properly so the OLTP code
works.  There's a BEGIN/COMMIT in there, but last time I tried that test
it just deadlocked on me (I got a report of the same from someone else as
well).  There are some FIXME items in the code for PostgreSQL already that
might be related here.

2) Make sure the implementation is running statistics correctly (they
create a table and index, but there's certainly no ANALYZE in there).

3) Implement the parts of the driver wrapper that haven't been done yet.

4) Try to cut down on crashes (I recall a lot of these when I tried to use
all the features).

5) Compare performance on some simple operations to pgbench to see if it's
competitive.  Look into whether there's code in the PG wrapper they use
that can be optimized usefully.

There are two performance-related things that jump right out as things I'd
want to confirm aren't causing issues:

-It's a threaded design
-The interesting tests look like they use prepared statements.

I think the overall approach sysbench uses is good; it just needs some
adjustments to work right against a PG database.
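
For reference, the sort of invocation involved for the read-only case
looks roughly like this--flag names are from the 0.4 series and may
differ in other versions, and the connection settings are only examples:

# build the test table, then run the read-only OLTP test with 8 threads
sysbench --test=oltp --db-driver=pgsql --pgsql-host=localhost \
  --pgsql-user=postgres --pgsql-db=sbtest --oltp-table-size=1000000 prepare
sysbench --test=oltp --db-driver=pgsql --pgsql-host=localhost \
  --pgsql-user=postgres --pgsql-db=sbtest --oltp-read-only=on \
  --num-threads=8 --max-requests=100000 run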

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Strange behavior: pgbench and new Linux kernels

From: Matthew
On Thu, 17 Apr 2008, I wrote:
>>                                                  There is only one
>>   central tunable (you have to switch on CONFIG_SCHED_DEBUG):
>>
>>         /proc/sys/kernel/sched_granularity_ns
>>
>>   which can be used to tune the scheduler from 'desktop' (low
>>   latencies) to 'server' (good batching) workloads. It defaults to a
>>   setting suitable for desktop workloads. SCHED_BATCH is handled by the
>>   CFS scheduler module too.
>
> So it'd be worth compiling a kernel with CONFIG_SCHED_DEBUG switched on,
> increasing that value, and seeing if that fixes the problem.
> Alternatively, use sched_setscheduler() to set SCHED_BATCH (a Linux-only
> option), which should increase the timeslice.

Looking at the problem a bit closer, it's obvious to me that larger
timeslices would not have fixed this problem, so ignore my suggestion.

It appears that the problem is caused by inter-process communication
blocking and causing processes to be put right to the back of the run
queue, therefore causing a very fine-grained round-robin of the runnable
processes, which trashes the CPU caches. You may also be seeing processes
forced to switch between CPUs, which breaks the caches even more. So what
happens if you run pgbench on a separate machine to the server? Does the
problem still exist in that case?
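
I mean something along these lines, with "dbserver" standing in for
wherever the database actually lives:

# same select-only test, but with the pgbench client on a second machine
# connecting back over the network
pgbench -S -c 12 -t 10000 -h dbserver pgbench | grep excluding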

Matthew

--
X's book explains this very well, but, poor bloke, he did the Cambridge Maths
Tripos...                               -- Computer Science Lecturer

Re: Strange behavior: pgbench and new Linux kernels

From: Greg Smith
On Fri, 18 Apr 2008, Matthew wrote:

> You may also be seeing processes forced to switch between CPUs, which
> breaks the caches even more. So what happens if you run pgbench on a
> separate machine to the server? Does the problem still exist in that
> case?

I haven't run that test yet but will before I submit a report.  I did
however try running things with the pgbench executable itself bound to a
single CPU with no improvement.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Strange behavior: pgbench and new Linux kernels

From: Greg Smith
On Fri, 18 Apr 2008, Matthew wrote:

> So what happens if you run pgbench on a separate machine to the server?
> Does the problem still exist in that case?

It does not.  At the low client counts, there's a big drop-off relative to
running on localhost just because of running over the network.  But once I
get to 4 clients the remote pgbench setup is even with the localhost one.
At 50 clients, the all-local setup is at 8100 tps while the remote pgbench
is at 26000.

So it's pretty clear to me now that the biggest problem here is the
pgbench client itself not working well at all with the newer kernels.
It's difficult to see through that to tell for sure how well each kernel
version is handling the server portion of the job underneath.  I hope to
have time this week to finally submit all this to lkml.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Strange behavior: pgbench and new Linux kernels

From: "Kevin Grittner"
>>> On Thu, Apr 17, 2008 at  7:26 PM, Greg Smith wrote:

> On this benchmark 2.6.25 is the worst kernel yet:

> It seems I have a lot of work ahead of me here
> to nail down and report what's going on here.

I don't remember seeing a follow-up on this issue from last year.
Are there still any particular kernels to avoid based on this?

Thanks,

-Kevin

Re: Strange behavior: pgbench and new Linux kernels

From: Greg Smith
On Tue, 31 Mar 2009, Kevin Grittner wrote:

>>>> On Thu, Apr 17, 2008 at  7:26 PM, Greg Smith wrote:
>
>> On this benchmark 2.6.25 is the worst kernel yet:
>
>> It seems I have a lot of work ahead of me here
>> to nail down and report what's going on here.
>
> I don't remember seeing a follow-up on this issue from last year.
> Are there still any particular kernels to avoid based on this?

I never got any confirmation that the patches that came out of my
discussions with the kernel developers were ever merged.  I'm in the
middle of a bunch of pgbench tests this week, and one of the things I
planned to try was seeing if the behavior has changed in 2.6.28 or 2.6.29.
I'm speaking about pgbench at the PostgreSQL East conference this weekend
and will have an update by then (along with a new toolchain for automating
large quantities of pgbench tests).

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Strange behavior: pgbench and new Linux kernels

From: Greg Smith
On Tue, 31 Mar 2009, Kevin Grittner wrote:

>>>> On Thu, Apr 17, 2008 at  7:26 PM, Greg Smith wrote:
>
>> On this benchmark 2.6.25 is the worst kernel yet:
>
> I don't remember seeing a follow-up on this issue from last year.
> Are there still any particular kernels to avoid based on this?

I just discovered something really fascinating here.  The problem is
strictly limited to when you're connecting via Unix-domain sockets; use
TCP/IP instead, and it goes away.

To refresh everyone's memory here, I reported a problem to the LKML here:
http://lkml.org/lkml/2008/5/21/292 Got some patches and some kernel tweaks
for the scheduler but never a clear resolution for the cause, which kept
anybody from getting too excited about merging anything.  Test results
comparing various tweaks on the hardware I'm still using now are at
http://lkml.org/lkml/2008/5/26/288

For example, here's kernel 2.6.25 running pgbench with 50 clients on a
Q6600 processor, demonstrating poor performance--I'd get >20K TPS here
with a pre-CFS kernel:

$ pgbench -S -t 4000 -c 50 -n pgbench
transaction type: SELECT only
scaling factor: 10
query mode: simple
number of clients: 50
number of transactions per client: 4000
number of transactions actually processed: 200000/200000
tps = 8288.047442 (including connections establishing)
tps = 8319.702195 (excluding connections establishing)

If I now execute exactly the same test, but using localhost, performance
returns to normal:

$ pgbench -S -t 4000 -c 50 -n -h localhost pgbench
transaction type: SELECT only
scaling factor: 10
query mode: simple
number of clients: 50
number of transactions per client: 4000
number of transactions actually processed: 200000/200000
tps = 17575.277771 (including connections establishing)
tps = 17724.651090 (excluding connections establishing)

That's 100% repeatable, I ran each test several times each way.

So the new summary here of what I've found is that if:

1) You're running Linux 2.6.23 or greater (confirmed up through 2.6.26)
2) You connect over a Unix-domain socket
3) Your client count is relatively high (>8 clients/core)

You can expect your pgbench results to tank.  Switch to connecting over
TCP/IP to localhost, and everything is fine; in some cases it's not quite
as fast as the pre-CFS kernels, in others it's actually faster.

I haven't gotten to testing kernels newer than 2.6.26 yet, when I saw a
17K TPS result during one of my tests on 2.6.25 I screeched to a halt to
isolate this instead.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Strange behavior: pgbench and new Linux kernels

From: Josh Berkus
On 4/4/09 9:07 AM, Greg Smith wrote:
> On Tue, 31 Mar 2009, Kevin Grittner wrote:
>
>>>>> On Thu, Apr 17, 2008 at 7:26 PM, Greg Smith wrote:
>>
>>> On this benchmark 2.6.25 is the worst kernel yet:
>>
>> I don't remember seeing a follow-up on this issue from last year.
>> Are there still any particular kernels to avoid based on this?
>
> I just discovered something really fascinating here. The problem is
> strictly limited to when you're connecting via Unix-domain sockets; use
> TCP/IP instead, and it goes away.

Have you sent this to any Linux kernel engineers?  My experience is that
they're fairly responsive to this sort of thing.

--Josh

Re: Strange behavior: pgbench and new Linux kernels

From: Greg Smith
On Sat, 4 Apr 2009, Josh Berkus wrote:

> Have you sent this to any Linux kernel engineers?  My experience is that
> they're fairly responsive to this sort of thing.

I'm going to submit an updated report to LKML once I get back from East;
I want to test against the latest kernel first.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD