Thread: Strange behavior: pgbench and new Linux kernels
This week I've finished building and installing OSes on some new hardware at home. I have a pretty standard validation routine I go through to make sure PostgreSQL performance is good on any new system I work with. I found a really strange behavior this time around that seems related to changes in Linux. I don't expect any help here, but if someone wanted to replicate my tests I'd be curious to see whether that can be done. I tell the story mostly because I think it's an interesting tale in hardware and software validation paranoia, but there's a serious warning here as well for Linux PostgreSQL users.

The motherboard is fairly new, and I couldn't get CentOS 5.1, which ships with kernel 2.6.18, to install with the default settings. I had to drop back to "legacy IDE" mode to install. But it was running everything in old-school IDE mode, no DMA or anything. "hdparm -Tt" showed a whopping 3MB/s on reads. I pulled down the latest (at the time--only a few hours and I'm already behind) Linux kernel, 2.6.24-4, and compiled that with the right modules included. Now I'm getting 70MB/s on simple reads.

Everything looked fine from there until I got to the pgbench select-only tests running PG 8.2.7 (I do 8.2 then 8.3 separately because the checkpoint behavior on write-heavy stuff is so different and I want to see both results). Here's the regular thing I do to see how fast pgbench executes against things in memory (but bigger than the CPU's cache):

-Set shared_buffers=256MB, start the server
-dropdb pgbench (if it's already there)
-createdb pgbench
-pgbench -i -s 10 pgbench (makes about a 160MB database)
-pgbench -S -c <2*cores> -t 10000 pgbench

Since the database was just written out, the whole thing will still be in the shared_buffers cache, so this should execute really fast. This was an Intel quad-core system, I used -c 8, and that got me around 25K transactions/second.

Curious to see how high I could push this, I started stepping up the number of clients. That's where the weird thing happened. Just by going to 12 clients instead of 8, I dropped to 8.5K TPS, about 1/3 of what I get from 8 clients. It was like that on every test run. When I use 10 clients, it's about 50/50; sometimes I get 25K, sometimes 8.5K. The only thing it seemed to correlate with is that vmstat on the 25K runs showed ~60K context switches/second, while the 8.5K ones had ~44K.

Since I've never seen this before, I went back to my old benchmark system with a dual-core AMD processor. That started with CentOS 4 and kernel 2.6.9, but I happened to install kernel 2.6.24-3 on there to get better support for my Areca card (it goes bonkers regularly on x64 2.6.9). I never did a thorough performance test of the new kernel, though. Sure enough, the same behavior was there, except without a flip-flop point, just a sharp decline. Check this out:

-bash-3.00$ pgbench -S -c 8 -t 10000 pgbench | grep excluding
tps = 15787.684067 (excluding connections establishing)
tps = 15551.963484 (excluding connections establishing)
tps = 14904.218043 (excluding connections establishing)
tps = 15330.519289 (excluding connections establishing)
tps = 15606.683484 (excluding connections establishing)

-bash-3.00$ pgbench -S -c 12 -t 10000 pgbench | grep excluding
tps = 7593.572749 (excluding connections establishing)
tps = 7870.053868 (excluding connections establishing)
tps = 7714.047956 (excluding connections establishing)

Results are consistent, right?
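If anyone wants to replicate this, the routine boils down to something like the following script; a minimal sketch, assuming the server is already running with shared_buffers=256MB and that dropdb/createdb/pgbench are on the PATH (variable names are just illustrative):

#!/bin/sh
# Rebuild a small in-memory pgbench database and run the select-only test.
# Adjust CLIENTS to roughly 2x your core count.
SCALE=10          # -s 10 builds roughly a 160MB database
CLIENTS=8
TRANS=10000

dropdb pgbench 2>/dev/null    # ignore the error if it doesn't exist yet
createdb pgbench
pgbench -i -s $SCALE pgbench

# Repeat the select-only test a few times, keeping just the TPS lines
for run in 1 2 3; do
    pgbench -S -c $CLIENTS -t $TRANS pgbench | grep excluding
done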
Summarizing that and extending out, here's what the median TPS numbers look like with 3 tests at each client load:

-c4:  16621 (increased -t to 20000 here)
-c8:  15551 (all these with t=10000)
-c9:  13269
-c10: 10832
-c11:  8993
-c12:  7714
-c16:  7311
-c32:  7141 (cut -t to 5000 here)

Now, somewhere around here I start thinking about CPU cache coherency, I play with forcing tasks to particular CPUs, I try the deadline scheduler instead of the default CFQ, but nothing makes a difference. Wanna guess what did? An earlier kernel. These results are the same test as above, same hardware; the only difference is I used the standard CentOS 4 2.6.9-67.0.4 kernel instead of 2.6.24-3:

-c4:  18388
-c8:  15760
-c9:  15814 (one result of 12623)
-c12: 14339 (one result of 11105)
-c16: 14148
-c32: 13647 (one result of 10062)

We get the usual bit of pgbench flakiness, but using the earlier kernel is faster in every case, only degrades slowly as clients increase, and is almost twice as fast here in a typical high-client load case.

So in the case of this simple benchmark, I see an enormous performance regression from the newest Linux kernel compared to a much older one. I need to do some version bisection to nail it down for sure, but my guess is it's the change to the Completely Fair Scheduler in 2.6.23 that's to blame. The recent FreeBSD 7.0 PostgreSQL benchmarks at http://people.freebsd.org/~kris/scaling/7.0%20and%20beyond.pdf showed an equally brutal performance drop going from 2.6.22 to 2.6.23 (see page 16) at around the same client load on a read-only test. My initial guess is that I'm getting nailed by a similar issue here.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
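For anyone repeating those experiments, the knobs in question look roughly like this; a sketch, where the PID and device name are placeholders, the scheduler switch needs root, and none of it made a difference here:

# Pin an already-running backend (or the postmaster) to CPU 0;
# the PID is a placeholder for whatever ps shows you.
taskset -pc 0 12345

# Check and switch the I/O scheduler on one disk from CFQ to deadline
# (irrelevant for an in-memory test, which is part of why it didn't help)
cat /sys/block/sda/queue/scheduler
echo deadline > /sys/block/sda/queue/scheduler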
On Thu, 17 Apr 2008, Greg Smith wrote:

> So in the case of this simple benchmark, I see an enormous performance
> regression from the newest Linux kernel compared to a much older one. I need
> to do some version bisection to nail it down for sure, but my guess is it's
> the change to the Completely Fair Scheduler in 2.6.23 that's to blame.

That's a bit sad. From Documentation/sched-design-CFS.txt (2.6.23):

> There is only one
> central tunable (you have to switch on CONFIG_SCHED_DEBUG):
>
>   /proc/sys/kernel/sched_granularity_ns
>
> which can be used to tune the scheduler from 'desktop' (low
> latencies) to 'server' (good batching) workloads. It defaults to a
> setting suitable for desktop workloads. SCHED_BATCH is handled by the
> CFS scheduler module too.

So it'd be worth compiling a kernel with CONFIG_SCHED_DEBUG switched on and try increasing that value, and see if that fixes the problem. Alternatively, use sched_setscheduler to set SCHED_BATCH, which should increase the timeslice (a Linux-only option).

Matthew

--
Psychotics are consistently inconsistent. The essence of sanity is to be inconsistently inconsistent.
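Concretely, both suggestions look something like this; a sketch, where the granularity value is arbitrary, the tunable only exists with CONFIG_SCHED_DEBUG and its exact name has moved around between kernel versions, and chrt is used as a shortcut for calling sched_setscheduler() yourself:

# Raise the CFS granularity toward 'server'-style batching (needs root;
# the value is only an example)
cat /proc/sys/kernel/sched_granularity_ns
echo 100000000 > /proc/sys/kernel/sched_granularity_ns

# Or start the server under SCHED_BATCH; chrt (util-linux) does the
# sched_setscheduler() call, -b selects SCHED_BATCH at priority 0
chrt -b 0 pg_ctl -D /path/to/data start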
On Thu, Apr 17, 2008 at 12:58 AM, Greg Smith <gsmith@gregsmith.com> wrote:
> So in the case of this simple benchmark, I see an enormous performance
> regression from the newest Linux kernel compared to a much older one.

This has been discussed recently on linux-kernel. It's definitely a regression. Instead of getting a nice, flat overload behavior when the # of busy threads exceeds the number of CPUs, you get the declining performance you noted.

Poor PostgreSQL scaling on Linux 2.6.25-rc5 (vs 2.6.22)
http://marc.info/?l=linux-kernel&m=120521826111587&w=2

-jwb
On Thu, 17 Apr 2008, Jeffrey Baker wrote:
> On Thu, Apr 17, 2008 at 12:58 AM, Greg Smith <gsmith@gregsmith.com> wrote:
>> So in the case of this simple benchmark, I see an enormous performance
>> regression from the newest Linux kernel compared to a much older one.
>
> This has been discussed recently on linux-kernel. It's definitely a
> regression. Instead of getting a nice, flat overload behavior when
> the # of busy threads exceeds the number of CPUs, you get the
> declining performance you noted.
>
> Poor PostgreSQL scaling on Linux 2.6.25-rc5 (vs 2.6.22)
> http://marc.info/?l=linux-kernel&m=120521826111587&w=2

The last message in the thread says that 2.6.25-rc6 has the problem nailed. That was a month ago. So I guess: upgrade to 2.6.25, which was released today.

Matthew

--
"Prove to thyself that all circuits that radiateth and upon which thou worketh are grounded, lest they lift thee to high-frequency potential and cause thee to radiate also." -- The Ten Commandments of Electronics
On Thu, 17 Apr 2008, Jeffrey Baker wrote:
> This has been discussed recently on linux-kernel.

Excellent pointer; here's a direct link to the interesting message in that thread:
http://marc.info/?l=linux-kernel&m=120574906013029&w=2

Ingo's test system has 16 cores and dives hard at >32 clients; my 4-core system has trouble with >8 clients; it looks like the same behavior. And it seems to be fixed in 2.6.25, which just "shipped" literally in the middle of my testing last night. Had I waited until today to grab a kernel I probably would have missed the whole thing.

I'll have to re-run to be sure (I just love running a kernel with the paint still wet), but it looks like the conclusion here is "don't run PostgreSQL on kernels 2.6.23 or 2.6.24". Good thing I already hated FC8.

If all these kernel developers are using sysbench, we really should get that thing cleaned up so it runs well with PostgreSQL.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
On Thu, 17 Apr 2008, Matthew wrote:
> The last message in the thread says that 2.6.25-rc6 has the problem nailed.
> That was a month ago. So I guess, upgrade to 2.6.25, which was released
> today.

Ah, even more support for me to distrust everything I read. The change has flattened out things, so now the pgbench results are awful everywhere. On this benchmark 2.6.25 is the worst kernel yet:

-bash-3.00$ pgbench -S -c 4 -t 10000 pgbench | grep excluding
tps = 8619.710649 (excluding connections establishing)
tps = 8664.321235 (excluding connections establishing)
tps = 8671.973915 (excluding connections establishing)
(was 18388 in 2.6.9 and 16621 in 2.6.24-3)

-bash-3.00$ pgbench -S -c 8 -t 10000 pgbench | grep excluding
tps = 9011.728765 (excluding connections establishing)
tps = 9039.441796 (excluding connections establishing)
tps = 9206.574000 (excluding connections establishing)
(was 15760 in 2.6.9 and 15551 in 2.6.24-3)

-bash-3.00$ pgbench -S -c 16 -t 10000 pgbench | grep excluding
tps = 7063.710786 (excluding connections establishing)
tps = 6956.266777 (excluding connections establishing)
tps = 7120.971600 (excluding connections establishing)
(was 14148 in 2.6.9 and 7311 in 2.6.24-3)

-bash-3.00$ pgbench -S -c 32 -t 10000 pgbench | grep excluding
tps = 7006.311636 (excluding connections establishing)
tps = 6971.305909 (excluding connections establishing)
tps = 7002.820583 (excluding connections establishing)
(was 13647 in 2.6.9 and 7141 in 2.6.24-3)

This is what happens when the kernel developers are using results from a MySQL tool to optimize things I guess. It seems I have a lot of work ahead of me here to nail down and report what's going on here.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
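The sweep over client counts behind numbers like these is easy to script; a minimal sketch, with the client counts and repeat count simply matching what was used above:

# Three runs of the select-only test at each client count,
# keeping just the TPS lines tagged by client count.
for clients in 4 8 16 32; do
    for run in 1 2 3; do
        printf "c=%s run=%s  " $clients $run
        pgbench -S -c $clients -t 10000 pgbench | grep excluding
    done
done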
On Thu, 17 Apr 2008, Greg Smith wrote:
> On Thu, 17 Apr 2008, Matthew wrote:
>
>> The last message in the thread says that 2.6.25-rc6 has the problem nailed.
>> That was a month ago. So I guess, upgrade to 2.6.25, which was released
>> today.
>
> Ah, even more support for me to distrust everything I read. The change has
> flattened out things, so now the pgbench results are awful everywhere. On
> this benchmark 2.6.25 is the worst kernel yet:
>
> -bash-3.00$ pgbench -S -c 4 -t 10000 pgbench | grep excluding
> tps = 8619.710649 (excluding connections establishing)
> tps = 8664.321235 (excluding connections establishing)
> tps = 8671.973915 (excluding connections establishing)
> (was 18388 in 2.6.9 and 16621 in 2.6.24-3)
>
> -bash-3.00$ pgbench -S -c 8 -t 10000 pgbench | grep excluding
> tps = 9011.728765 (excluding connections establishing)
> tps = 9039.441796 (excluding connections establishing)
> tps = 9206.574000 (excluding connections establishing)
> (was 15760 in 2.6.9 and 15551 in 2.6.24-3)
>
> -bash-3.00$ pgbench -S -c 16 -t 10000 pgbench | grep excluding
> tps = 7063.710786 (excluding connections establishing)
> tps = 6956.266777 (excluding connections establishing)
> tps = 7120.971600 (excluding connections establishing)
> (was 14148 in 2.6.9 and 7311 in 2.6.24-3)
>
> -bash-3.00$ pgbench -S -c 32 -t 10000 pgbench | grep excluding
> tps = 7006.311636 (excluding connections establishing)
> tps = 6971.305909 (excluding connections establishing)
> tps = 7002.820583 (excluding connections establishing)
> (was 13647 in 2.6.9 and 7141 in 2.6.24-3)
>
> This is what happens when the kernel developers are using results from a
> MySQL tool to optimize things I guess. It seems I have a lot of work ahead
> of me here to nail down and report what's going on here.

report this to the kernel list so that they know, and be ready to test fixes. the kernel developers base success or failure on the results of tests. if the only people providing test results are MySQL people, how would they know there is a problem?

David Lang
On Thu, 17 Apr 2008, david@lang.hm wrote:
> report this to the kernel list so that they know, and be ready to test fixes.

Don't worry, I'm on that. I'm already having enough problems with database performance under Linux; if they start killing results on the easy benchmarks I'll really be in trouble.

> if the only people providing test results are MySQL people, how would
> they know there is a problem?

The thing I was alluding to is that both FreeBSD and Linux kernel developers are now doing all their PostgreSQL tests with sysbench, a MySQL tool with rudimentary PostgreSQL support bolted on (badly). I think rather than complain about it, someone (and I fear this will be me) needs to just fix that so it works well. It's really handy for people to have something they can get familiar with that runs against both databases in a way that lets them be compared fairly. Right now PG beats MySQL on scalability in the read tests despite that; the big problems with PG+sysbench show up when you try to write with it.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
Greg Smith <gsmith@gregsmith.com> writes:
> This is what happens when the kernel developers are using results from a
> MySQL tool to optimize things I guess. It seems I have a lot of work
> ahead of me here to nail down and report what's going on here.

Yeah, it's starting to be obvious that we'd better not ignore sysbench as "not our problem". Do you have any roadmap on what needs to be done to it?

			regards, tom lane
On Fri, 18 Apr 2008, Tom Lane wrote:
> Yeah, it's starting to be obvious that we'd better not ignore sysbench
> as "not our problem". Do you have any roadmap on what needs to be done
> to it?

Just dug into this code again for a minute, and it goes something like this:

1) Wrap the write statements into transactions properly so the OLTP code works. There's a BEGIN/COMMIT in there, but last time I tried that test it just deadlocked on me (I got a report of the same from someone else as well). There are some FIXME items in the code for PostgreSQL already that might be related here.

2) Make sure the implementation is running statistics correctly (they create a table and index, but there's certainly no ANALYZE in there).

3) Implement the parts of the driver wrapper that haven't been done yet.

4) Try to cut down on crashes (I recall a lot of these when I tried to use all the features).

5) Compare performance on some simple operations to pgbench to see if it's competitive. Look into whether there's code in the PG wrapper they use that can be optimized usefully. There are two performance-related things that jump right out as things I'd want to confirm aren't causing issues:

-It's a threaded design
-The interesting tests look like they use prepared statements.

I think the overall approach sysbench uses is good, it just needs some adjustments to work right against a PG database.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
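For item 5, the kind of side-by-side run in mind looks roughly like this; the sysbench flags are from memory of the 0.4-era command line and its PostgreSQL driver, so treat them as placeholders and check --help on whatever build you actually have:

# Read-only pgbench baseline against the usual small database
pgbench -S -c 8 -t 10000 pgbench | grep excluding

# Roughly equivalent read-only run through sysbench's OLTP test
# (flag names approximate; assumes sysbench was built with pgsql support)
sysbench --test=oltp --db-driver=pgsql --pgsql-db=sbtest \
         --oltp-table-size=1000000 prepare
sysbench --test=oltp --db-driver=pgsql --pgsql-db=sbtest \
         --oltp-read-only=on --num-threads=8 --max-requests=80000 run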
On Thu, 17 Apr 2008, I wrote:
>> There is only one
>> central tunable (you have to switch on CONFIG_SCHED_DEBUG):
>>
>>   /proc/sys/kernel/sched_granularity_ns
>>
>> which can be used to tune the scheduler from 'desktop' (low
>> latencies) to 'server' (good batching) workloads. It defaults to a
>> setting suitable for desktop workloads. SCHED_BATCH is handled by the
>> CFS scheduler module too.
>
> So it'd be worth compiling a kernel with CONFIG_SCHED_DEBUG switched on and
> try increasing that value, and see if that fixes the problem. Alternatively,
> use sched_setscheduler to set SCHED_BATCH, which should increase the
> timeslice (a Linux-only option).

Looking at the problem a bit closer, it's obvious to me that larger timeslices would not have fixed this problem, so ignore my suggestion. It appears that the problem is caused by inter-process communication blocking and causing processes to be put right to the back of the run queue, which produces a very fine-grained round-robin of the runnable processes and trashes the CPU caches. You may also be seeing processes forced to switch between CPUs, which breaks the caches even more.

So what happens if you run pgbench on a separate machine from the server? Does the problem still exist in that case?

Matthew

--
X's book explains this very well, but, poor bloke, he did the Cambridge Maths Tripos...
		-- Computer Science Lecturer
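That separate-machine run is just the same select-only test pointed at the server over the network; a sketch, with the hostname a placeholder and the usual caveat that pg_hba.conf and listen_addresses on the server have to allow the connection:

# From a second machine: the same read-only test, over TCP to the server
# (connection user and port left at their defaults).
for clients in 4 8 16 32; do
    pgbench -S -c $clients -t 10000 -h db-server.example.com pgbench | grep excluding
done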
On Fri, 18 Apr 2008, Matthew wrote:
> You may also be seeing processes forced to switch between CPUs, which
> breaks the caches even more. So what happens if you run pgbench on a
> separate machine from the server? Does the problem still exist in that
> case?

I haven't run that test yet but will before I submit a report. I did, however, try running things with the pgbench executable itself bound to a single CPU, with no improvement.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
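For reference, binding the client that way is a one-liner; a sketch, with the CPU number arbitrary:

# Run the pgbench client pinned to CPU 0 while the server processes
# stay free to migrate; taskset is part of util-linux.
taskset -c 0 pgbench -S -c 8 -t 10000 pgbench | grep excluding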
On Fri, 18 Apr 2008, Matthew wrote:
> So what happens if you run pgbench on a separate machine from the server?
> Does the problem still exist in that case?

It does not. At the low client counts there's a big drop-off relative to running on localhost, just because of running over the network. But once I get to 4 clients the remote pgbench setup is even with the localhost one. At 50 clients, the all-local setup is at 8100 TPS while the remote pgbench is at 26000.

So it's pretty clear to me now that the biggest problem here is the pgbench client itself not working well at all with the newer kernels. It's difficult to see through that to tell for sure how well each kernel version is handling the server portion of the job underneath. I hope to have time this week to finally submit all this to lkml.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
>>> On Thu, Apr 17, 2008 at 7:26 PM, Greg Smith wrote:
> On this benchmark 2.6.25 is the worst kernel yet:

> It seems I have a lot of work ahead of me here
> to nail down and report what's going on here.

I don't remember seeing a follow-up on this issue from last year. Are there still any particular kernels to avoid based on this?

Thanks,

-Kevin
On Tue, 31 Mar 2009, Kevin Grittner wrote:
>>>> On Thu, Apr 17, 2008 at 7:26 PM, Greg Smith wrote:
>
>> On this benchmark 2.6.25 is the worst kernel yet:
>
>> It seems I have a lot of work ahead of me here
>> to nail down and report what's going on here.
>
> I don't remember seeing a follow-up on this issue from last year.
> Are there still any particular kernels to avoid based on this?

I never got any confirmation that the patches that came out of my discussions with the kernel developers were ever merged. I'm in the middle of a bunch of pgbench tests this week, and one of the things I planned to try was seeing whether the behavior has changed in 2.6.28 or 2.6.29. I'm speaking about pgbench at the PostgreSQL East conference this weekend and will have an update by then (along with a new toolchain for automating large quantities of pgbench tests).

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
On Tue, 31 Mar 2009, Kevin Grittner wrote:
>>>> On Thu, Apr 17, 2008 at 7:26 PM, Greg Smith wrote:
>
>> On this benchmark 2.6.25 is the worst kernel yet:
>
> I don't remember seeing a follow-up on this issue from last year.
> Are there still any particular kernels to avoid based on this?

I just discovered something really fascinating here. The problem is strictly limited to when you're connecting via Unix-domain sockets; use TCP/IP instead, and it goes away.

To refresh everyone's memory, I reported a problem to the LKML here:
http://lkml.org/lkml/2008/5/21/292

I got some patches and some kernel tweaks for the scheduler, but never a clear resolution of the cause, which kept anybody from getting too excited about merging anything. Test results comparing various tweaks on the hardware I'm still using now are at http://lkml.org/lkml/2008/5/26/288

For example, here's kernel 2.6.25 running pgbench with 50 clients with a Q6000 processor, demonstrating poor performance--I'd get >20K TPS here with a pre-CFS kernel:

$ pgbench -S -t 4000 -c 50 -n pgbench
transaction type: SELECT only
scaling factor: 10
query mode: simple
number of clients: 50
number of transactions per client: 4000
number of transactions actually processed: 200000/200000
tps = 8288.047442 (including connections establishing)
tps = 8319.702195 (excluding connections establishing)

If I now execute exactly the same test, but using localhost, performance returns to normal:

$ pgbench -S -t 4000 -c 50 -n -h localhost pgbench
transaction type: SELECT only
scaling factor: 10
query mode: simple
number of clients: 50
number of transactions per client: 4000
number of transactions actually processed: 200000/200000
tps = 17575.277771 (including connections establishing)
tps = 17724.651090 (excluding connections establishing)

That's 100% repeatable; I ran each test several times each way. So the new summary of what I've found is that if:

1) You're running Linux 2.6.23 or greater (confirmed up to 2.6.26)
2) You connect over a Unix-domain socket
3) Your client count is relatively high (>8 clients/core)

then you can expect your pgbench results to tank. Switch to connecting over TCP/IP to localhost and everything is fine; it's not quite as fast as the pre-CFS kernels in some cases, though in others it's faster. I haven't gotten to testing kernels newer than 2.6.26 yet; when I saw a 17K TPS result during one of my tests on 2.6.25, I screeched to a halt to isolate this instead.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD
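For a regular client application the same workaround is just pointing the connection at localhost rather than the default socket, since libpq switches to TCP/IP as soon as a host name is given; a sketch, with the database and query only examples:

# libpq uses TCP/IP instead of the Unix-domain socket once a host is set
export PGHOST=localhost
export PGPORT=5432
# 'accounts' is the pgbench table name in 8.2/8.3; 8.4+ calls it pgbench_accounts
psql pgbench -c "SELECT count(*) FROM accounts;"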
On 4/4/09 9:07 AM, Greg Smith wrote:
> On Tue, 31 Mar 2009, Kevin Grittner wrote:
>
>>>>> On Thu, Apr 17, 2008 at 7:26 PM, Greg Smith wrote:
>>
>>> On this benchmark 2.6.25 is the worst kernel yet:
>>
>> I don't remember seeing a follow-up on this issue from last year.
>> Are there still any particular kernels to avoid based on this?
>
> I just discovered something really fascinating here. The problem is
> strictly limited to when you're connecting via Unix-domain sockets; use
> TCP/IP instead, and it goes away.

Have you sent this to any Linux kernel engineers? My experience is that they're fairly responsive to this sort of thing.

--Josh
On Sat, 4 Apr 2009, Josh Berkus wrote:
> Have you sent this to any Linux kernel engineers? My experience is that
> they're fairly responsive to this sort of thing.

I'm going to submit an updated report to LKML once I get back from East; I want to test against the latest kernel first.

--
* Greg Smith  gsmith@gregsmith.com  http://www.gregsmith.com  Baltimore, MD