Thread: Linux I/O tuning: CFQ vs. deadline
Recently I've made a number of unsubstantiated claims that the deadline
scheduler on Linux does bad things compared to CFQ when running real-world
mixed I/O database tests. Unfortunately, every time I do one of these I
end up unable to release the results due to client confidentiality issues.
However, I do keep an eye out for people who run into the same issues in
public benchmarks, and I just found one:

http://insights.oetiker.ch/linux/fsopbench/

The problem analyzed in the "Deadline considered harmful" section looks
exactly like what I run into: deadline just does some bad things when the
I/O workload gets complicated. And the conclusion reached there, "the
deadline scheduler did not have advantages in any of our test cases", has
been my conclusion for every round of pgbench-based testing I've done too.

In that case, the specific issue is that reads get blocked badly when
checkpoint writes are doing heavier work; you can see the read I/O numbers
reported by "vmstat 1" go completely to zero for a second or more when it
happens. That can happen with CFQ, too, but it consistently seems more
likely to occur with deadline.

--
Greg Smith  2ndQuadrant  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.com
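[For reference: the stall described above shows up in the "bi" (blocks in)
column of vmstat while a checkpoint is writing, and the active scheduler
is the bracketed entry in sysfs. A minimal sketch, assuming the database
volume is /dev/sda:

    # watch read traffic once per second; "bi" dropping to 0 during a
    # checkpoint is the stall described above
    vmstat 1

    # list the available schedulers; the active one is shown in brackets
    cat /sys/block/sda/queue/scheduler

    # switch the scheduler at runtime (immediate effect, not persistent
    # across reboots)
    echo cfq > /sys/block/sda/queue/scheduler
]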
Greg Smith wrote:
> Recently I've made a number of unsubstantiated claims that the deadline
> scheduler on Linux does bad things compared to CFQ when running
> real-world mixed I/O database tests. Unfortunately every time I do one
> of these I end up unable to release the results due to client
> confidentiality issues. However, I do keep an eye out for people who
> run into the same issues in public benchmarks, and I just found one:
> http://insights.oetiker.ch/linux/fsopbench/

That is interesting, particularly since I have had one quite different
experience in which deadline outperformed CFQ by a factor of
approximately 4.

So I tried to look for differences, and I found two possible places:

- My test case was read-only; our production system is read-mostly.
- We did not have a RAID array, but a SAN box (with RAID inside).
  The "noop" scheduler performed about as well as "deadline".

I wonder if the two differences above could explain the different result.

Yours,
Laurenz Albe
"Albe Laurenz" <laurenz.albe@wien.gv.at> wrote: > Greg Smith wrote: >> http://insights.oetiker.ch/linux/fsopbench/ > > That is interesting; particularly since I have made one quite > different experience in which deadline outperformed CFQ by a > factor of approximately 4. I haven't benchmarked it per se, but when we started using PostgreSQL on Linux, the benchmarks and posts I could find recommended deadline=elevator, so we went with that, and when the setting was missed on a machine it was generally found fairly quickly because people complained that the machine wasn't performing to expectations; changing this to deadline corrected the problem. > So I tried to look for differences, and I found two possible > places: > - My test case was read-only, our production system is > read-mostly. Yeah, our reads are typically several times our writes -- up to maybe 10 to 1. > - We did not have a RAID array, but a SAN box (with RAID inside). No SAN here, but if I recall correctly, this was mostly an issue on our larger arrays -- RAID 5 with dozens of spindles on a BBU hardware controller. Other differences between our environment and that of the benchmarks cited above: - We use SuSE Linux Enterprise Server, so we've been on *much* earlier kernel versions that this benchmark. - We've been using xfs, with noatime,nobarrier. I'll keep this in mind as something to try if we have problem performance in line with what that page describes, though.... -Kevin
Kevin Grittner wrote:
> I'll keep this in mind as something to try if we have problem
> performance in line with what that page describes, though....

That's basically what I've been trying to make clear all along: people
should keep an open mind, watch what happens, and not make any
assumptions. There's no clear-cut preference for one scheduler or the
other in all situations. I've seen CFQ do much better; you and Albe
report situations where the opposite is true. I was just happy to see
another report of someone running into the same sort of issue I've been
seeing, because I didn't have very much data to offer about why the
standard advice of "always use deadline for a database app" might not
apply to everyone.

--
Greg Smith  2ndQuadrant  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.com
> That's basically what I've been trying to make clear all along: people
> should keep an open mind, watch what happens, and not make any
> assumptions. There's no clear-cut preference for one scheduler or the
> other in all situations. I've seen CFQ do much better; you and Albe
> report situations where the opposite is true. I was just happy to see
> another report of someone running into the same sort of issue I've been
> seeing, because I didn't have very much data to offer about why the
> standard advice of "always use deadline for a database app" might not
> apply to everyone.

Damn, you would have to make things complicated, eh?

FWIW, back when deadline was first introduced, Mark Wong did some tests
and found deadline to be the fastest of the 4 on DBT2 ... but only by
about 5%. If the read vs. checkpoint analysis is correct, what was
happening is that the penalty for checkpoints on deadline was almost
wiping out the advantage for reads, but not quite.

Those tests were also done on attached storage.

So, what this suggests is:

  reads:            deadline > CFQ
  writes:           CFQ > deadline
  attached storage: deadline > CFQ

Man, we'd need a lot of testing to settle this. I guess that's why Linux
gives us the choice of 4 ...

--Josh Berkus
On Mon, Feb 8, 2010 at 10:49 AM, Josh Berkus <josh@agliodbs.com> wrote:
>
>> That's basically what I've been trying to make clear all along: people
>> should keep an open mind, watch what happens, and not make any
>> assumptions. There's no clear-cut preference for one scheduler or the
>> other in all situations. I've seen CFQ do much better; you and Albe
>> report situations where the opposite is true. I was just happy to see
>> another report of someone running into the same sort of issue I've
>> been seeing, because I didn't have very much data to offer about why
>> the standard advice of "always use deadline for a database app" might
>> not apply to everyone.
>
> Damn, you would have to make things complicated, eh?
>
> FWIW, back when deadline was first introduced, Mark Wong did some tests
> and found deadline to be the fastest of the 4 on DBT2 ... but only by
> about 5%. If the read vs. checkpoint analysis is correct, what was
> happening is that the penalty for checkpoints on deadline was almost
> wiping out the advantage for reads, but not quite.
>
> Those tests were also done on attached storage.
>
> So, what this suggests is:
>   reads:            deadline > CFQ
>   writes:           CFQ > deadline
>   attached storage: deadline > CFQ
>
> Man, we'd need a lot of testing to settle this. I guess that's why
> Linux gives us the choice of 4 ...

Just to add to the data points: on an 8-core Opteron with an Areca 1680,
a 12-disk RAID-10 for data, and a 2-disk RAID-1 for WAL, I get noticeably
better performance (approximately 15%) and lower load factors (they drop
from about 8 to 5 or 6) running noop instead of the default scheduler, on
RHEL 5.4 with the 2.6.18-92.el5 kernel from RHEL 5.2.
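[For reference: elevator= on the kernel command line applies to every
block device, so a setup like the one above, with separate data and WAL
arrays, can instead set the scheduler per device at boot. A minimal
sketch, assuming the data array appears as sdb and the WAL array as sdc:

    # /etc/rc.d/rc.local -- per-device scheduler selection at boot
    echo noop > /sys/block/sdb/queue/scheduler   # 12-disk RAID-10 (data)
    echo noop > /sys/block/sdc/queue/scheduler   # 2-disk RAID-1 (WAL)
]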
On Mon, Feb 8, 2010 at 9:49 AM, Josh Berkus <josh@agliodbs.com> wrote:
>
>> That's basically what I've been trying to make clear all along: people
>> should keep an open mind, watch what happens, and not make any
>> assumptions. There's no clear-cut preference for one scheduler or the
>> other in all situations. I've seen CFQ do much better; you and Albe
>> report situations where the opposite is true. I was just happy to see
>> another report of someone running into the same sort of issue I've
>> been seeing, because I didn't have very much data to offer about why
>> the standard advice of "always use deadline for a database app" might
>> not apply to everyone.
>
> Damn, you would have to make things complicated, eh?
>
> FWIW, back when deadline was first introduced, Mark Wong did some tests
> and found deadline to be the fastest of the 4 on DBT2 ... but only by
> about 5%. If the read vs. checkpoint analysis is correct, what was
> happening is that the penalty for checkpoints on deadline was almost
> wiping out the advantage for reads, but not quite.
>
> Those tests were also done on attached storage.
>
> So, what this suggests is:
>   reads:            deadline > CFQ
>   writes:           CFQ > deadline
>   attached storage: deadline > CFQ
>
> Man, we'd need a lot of testing to settle this. I guess that's why
> Linux gives us the choice of 4 ...

I wonder what the impact is from the underlying RAID configuration. Those
DBT2 tests were also run on LVM-striped volumes on top of single RAID-0
LUNs (no JBOD option).

Regards,
Mark
On Feb 8, 2010, at 9:49 AM, Josh Berkus wrote:
>
> Those tests were also done on attached storage.
>
> So, what this suggests is:
>   reads:            deadline > CFQ
>   writes:           CFQ > deadline
>   attached storage: deadline > CFQ

From my experience on reads: large sequential scans mixed with concurrent
random reads behave very differently between the two schedulers. Deadline
has _significantly_ higher throughput in this situation, but the random
read latency is higher. CFQ will starve the sequential scan in favor of
letting each concurrent read get some time. If your app is very latency
sensitive on reads, that is good. If you need max throughput, getting the
sequential scan out of the way instead of breaking it up into lots of
small chunks is critical.

I think it is this behavior that causes the delays on writes -- from the
scheduler's point of view, a large set of writes is usually somewhat
sequential, and deadline favors throughput over latency.

Generally, my writes are large bulk writes, and I am not very latency
sensitive but am very throughput sensitive. So deadline helps a great
deal (combined with decently sized readahead). Other use cases will
clearly have different preferences.

My experience with scheduler performance tuning is on CentOS 5.3 and 5.4.
With the changes to much of the I/O layer in the latest kernels, I would
not be surprised if things have changed.

> Man, we'd need a lot of testing to settle this. I guess that's why
> Linux gives us the choice of 4 ...
>
> --Josh Berkus
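[For reference: the readahead knob mentioned above is set per block
device, and the value is in 512-byte sectors, so 8192 is 4MB. A minimal
sketch, assuming the data array is /dev/sdb:

    # show the current readahead setting (in 512-byte sectors)
    blockdev --getra /dev/sdb

    # raise it to 4MB to favor sequential scan throughput
    blockdev --setra 8192 /dev/sdb
]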
Josh Berkus wrote:
> FWIW, back when deadline was first introduced, Mark Wong did some tests
> and found deadline to be the fastest of the 4 on DBT2 ... but only by
> about 5%. If the read vs. checkpoint analysis is correct, what was
> happening is that the penalty for checkpoints on deadline was almost
> wiping out the advantage for reads, but not quite.

Wasn't that before 8.3, where the whole checkpoint-spreading logic showed
up? That's really a whole different write pattern now than it was then.
8.2 checkpoint writes were one big batch write, amenable to optimizing
for throughput. The new ones are not; the I/O is intermixed with reads
most of the time.

> Man, we'd need a lot of testing to settle this. I guess that's why
> Linux gives us the choice of 4 ...

A recent one of these I worked on started with 4,096 possible I/O
configurations, from which we pruned down the most likely good
candidates. I'm almost ready to schedule a week on Mark's HP performance
test system in the lab now, to try and nail this down in a fully public
environment for once.

--
Greg Smith  2ndQuadrant  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.com
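[For reference: the 8.3 spreading behavior Greg refers to is controlled
by checkpoint_completion_target. A minimal sketch of the relevant
postgresql.conf settings; the values are illustrative, not
recommendations from this thread:

    # postgresql.conf -- spread checkpoint writes over the interval
    checkpoint_segments = 32              # WAL segments between checkpoints
    checkpoint_timeout = 15min            # maximum time between checkpoints
    checkpoint_completion_target = 0.9    # finish writes at 90% of interval
]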
Hannu Krosing wrote:
> Have you kept track of what filesystems are in use?

Almost everything I do on Linux has been with ext3. I had a previous
diversion into VxFS and an upcoming one into XFS that may shed more light
on all this.

And, yes, the whole I/O scheduling approach in Linux was just completely
redesigned for a very recent kernel update. So even what we think we know
is already obsolete in some respects.

--
Greg Smith  2ndQuadrant  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.com
On Mon, 8 Feb 2010, Greg Smith wrote:
> Hannu Krosing wrote:
>> Have you kept track of what filesystems are in use?
>
> Almost everything I do on Linux has been with ext3. I had a previous
> diversion into VxFS and an upcoming one into XFS that may shed more
> light on all this.

It would be nice if you could try ext4 when doing your tests. It's new
enough that I won't trust it for production data yet, but a lot of people
are jumping on it as if it were just a minor update to ext3, instead of
an almost entirely new filesystem.

David Lang

> And, yes, the whole I/O scheduling approach in Linux was just
> completely redesigned for a very recent kernel update. So even what we
> think we know is already obsolete in some respects.
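[For reference: a minimal sketch of setting up ext4 for such a test run,
assuming a scratch partition /dev/sdd1 and e2fsprogs 1.41 or later; per
the caveat above, for benchmarking only, not production data:

    # create and mount an ext4 filesystem for a test run
    mkfs.ext4 /dev/sdd1
    mkdir -p /mnt/ext4test
    mount -o noatime /dev/sdd1 /mnt/ext4test
]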
On Feb 8, 2010, at 11:35 PM, david@lang.hm wrote:
>
>> And, yes, the whole I/O scheduling approach in Linux was just
>> completely redesigned for a very recent kernel update. So even
>> what we think we know is already obsolete in some respects.

I'd done some testing a while ago on the schedulers, and at the time
deadline or noop smashed cfq. Now, it is 100% possible that since then
they've made vast improvements to cfq and/or the VM to get better or
similar performance. I recall a vintage of 2.6 where they severely messed
up the VM. Glad I didn't upgrade to that one :)

Here's the old post:
http://archives.postgresql.org/pgsql-performance/2008-04/msg00155.php

--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/
Jeff wrote:
> I'd done some testing a while ago on the schedulers, and at the time
> deadline or noop smashed cfq. Now, it is 100% possible that since then
> they've made vast improvements to cfq and/or the VM to get better or
> similar performance. I recall a vintage of 2.6 where they severely
> messed up the VM. Glad I didn't upgrade to that one :)
>
> Here's the old post:
> http://archives.postgresql.org/pgsql-performance/2008-04/msg00155.php

pgiosim doesn't really mix writes in there though, does it? The mixed
read/write situations are the ones where the scheduler stuff gets messy.

--
Greg Smith  2ndQuadrant  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.com
On Tue, Feb 9, 2010 at 11:37 PM, Greg Smith <greg@2ndquadrant.com> wrote:
> Jeff wrote:
>> I'd done some testing a while ago on the schedulers, and at the time
>> deadline or noop smashed cfq. Now, it is 100% possible that since then
>> they've made vast improvements to cfq and/or the VM to get better or
>> similar performance. I recall a vintage of 2.6 where they severely
>> messed up the VM. Glad I didn't upgrade to that one :)
>>
>> Here's the old post:
>> http://archives.postgresql.org/pgsql-performance/2008-04/msg00155.php
>
> pgiosim doesn't really mix writes in there though, does it? The mixed
> read/write situations are the ones where the scheduler stuff gets messy.

I agree. I think the only way to really test it is by testing it against
the system it's got to run under. I'd love to see someone do a comparison
of early-to-mid 2.6 kernels (2.6.18 like RHEL5) to very up-to-date 2.6
kernels, on fast hardware. What it does on a laptop isn't that
interesting, and I don't have a big machine idle to test it on.
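[For reference: a tool not mentioned in this thread, fio, can approximate
a repeatable mixed read/write load without a full database install. A
minimal sketch, assuming fio with the libaio engine is available and a
75/25 random read/write mix at PostgreSQL's 8k block size; the target
directory is illustrative:

    # fio job: concurrent random reads and writes against a 4GB file set
    fio --name=mixed --rw=randrw --rwmixread=75 --bs=8k \
        --size=4g --numjobs=4 --runtime=300 --time_based \
        --ioengine=libaio --direct=1 --directory=/mnt/test
]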
On Feb 10, 2010, at 1:37 AM, Greg Smith wrote:
> Jeff wrote:
>> I'd done some testing a while ago on the schedulers, and at the time
>> deadline or noop smashed cfq. Now, it is 100% possible that since then
>> they've made vast improvements to cfq and/or the VM to get better or
>> similar performance. I recall a vintage of 2.6 where they severely
>> messed up the VM. Glad I didn't upgrade to that one :)
>>
>> Here's the old post:
>> http://archives.postgresql.org/pgsql-performance/2008-04/msg00155.php
>
> pgiosim doesn't really mix writes in there though, does it? The mixed
> read/write situations are the ones where the scheduler stuff gets messy.

It has the ability to rewrite blocks randomly as well -- but I honestly
don't remember if I did that during my cfq/deadline test. I'd wager I
didn't. Maybe I'll get some time to run some more tests on it in the next
couple of days.

--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/
On Feb 9, 2010, at 10:37 PM, Greg Smith wrote:
> Jeff wrote:
>> I'd done some testing a while ago on the schedulers, and at the time
>> deadline or noop smashed cfq. Now, it is 100% possible that since then
>> they've made vast improvements to cfq and/or the VM to get better or
>> similar performance. I recall a vintage of 2.6 where they severely
>> messed up the VM. Glad I didn't upgrade to that one :)
>>
>> Here's the old post:
>> http://archives.postgresql.org/pgsql-performance/2008-04/msg00155.php
>
> pgiosim doesn't really mix writes in there though, does it? The mixed
> read/write situations are the ones where the scheduler stuff gets messy.

Also, read/write mix performance depends on the file system, not just the
scheduler. The block device readahead parameter can have a big impact
too. If you test xfs, make sure you configure the 'allocsize' mount
parameter properly as well. If there are any sequential reads or writes
in there mixed with other reads/writes, that can have a big impact on how
fragmented the filesystem gets.

Ext3 has several characteristics for writes that might favor cfq that
other file systems do not. Features like delayed allocation, extents, and
write barriers significantly change the pattern of writes seen by the I/O
scheduler. In short, one scheduler may be best for one filesystem, but
not a good idea for others.

And then on top of that, it all depends on what type of DB you're
running. Lots of small, fast, mostly-read queries? Large number of small
writes? Large bulk writes? Large reporting queries? Different
configurations and tuning are required to maximize performance on each.

There is no single rule for Postgres on Linux that I can think of, other
than "never have ext3 in 'ordered' or 'journal' mode for your WAL on the
same filesystem as your data".
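[For reference: a minimal fstab sketch along the lines of the advice
above. Device names and the allocsize value are illustrative; allocsize
sets xfs's preferred allocation size to fight fragmentation under mixed
sequential/random writes, and data=writeback relaxes ext3's ordered-mode
journaling on a dedicated WAL volume -- commonly considered acceptable
there because the WAL provides its own consistency, though that is an
assumption here, not something stated in this thread:

    # /etc/fstab
    # xfs data volume: larger preallocation to reduce fragmentation
    /dev/sdb1  /pgdata  xfs   noatime,allocsize=64m   0 0
    # dedicated ext3 WAL volume: writeback journal mode
    /dev/sdc1  /pgwal   ext3  noatime,data=writeback  0 0
]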
Scott Marlowe wrote:
> I'd love to see someone do a comparison of early-to-mid 2.6 kernels
> (2.6.18 like RHEL5) to very up-to-date 2.6 kernels, on fast hardware.

I'd be happy just to find fast hardware that works on every kernel from
the RHEL5 2.6.18 up to the latest one without issues.

--
Greg Smith  2ndQuadrant  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.com
On Wed, 10 Feb 2010, Greg Smith wrote:
> Scott Marlowe wrote:
>> I'd love to see someone do a comparison of early-to-mid 2.6 kernels
>> (2.6.18 like RHEL5) to very up-to-date 2.6 kernels, on fast hardware.
>
> I'd be happy just to find fast hardware that works on every kernel from
> the RHEL5 2.6.18 up to the latest one without issues.

It depends on your definition of 'fast hardware'. I have boxes that were
very fast at the time that work on all these kernels, but they wouldn't
be considered fast by today's standards.

Remember that there is a point release about every 3 months, and 2.6.33
is about to be released, so this is a 3 x (33 - 18) = ~45-month-old
kernel. Hardware progresses a LOT in 4 years.

Most of my new hardware has no problems with the old kernels as well, but
once in a while I run into something that doesn't work.

David Lang
david@lang.hm wrote:
> Most of my new hardware has no problems with the old kernels as well,
> but once in a while I run into something that doesn't work.

Quick survey, just of what's within 20 feet of me:

- Primary desktop: 2 years old, requires 2.6.23 or later for SATA to
  work.
- Server: 3 years old, requires 2.6.22 or later for the Areca card not
  to panic under load.
- Laptops: both about 2 years old, and require 2.6.28 to work at all;
  mostly wireless issues, but some power management ones that impact the
  processor working right too, and occasional SATA ones.

I'm looking into a new primary desktop to step up to 8 HT cores; I fully
expect it won't boot anything older than 2.6.28, and may take an even
newer kernel just for the basic processor and disk parts to work.

We're kind of at a worst-case point right now for this sort of thing, on
the tail side of the almost 3-year-old RHEL5 using a 3.5-year-old kernel
as the standard for so many Linux server deployments. Until RHEL6 is
ready to go, there's little motivation for the people who make server
hardware to get all their drivers perfect in the newer kernels. Just
after that ships will probably be a good time to do that sort of
comparison, like it was possible to easily compare RHEL4 using 2.6.9 and
RHEL5 with 2.6.18 in mid-to-late 2007, with many bits of high-performance
hardware known to work well on each.

--
Greg Smith  2ndQuadrant  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.com
On Mon, 2010-02-08 at 09:49 -0800, Josh Berkus wrote:
> FWIW, back when deadline was first introduced, Mark Wong did some tests
> and found deadline to be the fastest of the 4 on DBT2 ... but only by
> about 5%. If the read vs. checkpoint analysis is correct, what was
> happening is that the penalty for checkpoints on deadline was almost
> wiping out the advantage for reads, but not quite.

I also did some tests when I was putting together my Synchronized Scan
benchmarks:

http://j-davis.com/postgresql/83v82_scans.html

CFQ was so slow that I didn't include it in the results at all. The tests
weren't intended to compare schedulers, so I did most of them with
anticipatory (at least the ones on Linux; I also tested FreeBSD).
However, I have some raw data from the tests I did run with CFQ:

http://j-davis.com/postgresql/results/

They will take some interpretation (again, they weren't intended as
scheduler benchmarks). The server was modified to record a log message
every N page accesses.

Regards,
Jeff Davis