Thread: Linux I/O tuning: CFQ vs. deadline

From:
Greg Smith
Date:

Recently I've made a number of unsubstantiated claims that the deadline
scheduler on Linux does bad things compared to CFQ when running
real-world mixed I/O database tests.  Unfortunately every time I do one
of these I end up unable to release the results due to client
confidentiality issues.  However, I do keep an eye out for people who
run into the same issues in public benchmarks, and I just found one:
http://insights.oetiker.ch/linux/fsopbench/

The problem analyzed in the "Deadline considered harmful" section looks
exactly like what I run into:  deadline just does some bad things when
the I/O workload gets complicated.  And the conclusion reached there,
"the deadline scheduler did not have advantages in any of our test
cases", has been my conclusion for every round of pgbench-based testing
I've done too.  In my case, the specific issue is that reads get
blocked badly while checkpoint writes are doing heavier work; you can see
the read I/O numbers reported by "vmstat 1" go completely to zero for a
second or more when it happens.  That can happen with CFQ too, but it
consistently seems more likely to occur with deadline.
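If you want to watch for the same stall yourself, here's a sketch of what I look at; this assumes standard vmstat output, where column 9 is blocks read (bi) and column 10 is blocks written (bo):

```shell
# Print just the block-read (bi) and block-write (bo) columns once per
# second; during a bad checkpoint, bi drops to 0 while bo stays high.
# NR > 2 skips vmstat's two header lines.
vmstat 1 | awk 'NR > 2 { print "bi=" $9, "bo=" $10 }'
```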

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
"Albe Laurenz"
Date:

Greg Smith wrote:
> Recently I've made a number of unsubstantiated claims that the deadline
> scheduler on Linux does bad things compared to CFQ when running
> real-world mixed I/O database tests.  Unfortunately every time I do one
> of these I end up unable to release the results due to client
> confidentiality issues.  However, I do keep an eye out for people who
> run into the same issues in public benchmarks, and I just found one:
> http://insights.oetiker.ch/linux/fsopbench/

That is interesting, particularly since I have had one quite different
experience in which deadline outperformed CFQ by a factor of approximately 4.

So I tried to look for differences, and I found two possible places:
- My test case was read-only, our production system is read-mostly.
- We did not have a RAID array, but a SAN box (with RAID inside).

The "noop" scheduler performed about as well as "deadline".
I wonder if the two differences above could explain the different
result.

Yours,
Laurenz Albe

From:
"Kevin Grittner"
Date:

"Albe Laurenz" <> wrote:
> Greg Smith wrote:

>> http://insights.oetiker.ch/linux/fsopbench/
>
> That is interesting; particularly since I have made one quite
> different experience in which deadline outperformed CFQ by a
> factor of approximately 4.

I haven't benchmarked it per se, but when we started using
PostgreSQL on Linux, the benchmarks and posts I could find
recommended elevator=deadline, so we went with that.  When the
setting was missed on a machine, it was generally found fairly
quickly because people complained that the machine wasn't performing
to expectations; changing the scheduler to deadline corrected the problem.
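For reference, this is roughly how we check and flip the setting at runtime (the device name sda here is just an example):

```shell
# The active scheduler is shown in brackets:
cat /sys/block/sda/queue/scheduler
# e.g.: noop anticipatory [deadline] cfq

# Pull out just the active name:
sed 's/.*\[\(.*\)\].*/\1/' /sys/block/sda/queue/scheduler

# Change it at runtime (takes effect immediately, not persistent):
echo deadline > /sys/block/sda/queue/scheduler

# To make it the default for all devices, boot with the kernel
# command line option:
#   elevator=deadline
```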

> So I tried to look for differences, and I found two possible
> places:
> - My test case was read-only, our production system is
>   read-mostly.

Yeah, our reads are typically several times our writes -- up to
maybe 10 to 1.

> - We did not have a RAID array, but a SAN box (with RAID inside).

No SAN here, but if I recall correctly, this was mostly an issue on
our larger arrays -- RAID 5 with dozens of spindles on a BBU
hardware controller.

Other differences between our environment and that of the benchmarks
cited above:

 - We use SuSE Linux Enterprise Server, so we've been on *much*
   earlier kernel versions than this benchmark used.

 - We've been using xfs, with noatime,nobarrier.
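For the record, the fstab entry for that looks something like the following (device and mount point here are made up); note that nobarrier is only reasonable because the array sits behind a battery-backed controller:

```
# /etc/fstab entry for a PostgreSQL data volume on xfs
# (device and mount point are hypothetical):
/dev/sdb1  /var/lib/pgsql  xfs  noatime,nobarrier  0 0
```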

I'll keep this in mind as something to try if we have problem
performance in line with what that page describes, though....

-Kevin

From:
Greg Smith
Date:

Kevin Grittner wrote:
> I'll keep this in mind as something to try if we have problem
> performance in line with what that page describes, though....
>

That's basically what I've been trying to make clear all along:  people
should keep an open mind, watch what happens, and not make any
assumptions.  There's no clear cut preference for one scheduler or the
other in all situations.  I've seen CFQ do much better, you and Albe
report situations where the opposite is true.  I was just happy to see
another report of someone running into the same sort of issue I've been
seeing, because I didn't have very much data to offer about why the
standard advice of "always use deadline for a database app" might not
apply to everyone.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
Josh Berkus
Date:

> That's basically what I've been trying to make clear all along:  people
> should keep an open mind, watch what happens, and not make any
> assumptions.  There's no clear cut preference for one scheduler or the
> other in all situations.  I've seen CFQ do much better, you and Albe
> report situations where the opposite is true.  I was just happy to see
> another report of someone running into the same sort of issue I've been
> seeing, because I didn't have very much data to offer about why the
> standard advice of "always use deadline for a database app" might not
> apply to everyone.

Damn, you would have to make things complicated, eh?

FWIW, back when deadline was first introduced Mark Wong did some tests
and found Deadline to be the fastest of 4 on DBT2 ... but only by about
5%.  If the read vs. checkpoint analysis is correct, what was happening
is the penalty for checkpoints on deadline was almost wiping out the
advantage for reads, but not quite.

Those tests were also done on attached storage.

So, what this suggests is:
reads:  deadline > CFQ
writes: CFQ > deadline
attached storage:  deadline > CFQ

Man, we'd need a lot of testing to settle this.  I guess that's why
Linux gives us the choice of 4 ...

--Josh Berkus

From:
Scott Marlowe
Date:

On Mon, Feb 8, 2010 at 10:49 AM, Josh Berkus <> wrote:
>
>> That's basically what I've been trying to make clear all along:  people
>> should keep an open mind, watch what happens, and not make any
>> assumptions.  There's no clear cut preference for one scheduler or the
>> other in all situations.  I've seen CFQ do much better, you and Albe
>> report situations where the opposite is true.  I was just happy to see
>> another report of someone running into the same sort of issue I've been
>> seeing, because I didn't have very much data to offer about why the
>> standard advice of "always use deadline for a database app" might not
>> apply to everyone.
>
> Damn, you would have to make things complicated, eh?
>
> FWIW, back when deadline was first introduced Mark Wong did some tests
> and found Deadline to be the fastest of 4 on DBT2 ... but only by about
> 5%.  If the read vs. checkpoint analysis is correct, what was happening
> is the penalty for checkpoints on deadline was almost wiping out the
> advantage for reads, but not quite.
>
> Those tests were also done on attached storage.
>
> So, what this suggests is:
> reads:  deadline > CFQ
> writes: CFQ > deadline
> attached storage:  deadline > CFQ
>
> Man, we'd need a lot of testing to settle this.  I guess that's why
> Linux gives us the choice of 4 ...

Just to add to the data points: on an 8 core Opteron with an Areca 1680,
a 12 disk RAID-10 for data, and a 2 disk RAID-1 for WAL, I get noticeably
better performance (approximately 15%) and lower load factors (they
drop from about 8 to 5 or 6) running noop instead of the default scheduler,
on RHEL 5.4 with the 2.6.18-92.el5 kernel from RHEL 5.2.

From:
Mark Wong
Date:

On Mon, Feb 8, 2010 at 9:49 AM, Josh Berkus <> wrote:
>
>> That's basically what I've been trying to make clear all along:  people
>> should keep an open mind, watch what happens, and not make any
>> assumptions.  There's no clear cut preference for one scheduler or the
>> other in all situations.  I've seen CFQ do much better, you and Albe
>> report situations where the opposite is true.  I was just happy to see
>> another report of someone running into the same sort of issue I've been
>> seeing, because I didn't have very much data to offer about why the
>> standard advice of "always use deadline for a database app" might not
>> apply to everyone.
>
> Damn, you would have to make things complicated, eh?
>
> FWIW, back when deadline was first introduced Mark Wong did some tests
> and found Deadline to be the fastest of 4 on DBT2 ... but only by about
> 5%.  If the read vs. checkpoint analysis is correct, what was happening
> is the penalty for checkpoints on deadline was almost wiping out the
> advantage for reads, but not quite.
>
> Those tests were also done on attached storage.
>
> So, what this suggests is:
> reads:  deadline > CFQ
> writes: CFQ > deadline
> attached storage:  deadline > CFQ
>
> Man, we'd need a lot of testing to settle this.  I guess that's why
> Linux gives us the choice of 4 ...

I wonder what the impact is from the underlying RAID configuration.
Those DBT2 tests were also LVM striped volumes on top of single RAID0
LUNS (no jbod option).

Regards.
Mark

From:
Scott Carey
Date:

On Feb 8, 2010, at 9:49 AM, Josh Berkus wrote:

>
> Those tests were also done on attached storage.
>
> So, what this suggests is:
> reads:  deadline > CFQ
> writes: CFQ > deadline
> attached storage:  deadline > CFQ
>

From my experience on reads:
Large sequential scans mixed with concurrent random reads behave very differently between the two schedulers.
Deadline has _significantly_ higher throughput in this situation, but the random read latency is higher.  CFQ will
starve the sequential scan in favor of letting each concurrent read get some time.  If your app is very latency sensitive
on reads, that is good.  If you need max throughput, getting the sequential scan out of the way instead of breaking it
up into lots of small chunks is critical.

I think it is this behavior that causes the delays on writes -- from the scheduler's point of view, a large set of
writes is usually somewhat sequential and deadline favors throughput over latency.

Generally, my writes are large bulk writes, and I am not very latency sensitive but am very throughput sensitive.  So
deadline helps a great deal (combined with decently sized readahead).  Other use cases will clearly have different
preferences.
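For what it's worth, readahead is queried and set per device with blockdev; the value is in 512-byte sectors, and the device name below is just an example:

```shell
# Current readahead, in 512-byte sectors (default is often 256 = 128 KB):
blockdev --getra /dev/sda

# Bump it up for throughput-oriented sequential workloads:
blockdev --setra 4096 /dev/sda

# 4096 sectors * 512 bytes = 2 MB:
echo $(( 4096 * 512 / 1024 )) KB
```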

My experience with scheduler performance tuning is on CentOS 5.3 and 5.4.   With the changes to much of the I/O layer in
the latest kernels, I would not be surprised if things have changed.


> Man, we'd need a lot of testing to settle this.  I guess that's why
> Linux gives us the choice of 4 ...
>
> --Josh Berkus
>
> --
> Sent via pgsql-performance mailing list ()
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance


From:
Greg Smith
Date:

Josh Berkus wrote:
> FWIW, back when deadline was first introduced Mark Wong did some tests
> and found Deadline to be the fastest of 4 on DBT2 ... but only by about
> 5%.  If the read vs. checkpoint analysis is correct, what was happening
> is the penalty for checkpoints on deadline was almost wiping out the
> advantage for reads, but not quite.
>

Wasn't that before 8.3, where the whole checkpoint spreading logic
showed up?  That's really a whole different write pattern now than it
was then.  8.2 checkpoint writes were one big batch write amenable to
optimizing for throughput.  The new ones are not; the I/O is intermixed
with reads most of the time.
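The spread-out behavior is controlled from postgresql.conf; something like the following (8.3-era settings, values here purely illustrative):

```
# postgresql.conf (8.3+): spread checkpoint writes over a fraction of
# the checkpoint interval instead of issuing one big burst.
# Illustrative values only.
checkpoint_segments = 32
checkpoint_timeout = 5min
checkpoint_completion_target = 0.9   # write over 90% of the interval
```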

> Man, we'd need a lot of testing to settle this.  I guess that's why
> Linux gives us the choice of 4 ...
>

A recent one of these I worked on started with 4096 possible I/O
configurations, which we pruned down to the most likely good candidates.
I'm almost ready to schedule a week on Mark's HP performance test system
in the lab now, to try and nail this down in a fully public environment
for once.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
Greg Smith
Date:

Hannu Krosing wrote:
> Have you kept track of which filesystems are in use?
>

Almost everything I do on Linux has been with ext3.  I had a previous
diversion into VxFS and an upcoming one into XFS that may shed more
light on all this.

And, yes, the whole I/O scheduling approach in Linux was just completely
redesigned for a very recent kernel update.  So even what we think we
know is already obsolete in some respects.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
david@lang.hm
Date:

On Mon, 8 Feb 2010, Greg Smith wrote:

> Hannu Krosing wrote:
>> Have you kept track of which filesystems are in use?
>>
>
> Almost everything I do on Linux has been with ext3.  I had a previous
> diversion into VxFS and an upcoming one into XFS that may shed more light on
> all this.

it would be nice if you could try ext4 when doing your tests.

It's new enough that I won't trust it for production data yet, but a lot
of people are jumping on it as if it was just a minor update to ext3
instead of an almost entirely new filesystem.

David Lang

> And, yes, the whole I/O scheduling approach in Linux was just completely
> redesigned for a very recent kernel update.  So even what we think we know is
> already obsolete in some respects.
>
>

From:
Jeff
Date:

On Feb 8, 2010, at 11:35 PM,  wrote:
>
>> And, yes, the whole I/O scheduling approach in Linux was just
>> completely redesigned for a very recent kernel update.  So even
>> what we think we know is already obsolete in some respects.
>>

I'd done some testing a while ago on the schedulers and at the time
deadline or noop smashed cfq.  Now, it is 100% possible that since then
they've made vast improvements to cfq and/or the VM to get better
or similar performance.  I recall a vintage of 2.6 where they severely
messed up the VM. Glad I didn't upgrade to that one :)

Here's the old post: http://archives.postgresql.org/pgsql-performance/2008-04/msg00155.php


--
Jeff Trout <>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/




From:
Greg Smith
Date:

Jeff wrote:
> I'd done some testing a while ago on the schedulers and at the time
> deadline or noop smashed cfq.  Now, it is 100% possible since then
> that they've made vast improvements to cfq and or the VM to get better
> or similar performance.  I recall a vintage of 2.6 where they severely
> messed up the VM. Glad I didn't upgrade to that one :)
>
> Here's the old post:
> http://archives.postgresql.org/pgsql-performance/2008-04/msg00155.php

pgiosim doesn't really mix writes into there though, does it?  The mixed
read/write situations are the ones where the scheduler stuff gets messy.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
Scott Marlowe
Date:

On Tue, Feb 9, 2010 at 11:37 PM, Greg Smith <> wrote:
> Jeff wrote:
>>
>> I'd done some testing a while ago on the schedulers and at the time
>> deadline or noop smashed cfq.  Now, it is 100% possible since then that
>> they've made vast improvements to cfq and or the VM to get better or similar
>> performance.  I recall a vintage of 2.6 where they severely messed up the
>> VM. Glad I didn't upgrade to that one :)
>>
>> Here's the old post:
>> http://archives.postgresql.org/pgsql-performance/2008-04/msg00155.php
>
> pgiosim doesn't really mix writes into there though, does it?  The mixed
> read/write situations are the ones where the scheduler stuff gets messy.

I agree. I think the only way to really test it is by testing it
against the system it's got to run under.  I'd love to see someone do
a comparison of early to mid 2.6 kernels (2.6.18 like RHEL5) to very
up to date 2.6 kernels.  On fast hardware.  What it does on a laptop
isn't that interesting and I don't have a big machine idle to test it
on.

From:
Jeff
Date:

On Feb 10, 2010, at 1:37 AM, Greg Smith wrote:

> Jeff wrote:
>> I'd done some testing a while ago on the schedulers and at the time
>> deadline or noop smashed cfq.  Now, it is 100% possible since then
>> that they've made vast improvements to cfq and or the VM to get
>> better or similar performance.  I recall a vintage of 2.6 where
>> they severely messed up the VM. Glad I didn't upgrade to that one :)
>>
>> Here's the old post: http://archives.postgresql.org/pgsql-performance/2008-04/msg00155.php
>
> pgiosim doesn't really mix writes into there though, does it?  The
> mixed read/write situations are the ones where the scheduler stuff
> gets messy.
>

It has the ability to rewrite blocks randomly as well, but I
honestly don't remember if I did that during my cfq/deadline test.
I'd wager I didn't.  Maybe I'll get some time to run some more tests
on it in the next couple of days.

> --
> Greg Smith    2ndQuadrant   Baltimore, MD
> PostgreSQL Training, Services and Support
>   www.2ndQuadrant.com
>

--
Jeff Trout <>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/




From:
Scott Carey
Date:

On Feb 9, 2010, at 10:37 PM, Greg Smith wrote:

> Jeff wrote:
>> I'd done some testing a while ago on the schedulers and at the time
>> deadline or noop smashed cfq.  Now, it is 100% possible since then
>> that they've made vast improvements to cfq and or the VM to get better
>> or similar performance.  I recall a vintage of 2.6 where they severely
>> messed up the VM. Glad I didn't upgrade to that one :)
>>
>> Here's the old post:
>> http://archives.postgresql.org/pgsql-performance/2008-04/msg00155.php
>
> pgiosim doesn't really mix writes into there though, does it?  The mixed
> read/write situations are the ones where the scheduler stuff gets messy.
>

Also, read/write mix performance depends on the file system, not just the scheduler.
The block device readahead parameter can have a big impact too.

If you test xfs, make sure you configure the 'allocsize' mount parameter properly as well.  If there are any sequential
reads or writes in there mixed with other reads/writes, that can have a big impact on how fragmented the filesystem
gets.
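As an example of the sort of thing meant here (the value and device are hypothetical, tune to your allocation pattern):

```
# Mount xfs with a larger allocation size to cut fragmentation when
# sequential and random I/O are mixed:
mount -o remount,allocsize=16m /dev/sdb1 /var/lib/pgsql
```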

Ext3 has several characteristics for writes that might favor cfq that other file systems do not.  Features like delayed
allocation, extents, and write barriers significantly change the pattern of writes seen by the I/O scheduler.

In short, one scheduler may be best for one filesystem, but not a good idea for others.

And then on top of that, it all depends on what type of DB you're running.  Lots of small fast mostly read queries?
Large number of small writes?  Large bulk writes?  Large reporting queries?  Different configurations and tuning is
required to maximize performance on each.

There is no single rule for Postgres on Linux that I can think of other than "never have ext3 in 'ordered' or 'journal'
mode for your WAL on the same filesystem as your data".
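In other words, the safe arrangement is something like the following fstab sketch, with the WAL on its own filesystem so data-journaling writes on the main volume can't stall WAL flushes (devices and mount points hypothetical):

```
# /etc/fstab: data and WAL (pg_xlog) on separate filesystems:
/dev/sdb1  /var/lib/pgsql      ext3  noatime,data=ordered    0 0
/dev/sdc1  /var/lib/pgsql_wal  ext3  noatime,data=writeback  0 0
```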

> --
> Greg Smith    2ndQuadrant   Baltimore, MD
> PostgreSQL Training, Services and Support
>   www.2ndQuadrant.com
>
>


From:
Greg Smith
Date:

Scott Marlowe wrote:
> I'd love to see someone do a comparison of early to mid 2.6 kernels (2.6.18 like RHEL5) to very
> up to date 2.6 kernels.  On fast hardware.

I'd be happy just to find fast hardware that works on every kernel from
the RHEL5 2.6.18 up to the latest one without issues.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
david@lang.hm
Date:

On Wed, 10 Feb 2010, Greg Smith wrote:

> Scott Marlowe wrote:
>> I'd love to see someone do a comparison of early to mid 2.6 kernels (2.6.18
>> like RHEL5) to very
>> up to date 2.6 kernels.  On fast hardware.
>
> I'd be happy just to find fast hardware that works on every kernel from the
> RHEL5 2.6.18 up to the latest one without issues.

it depends on your definition of 'fast hardware'

I have boxes that were very fast at the time that work on all these
kernels, but they wouldn't be considered fast by today's standards.

remember that there is a point release about every 3 months; 2.6.33 is
about to be released, so 2.6.18 is a 3 x (33-18) = ~45 month old kernel.

hardware progresses a LOT in 4 years.

most of my new hardware has no problems with the old kernels as well, but
once in a while I run into something that doesn't work.

David Lang

From:
Greg Smith
Date:

 wrote:
> most of my new hardware has no problems with the old kernels as well,
> but once in a while I run into something that doesn't work.

Quick survey just of what's within 20 feet of me:
-Primary desktop:  2 years old, requires 2.6.23 or later for SATA to work
-Server:  3 years old, requires 2.6.22 or later for the Areca card not
to panic under load
-Laptops:  both about 2 years old, and require 2.6.28 to work at all;
mostly wireless issues, but some power management ones that keep the
processor from working right, and occasional SATA ones too.

I'm looking into a new primary desktop to step up to 8 HT cores; I fully
expect it won't boot anything older than 2.6.28 and may take an even
newer kernel just for basic processor and disks parts to work.

We're kind of at a worst-case point right now for this sort of thing, on
the tail side of the almost 3 year old RHEL5 using a 3.5 year old kernel
as the standard for so many Linux server deployments.  Until RHEL6 is
ready to go, there's little motivation for the people who make server
hardware to get all their drivers perfect in the newer kernels.  Just
after that ships will probably be a good time to do that sort of
comparison, just as it was possible to easily compare RHEL4 using 2.6.9
and RHEL5 with 2.6.18 in mid to late 2007, with many bits of
high-performance hardware known to work well on each.

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
  www.2ndQuadrant.com


From:
Jeff Davis
Date:

On Mon, 2010-02-08 at 09:49 -0800, Josh Berkus wrote:
> FWIW, back when deadline was first introduced Mark Wong did some tests
> and found Deadline to be the fastest of 4 on DBT2 ... but only by about
> 5%.  If the read vs. checkpoint analysis is correct, what was happening
> is the penalty for checkpoints on deadline was almost wiping out the
> advantage for reads, but not quite.

I also did some tests when I was putting together my Synchronized Scan
benchmarks:

http://j-davis.com/postgresql/83v82_scans.html

CFQ was so slow that I didn't include it in the results at all.

The tests weren't intended to compare schedulers, so I did most of the
tests with anticipatory (at least the ones on linux; I also tested
freebsd). However, I have some raw data from the tests I did run with
CFQ:

http://j-davis.com/postgresql/results/

They will take some interpretation (again, not intended as scheduler
benchmarks). The server was modified to record a log message every N
page accesses.

Regards,
    Jeff Davis