Thread: Linux deadline I/O elevator tuning
Hi all,

Has anyone experimented with the Linux deadline parameters and have some
experiences to share?

Regards,
Mark
Mark Wong <markwkm@gmail.com> wrote:
> Has anyone experimented with the Linux deadline parameters and
> have some experiences to share?

We've always used elevator=deadline because of posts like this:

http://archives.postgresql.org/pgsql-performance/2008-04/msg00148.php

I haven't benchmarked it, but when one of our new machines seemed a
little sluggish, I found this hadn't been set.  Setting this and
rebooting Linux got us back to our normal level of performance.

-Kevin
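For readers following along, a minimal sketch of how elevator=deadline is
typically applied at boot, assuming a GRUB legacy setup; the kernel image
name, root device, and config file path below are placeholders, so adjust
them for your distribution:

# Append elevator=deadline to the kernel line in the boot loader config
# (commonly /boot/grub/menu.lst for GRUB legacy), e.g.:
#
#   kernel /boot/vmlinuz-2.6.x root=/dev/sda1 ro elevator=deadline

# After rebooting, confirm the parameter made it onto the kernel
# command line:
grep elevator /proc/cmdline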
According to kernel folks, the anticipatory scheduler is even better for
DBs. Oh well, it probably means everyone has to test it on their own at
the end of the day.
On Thu, 9 Apr 2009, Grzegorz Jaśkiewicz wrote:
> According to kernel folks, the anticipatory scheduler is even better
> for DBs. Oh well, it probably means everyone has to test it on their
> own at the end of the day.

But the anticipatory scheduler basically makes the huge assumption that
you have one single disc in the system that takes a long time to seek
from one place to another.  This assumption fails on both RAID arrays
and SSDs, so I'd be interested to see some numbers to back that one up.

Matthew

--
import oz.wizards.Magic;
  if (Magic.guessRight())...           -- Computer Science Lecturer
On Thu, Apr 9, 2009 at 3:32 PM, Matthew Wakeling <matthew@flymine.org> wrote:
> On Thu, 9 Apr 2009, Grzegorz Jaśkiewicz wrote:
>>
>> According to kernel folks, the anticipatory scheduler is even better
>> for DBs. Oh well, it probably means everyone has to test it on their
>> own at the end of the day.
>
> But the anticipatory scheduler basically makes the huge assumption that
> you have one single disc in the system that takes a long time to seek
> from one place to another.  This assumption fails on both RAID arrays
> and SSDs, so I'd be interested to see some numbers to back that one up.

(btw, CFQ is the anticipatory scheduler).

No, they don't.  They only assume that the application reads blocks
synchronously, and that the data read in block N will determine where
block N+1 is going to be.  So, to avoid a possible starvation problem,
the scheduler will wait a short amount of time - in the hope that the
app will want to read the next block on disc, since putting that request
at the end of the queue could potentially starve it.  (That reason alone
is why 2.6 Linux feels so much more responsive.)

--
GJ
Matthew Wakeling <matthew@flymine.org> wrote:
> On Thu, 9 Apr 2009, Grzegorz Jaśkiewicz wrote:
>> According to kernel folks, the anticipatory scheduler is even better
>> for DBs. Oh well, it probably means everyone has to test it on their
>> own at the end of the day.
>
> But the anticipatory scheduler basically makes the huge assumption
> that you have one single disc in the system that takes a long time
> to seek from one place to another.  This assumption fails on both
> RAID arrays and SSDs, so I'd be interested to see some numbers to
> back that one up.

Yeah, we're running on servers with at least 4 effective spindles, with
some servers having several dozen effective spindles.  Assuming one is
not very effective.  The setting that seemed sluggish in our environment
was the anticipatory scheduler, so the kernel guys apparently aren't
thinking about the type of load we have on the hardware we have.

-Kevin
Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
> (btw, CFQ is the anticipatory scheduler).

These guys have it wrong?:

http://www.wlug.org.nz/LinuxIoScheduler

-Kevin
On Thu, 9 Apr 2009, Grzegorz Jaśkiewicz wrote:
> (btw, CFQ is the anticipatory scheduler).

No, CFQ and anticipatory are two completely different schedulers.  You
can choose between them.

>> But the anticipatory scheduler basically makes the huge assumption
>> that you have one single disc in the system that takes a long time
>> to seek from one place to another.  This assumption fails on both
>> RAID arrays and SSDs, so I'd be interested to see some numbers to
>> back that one up.
>
> So, to avoid a possible starvation problem, the scheduler will wait a
> short amount of time - in the hope that the app will want to read the
> next block on disc, since putting that request at the end of the queue
> could potentially starve it.  (That reason alone is why 2.6 Linux
> feels so much more responsive.)

This only actually helps if the assumptions I stated above are true.
Anticipatory is an opportunistic scheduler - it actually withholds
requests from the disc as you describe, in the hope that a block will be
fetched soon right next to the last one.  However, if you have more than
one disc, then withholding requests means that you lose the ability to
perform more than one request at once.  Also, it assumes that it will
take longer to seek to the next real request than it will take for the
program to issue its next request, which is broken on SSDs.
Anticipatory attempts to increase performance by being unfair - it is
essentially the opposite of CFQ.

Matthew

--
Now you see why I said that the first seven minutes of this section will
have you looking for the nearest brick wall to beat your head against.
This is why I do it at the end of the lecture - so I can run.
                                        -- Computer Science lecturer
On Thu, Apr 9, 2009 at 3:42 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
>> (btw, CFQ is the anticipatory scheduler).
>
> These guys have it wrong?:
>
> http://www.wlug.org.nz/LinuxIoScheduler

Sorry, I meant that it replaced it :) (it is the default now).

--
GJ
On 9-4-2009 16:09 Kevin Grittner wrote:
> I haven't benchmarked it, but when one of our new machines seemed a
> little sluggish, I found this hadn't been set.  Setting this and
> rebooting Linux got us back to our normal level of performance.

Why would you reboot after changing the elevator?  For 2.6 kernels it
can be adjusted on the fly for each device separately
(echo 'deadline' > /sys/block/sda/queue/scheduler).

I saw a nice reduction in load and slowness too after switching from cfq
to deadline on a machine that was at its maximum I/O capacity on a RAID
array.

Apart from deadline, 'noop' should also be interesting for RAID and SSD
owners, as it basically just forwards the I/O request to the device and
doesn't do much (if any?) scheduling.

Best regards,

Arjen
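For anyone trying this, a minimal sketch of the on-the-fly switch; sda is
just an example device, and older 2.6 kernels may not expose the
scheduler file at all (see Kevin's follow-up below):

# List the available schedulers; the active one is shown in brackets:
cat /sys/block/sda/queue/scheduler
# noop anticipatory deadline [cfq]

# Switch the elevator for this device only (as root); takes effect
# immediately, no reboot required:
echo deadline > /sys/block/sda/queue/scheduler

# Confirm the change:
cat /sys/block/sda/queue/scheduler
# noop anticipatory [deadline] cfq

Note that this does not survive a reboot unless it is also made
persistent (e.g. via the elevator= boot parameter).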
On Thu, Apr 9, 2009 at 7:00 AM, Mark Wong <markwkm@gmail.com> wrote:
> Hi all,
>
> Has anyone experimented with the Linux deadline parameters and have
> some experiences to share?

Hi all,

Thanks for all the responses, but I didn't mean selecting deadline so
much as tuning its parameters, such as:

antic_expire
read_batch_expire
read_expire
write_batch_expire
write_expire

Regards,
Mark
Arjen van der Meijden <acmmailing@tweakers.net> wrote:
> On 9-4-2009 16:09 Kevin Grittner wrote:
>> I haven't benchmarked it, but when one of our new machines seemed a
>> little sluggish, I found this hadn't been set.  Setting this and
>> rebooting Linux got us back to our normal level of performance.
>
> Why would you reboot after changing the elevator?  For 2.6 kernels it
> can be adjusted on the fly for each device separately
> (echo 'deadline' > /sys/block/sda/queue/scheduler).

On the OS where this happened, that's not yet an option:

kgrittn@DBUTL-PG:~> cat /proc/version
Linux version 2.6.5-7.315-bigsmp (geeko@buildhost) (gcc version 3.3.3
(SuSE Linux)) #1 SMP Wed Nov 26 13:03:18 UTC 2008

kgrittn@DBUTL-PG:~> ls -l /sys/block/sda/queue/
total 0
drwxr-xr-x 2 root root    0 2009-03-06 15:27 iosched
-rw-r--r-- 1 root root 4096 2009-03-06 15:27 nr_requests
-rw-r--r-- 1 root root 4096 2009-03-06 15:27 read_ahead_kb

On machines built more recently than the above, I do see a scheduler
entry in the /sys/block/sda/queue/ directory.  I didn't know about this
enhancement, but I'll keep it in mind.  Thanks for the tip!

> Apart from deadline, 'noop' should also be interesting for RAID and
> SSD owners, as it basically just forwards the I/O request to the
> device and doesn't do much (if any?) scheduling.

Yeah, I've been tempted to give that a try, given that we have BBU cache
with write-back.  Without a performance problem using the deadline
elevator, though, it hasn't seemed worth the time.

-Kevin
On Thu, Apr 9, 2009 at 7:53 AM, Mark Wong <markwkm@gmail.com> wrote:
> On Thu, Apr 9, 2009 at 7:00 AM, Mark Wong <markwkm@gmail.com> wrote:
>> Hi all,
>>
>> Has anyone experimented with the Linux deadline parameters and have
>> some experiences to share?
>
> Thanks for all the responses, but I didn't mean selecting deadline so
> much as tuning its parameters, such as:
>
> antic_expire
> read_batch_expire
> read_expire
> write_batch_expire
> write_expire

And I dumped the parameters for the anticipatory scheduler there by
mistake. :p  Here are the deadline parameters:

fifo_batch
front_merges
read_expire
write_expire
writes_starved

Regards,
Mark
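For reference, these tunables are exposed per device under sysfs once
deadline is the active scheduler for it; a minimal sketch of inspecting
them (sda is an example device, and the *_expire knobs are in
milliseconds):

# The deadline knobs live in the per-device iosched directory:
ls /sys/block/sda/queue/iosched/
# fifo_batch  front_merges  read_expire  write_expire  writes_starved

# Print the current value of each tunable:
for f in /sys/block/sda/queue/iosched/*; do
    echo "$f = $(cat "$f")"
done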
The anticipatory scheduler gets absolutely atrocious performance for
server workloads on even moderate server hardware.  It is applicable
only to single-spindle setups with desktop-like workloads.  Seriously,
never use this for a database.  It will _literally_ limit you to a
maximum of about 100 random access iops by waiting 10ms for 'nearby'
LBA requests.

For Postgres, deadline, cfq, and noop are the main options.

Noop is good for SSDs and a few high performance hardware caching RAID
cards (and only a few of the good ones), and poor otherwise.

Cfq tends to favor random access over sequential access in mixed load
environments and does not tend to favor reads over writes.  Because it
batches its elevator algorithm by requesting process, it becomes less
efficient with lots of spindles where multiple processes have requests
from nearby disk regions.

Deadline tends to favor reads over writes and slightly favor sequential
access over random access (and gets more MB/sec on average as a result
in mixed loads).  It tends to work well for large stand-alone servers
and not as well for desktop/workstation type loads.

I have done a little tuning of the parameters of cfq and deadline, and
never noticed much difference.  I suppose you could shift the deadline
bias toward reads or writes with these.

On 4/9/09 7:27 AM, "Grzegorz Jaśkiewicz" <gryzman@gmail.com> wrote:

> According to kernel folks, the anticipatory scheduler is even better
> for DBs. Oh well, it probably means everyone has to test it on their
> own at the end of the day.
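If you do want to experiment with shifting that bias, a sketch of the
kind of adjustment meant here; sda and the values are illustrative
assumptions, not recommendations, and the changes revert at reboot
unless made persistent:

# Bias deadline further toward reads: let reads expire sooner, let
# writes wait longer, and serve more read batches before forcing a
# write batch (example values only):
echo 250  > /sys/block/sda/queue/iosched/read_expire      # ms
echo 8000 > /sys/block/sda/queue/iosched/write_expire     # ms
echo 4    > /sys/block/sda/queue/iosched/writes_starved   # read batches per forced write batch

# To bias toward writes instead, raise read_expire and lower
# write_expire and writes_starved.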
Grzegorz Jaskiewicz wrote:
> According to kernel folks, the anticipatory scheduler is even better
> for DBs. Oh well, it probably means everyone has to test it on their
> own at the end of the day.

In my test case, noop and deadline performed well, deadline being a
little better than noop.

Both anticipatory and CFQ sucked big time.

Yours,
Laurenz Albe
On Apr 10, 2009, at 2:47 AM, Albe Laurenz *EXTERN* wrote:
> Grzegorz Jaskiewicz wrote:
>> According to kernel folks, the anticipatory scheduler is even better
>> for DBs. Oh well, it probably means everyone has to test it on their
>> own at the end of the day.
>
> In my test case, noop and deadline performed well, deadline being a
> little better than noop.
>
> Both anticipatory and CFQ sucked big time.

This is my experience as well.  I posted about playing with the
scheduler a while ago on -performance, but I can't seem to find it.

If you have a halfway OK raid controller, CFQ is useless.  You can fire
up something such as pgbench or pgiosim, fire up iostat, and then watch
your iops jump high when you flip to noop or deadline and plummet on
cfq.

Try it, it's neat!

--
Jeff Trout <jeff@jefftrout.com>
http://www.stuarthamm.net/
http://www.dellsmartexitin.com/
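Roughly the kind of quick experiment Jeff is describing; a sketch,
assuming an already-initialized pgbench database named "pgbench" and sda
as the data device (both assumptions):

# Terminal 1: watch per-device utilization and iops (look at sda's row):
iostat -x 1

# Terminal 2: run a read-heavy, select-only pgbench workload:
pgbench -S -c 16 -t 10000 pgbench

# Between runs, flip the scheduler and compare the r/s column in iostat
# and the tps that pgbench reports at the end:
echo deadline > /sys/block/sda/queue/scheduler
echo cfq      > /sys/block/sda/queue/scheduler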
Jeff <threshar@torgo.978.org> wrote:
> If you have a halfway OK raid controller, CFQ is useless.  You can
> fire up something such as pgbench or pgiosim, fire up iostat, and
> then watch your iops jump high when you flip to noop or deadline and
> plummet on cfq.

An interesting data point, but not, by itself, conclusive.  One of the
nice things about a good scheduler is that it allows multiple writes to
the OS to be combined into a single write to the controller cache.  I
think that having a large OS cache and the deadline elevator allowed us
to use what some considered extremely aggressive background writer
settings without *any* discernible increase in OS output to the disk.

The significant measure is throughput from the application's point of
view; if you see that drop as cfq causes the disk I/O to drop, *then*
you've proven your point.  Of course, I'm betting that's what you do
see....

-Kevin