Thread: Effects of setting linux block device readahead size

Effects of setting linux block device readahead size

From
"Mark Wong"
Date:
Hi all,

I've started to display the effects of changing the Linux block device
readahead buffer on sequential read performance using fio.  There
is a lot of raw data buried in the page, but this is what I've
distilled thus far.  Please have a look and let me know what you
think:

http://wiki.postgresql.org/wiki/HP_ProLiant_DL380_G5_Tuning_Guide#Readahead_Buffer_Size

Regards,
Mark

Re: Effects of setting linux block device readahead size

From
Greg Smith
Date:
On Tue, 9 Sep 2008, Mark Wong wrote:

> I've started to display the effects of changing the Linux block device
> readahead buffer to the sequential read performance using fio.

Ah ha, told you that was your missing tunable.  I'd really like to see the
whole table of one disk numbers re-run when you get a chance.  The
reversed ratio there on ext2 (59MB read/92MB write) was what tipped me off
that something wasn't quite right initially, and until that's fixed it's
hard to analyze the rest.

Based on your initial data, I'd say that the two useful read-ahead
settings for this system are 1024KB (conservative but a big improvement)
and 8192KB (point of diminishing returns).  The one-disk table you've got
(labeled with what the default read-ahead is) and new tables at those two
values would really flesh out what each disk is capable of.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Effects of setting linux block device readahead size

From
"Scott Carey"
Date:
How does that readahead tunable affect random reads or mixed random / sequential situations?  In many databases, the worst case scenarios aren't when you have a bunch of concurrent sequential scans but when there is enough concurrent random read/write to slow the whole thing down to a crawl.  What matters then is how the file system behaves under that sort of concurrency.

I would be very interested in a mixed fio profile with a "background writer" doing moderate, paced random and sequential writes combined with concurrent sequential reads and random reads.
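
Something along these lines is what I have in mind -- a rough, untested sketch of such a profile, with the directory, sizes, and pacing rate just placeholders to adjust:

[global]
directory=/data/test
blocksize=8k
ioengine=sync
size=8g
runtime=1m

; paced background writers, one random and one sequential
[bg-random-writer]
rw=randwrite
rate=10m

[bg-seq-writer]
rw=write
rate=10m

; concurrent readers running at the same time as the writers
[seq-readers]
rw=read
numjobs=4

[random-readers]
rw=randread
numjobs=4

All the jobs in one fio file run concurrently unless told otherwise, which is the point here.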

-Scott

On Wed, Sep 10, 2008 at 7:49 AM, Greg Smith <gsmith@gregsmith.com> wrote:
On Tue, 9 Sep 2008, Mark Wong wrote:

I've started to display the effects of changing the Linux block device
readahead buffer on sequential read performance using fio.

Ah ha, told you that was your missing tunable.  I'd really like to see the whole table of one disk numbers re-run when you get a chance.  The reversed ratio there on ext2 (59MB read/92MB write) was what tipped me off that something wasn't quite right initially, and until that's fixed it's hard to analyze the rest.

Based on your initial data, I'd say that the two useful read-ahead settings for this system are 1024KB (conservative but a big improvement) and 8192KB (point of diminishing returns).  The one-disk table you've got (labeled with what the default read-ahead is) and new tables at those two values would really flesh out what each disk is capable of.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD



Re: Effects of setting linux block device readahead size

From
"Mark Wong"
Date:
On Wed, Sep 10, 2008 at 9:26 AM, Scott Carey <scott@richrelevance.com> wrote:
> How does that readahead tunable affect random reads or mixed random /
> sequential situations?  In many databases, the worst case scenarios aren't
> when you have a bunch of concurrent sequential scans but when there is
> enough random read/write concurrently to slow the whole thing down to a
> crawl.   How the file system behaves under this sort of concurrency
>
> I would be very interested in a mixed fio profile with a "background writer"
> doing moderate, paced random and sequential writes combined with concurrent
> sequential reads and random reads.

The data for the other fio profiles we've been using are on the wiki,
if your eyes can take the strain.  We are working on presenting the
data in a more easily digestible manner.  I don't think we'll add any
more fio profiles in the interest of moving on to doing some sizing
exercises with the dbt2 oltp workload.  We're just going to wrap up a
couple more scenarios first and get through a couple of conference
presentations.  The two conferences in particular are the Linux
Plumbers Conference, and the PostgreSQL Conference: West 08, which are
both in Portland, Oregon.

Regards,
Mark

Re: Effects of setting linux block device readahead size

From
"Scott Carey"
Date:
I am planning my own I/O tuning exercise for a new DB and am setting up some fio profiles.  I appreciate the work and will use some of yours as a baseline to move forward.  I will be making some mixed-mode fio profiles and running our own application and database as a test as well.  However, I'll focus on ext3 versus xfs (Linux) and zfs (Solaris), expect to be working with sequential transfer rates many times larger than in your tests, and am interested in performance under heavy concurrency -- so the results may differ quite a bit.

I'll share the info I can.


On Wed, Sep 10, 2008 at 10:38 AM, Mark Wong <markwkm@gmail.com> wrote:
On Wed, Sep 10, 2008 at 9:26 AM, Scott Carey <scott@richrelevance.com> wrote:
> How does that readahead tunable affect random reads or mixed random /
> sequential situations?  In many databases, the worst case scenarios aren't
> when you have a bunch of concurrent sequential scans but when there is
> enough random read/write concurrently to slow the whole thing down to a
> crawl.   How the file system behaves under this sort of concurrency
>
> I would be very interested in a mixed fio profile with a "background writer"
> doing moderate, paced random and sequential writes combined with concurrent
> sequential reads and random reads.

The data for the other fio profiles we've been using are on the wiki,
if your eyes can take the strain.  We are working on presenting the
data in a more easily digestible manner.  I don't think we'll add any
more fio profiles in the interest of moving on to doing some sizing
exercises with the dbt2 oltp workload.  We're just going to wrap up a
couple more scenarios first and get through a couple of conference
presentations.  The two conferences in particular are the Linux
Plumbers Conference, and the PostgreSQL Conference: West 08, which are
both in Portland, Oregon.

Regards,
Mark

Re: Effects of setting linux block device readahead size

From
Greg Smith
Date:
On Wed, 10 Sep 2008, Scott Carey wrote:

> How does that readahead tunable affect random reads or mixed random /
> sequential situations?

It still helps as long as you don't make the parameter giant.  The read
cache in a typical hard drive nowadays is 8-32MB.  If you're seeking a
lot, you still might as well read the next 1MB or so after the block
requested once you've gone to the trouble of moving the disk somewhere.
Seek-bound workloads will only waste a relatively small amount of the
disk's read cache that way--the slow seek rate itself keeps that from
polluting the buffer cache too fast with those reads--while sequential
ones benefit enormously.

If you look at Mark's tests, you can see approximately where the readahead
is filling the disk's internal buffers, because what happens then is the
sequential read performance improvement levels off.  That looks near 8MB
for the array he's tested, but I'd like to see a single disk to better
feel that out.  Basically, once you know that, you back off from there as
much as you can without killing sequential performance completely and that
point should still support a mixed workload.

Disks are fairly well understood physical components, and if you think in
those terms you can build a gross model easily enough:

Average seek time:      4ms
Seeks/second:           250
Data read/seek:         1MB     (read-ahead number goes here)
Total read bandwidth:   250MB/s

Since that's around what a typical interface can support, that's why I
suggest a 1MB read-ahead shouldn't hurt even seek-only workloads, and it's
pretty close to optimal for sequential as well here (big improvement from
the default Linux RA of 256 blocks=128K).  If you know your work is biased
heavily toward sequential scans, you might pick the 8MB read-ahead
instead.  That value (--setra=16384 -> 8MB) has actually been the standard
"start here" setting 3ware suggests on Linux for a while now:
http://www.3ware.com/kb/Article.aspx?id=11050

> I would be very interested in a mixed fio profile with a "background writer"
> doing moderate, paced random and sequential writes combined with concurrent
> sequential reads and random reads.

Trying to make disk benchmarks really complicated is a path that leads to
a lot of wasted time.  I once made this gigantic design plan for something
that worked like the PostgreSQL buffer management system to serve as a disk
benchmarking tool.  I threw it away after confirming I could do better
with carefully scripted pgbench tests.

If you want to benchmark something that looks like a database workload,
benchmark a database workload.  That will always be better than guessing
what such a workload acts like in a synthetic fashion.  The "seeks/second"
number bonnie++ spits out is good enough for most purposes at figuring out
if you've detuned seeks badly.

"pgbench -S" run against a giant database gives results that look a lot
like seeks/second, and if you mix multiple custom -f tests together it
will round-robin between them at random...
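
To sketch that out (the scale factor, client count, and database name are
just examples; the important part is a data set much bigger than RAM):

pgbench -i -s 4000 testdb          # initialize a large data set
pgbench -S -c 8 -t 20000 testdb    # select-only run; TPS here tracks seeks/second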

It's really helpful to measure these various disk subsystem parameters
individually.  Knowing the sequential read/write, seeks/second, and commit
rate for a disk setup is mainly valuable at making sure you're getting the
full performance expected from what you've got.  Like in this example,
where something was obviously off on the single disk results because reads
were significantly slower than writes.  That's not supposed to happen, so
you know something basic is wrong before you even get into RAID and such.
Beyond confirming whether or not you're getting approximately what you
should be out of the basic hardware, disk benchmarks are much less useful
than application ones.

With all that, I think I just gave away what the next conference paper
I've been working on is about.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Effects of setting linux block device readahead size

From
"Scott Carey"
Date:
Great info Greg,

Some follow-up questions and information in-line:

On Wed, Sep 10, 2008 at 12:44 PM, Greg Smith <gsmith@gregsmith.com> wrote:
On Wed, 10 Sep 2008, Scott Carey wrote:

How does that readahead tunable affect random reads or mixed random /
sequential situations?

It still helps as long as you don't make the parameter giant.  The read cache in a typical hard drive nowadays is 8-32MB.  If you're seeking a lot, you still might as well read the next 1MB or so after the block requested once you've gone to the trouble of moving the disk somewhere. Seek-bound workloads will only waste a relatively small amount of the disk's read cache that way--the slow seek rate itself keeps that from polluting the buffer cache too fast with those reads--while sequential ones benefit enormously.

If you look at Mark's tests, you can see approximately where the readahead is filling the disk's internal buffers, because what happens then is the sequential read performance improvement levels off.  That looks near 8MB for the array he's tested, but I'd like to see a single disk to better feel that out.  Basically, once you know that, you back off from there as much as you can without killing sequential performance completely and that point should still support a mixed workload.

Disks are fairly well understood physical components, and if you think in those terms you can build a gross model easily enough:

Average seek time:      4ms
Seeks/second:           250
Data read/seek:         1MB     (read-ahead number goes here)
Total read bandwidth:   250MB/s

Since that's around what a typical interface can support, that's why I suggest a 1MB read-ahead shouldn't hurt even seek-only workloads, and it's pretty close to optimal for sequential as well here (big improvement from the default Linux RA of 256 blocks=128K).  If you know your work is biased heavily toward sequential scans, you might pick the 8MB read-ahead instead.  That value (--setra=16384 -> 8MB) has actually been the standard "start here" setting 3ware suggests on Linux for a while now: http://www.3ware.com/kb/Article.aspx?id=11050

Ok, so this is a drive level parameter that affects the data going into the disk cache?  Or does it also get pulled over the SATA/SAS link into the OS page cache?  I've been searching around with google for the answer and can't seem to find it.

Additionally, I would like to know how this works with hardware RAID -- does it set this value per disk?  Does it set it at the array level (so that 1MB with an 8-disk stripe is actually 128K per disk)?  Is it RAID driver dependent?  If it is purely the OS, then it is above the RAID level and affects the whole array -- and is hence almost useless.  If it is for the whole array, it would have a horrendous negative impact on random I/O per second if the total readahead became longer than a stripe width -- if it is a full stripe then each I/O, even those less than the size of a stripe, would cause an I/O on every drive, dropping the I/O per second to that of a single drive.
If it is a drive-level setting, then it won't affect I/O per second by making I/Os span multiple drives in a RAID, which is good.

Additionally, the O/S should have a good heuristic-based read-ahead process that should make the drive/device level read-ahead much less important.  I don't know how long it's going to take for Linux to do this right:
http://archives.postgresql.org/pgsql-performance/2006-04/msg00491.php
http://kerneltrap.org/node/6642


Lets expand a bit on your model above for a single disk:

A single disk, with 4ms seeks, and max disk throughput of 125MB/sec.  The interface can transfer 300MB/sec.
250 seeks/sec.  Some chunk of data in that seek is free; afterwards it surely is not.
512KB can be read in 4ms then.  A 1MB read-ahead would result in:
4ms seek + 8ms read = 12ms per I/O, or ~83 seeks/sec.
However, some chunk of that 1MB is "free" with the seek.  I'm not sure how much per drive, but it is likely on the order of 8K - 64K.

I suppose I'll have to experiment in order to find out.  But I can't see how a 1MB read-ahead, which should take 2x as long as the seek time to read off the platters, could not have a significant impact on random I/O per second on single drives.   For SATA drives the transfer rate to seek time ratio is smaller, and their caches are bigger, so a larger read-ahead will impact things less.


 


I would be very interested in a mixed fio profile with a "background writer"
doing moderate, paced random and sequential writes combined with concurrent
sequential reads and random reads.

Trying to make disk benchmarks really complicated is a path that leads to a lot of wasted time.  I once made this gigantic design plan for something that worked like the PostgreSQL buffer management system to serve as a disk benchmarking tool.  I threw it away after confirming I could do better with carefully scripted pgbench tests.

If you want to benchmark something that looks like a database workload, benchmark a database workload.  That will always be better than guessing what such a workload acts like in a synthetic fashion.  The "seeks/second" number bonnie++ spits out is good enough for most purposes at figuring out if you've detuned seeks badly.

"pgbench -S" run against a giant database gives results that look a lot like seeks/second, and if you mix multiple custom -f tests together it will round-robin between them at random...

I suppose I should learn more about pgbench.  Most of this depends on how much time it takes to do one versus the other.  In my case, setting up the DB will take significantly longer than writing 1 or 2 more fio profiles.  I categorize mixed-load tests as basic tests -- you don't want to uncover, during the application test, configuration issues that a simple mix of read/write and sequential/random could have exposed earlier.  This is similar to increasing the concurrency.  Some file systems deal with concurrency much better than others.
 

It's really helpful to measure these various disk subsystem parameters individually.  Knowing the sequential read/write, seeks/second, and commit rate for a disk setup is mainly valuable at making sure you're getting the full performance expected from what you've got.  Like in this example, where something was obviously off on the single disk results because reads were significantly slower than writes.  That's not supposed to happen, so you know something basic is wrong before you even get into RAID and such. Beyond confirming whether or not you're getting approximately what you should be out of the basic hardware, disk benchmarks are much less useful than application ones.

Absolutely -- it's critical to run the synthetic tests, and the random read/write and sequential read/write ones are essential.  These should be tuned and understood before going on to more complicated things.
However, once you actually go and set up a database test, there are tons of questions -- what type of database?  What type of query load?  What type of mix?  How big?  In my case, the answer is: our database, our queries, and big.  That takes a lot of setup effort, and redoing it for each new file system will take a long time in my case -- pg_restore takes a day+.  Therefore, I'd like to know ahead of time which file system + configuration combinations are a waste of time because they don't perform under concurrency with a mixed workload.  That's my admittedly greedy need for the extra test results.

 
With all that, I think I just gave away what the next conference paper I've been working on is about.


--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Looking forward to it!

Re: Effects of setting linux block device readahead size

From
Greg Smith
Date:
On Wed, 10 Sep 2008, Scott Carey wrote:

> Ok, so this is a drive level parameter that affects the data going into the
> disk cache?  Or does it also get pulled over the SATA/SAS link into the OS
> page cache?

It's at the disk block driver level in Linux, so I believe that's all
going into the OS page cache.  They've been rewriting that section a bit
and I haven't checked it since that change (see below).

> Additionally, I would like to know how this works with hardware RAID -- Does
> it set this value per disk?

Hardware RAID controllers usually have their own read-ahead policies that
may or may not impact whether the OS-level read-ahead is helpful.  Since
Mark's tests are going straight into the RAID controller, that's why it's
helpful here, and why many people don't ever have to adjust this
parameter.  For example, it doesn't give a dramatic gain on my Areca card
even in JBOD mode, because that thing has its own cache to manage with its
own agenda.

Once you start fiddling with RAID stripe sizes as well the complexity
explodes, and next thing you know you're busy moving the partition table
around to make the logical sectors line up with the stripes better and
similar exciting work.

> Additionally, the O/S should have a good heuristic based read-ahead process
> that should make the drive/device level read-ahead much less important.  I
> don't know how long its going to take for Linux to do this right:
> http://archives.postgresql.org/pgsql-performance/2006-04/msg00491.php
> http://kerneltrap.org/node/6642

That was committed in 2.6.23:

http://kernelnewbies.org/Linux_2_6_23#head-102af265937262a7a21766ae58fddc1a29a5d8d7

but clearly a larger minimum hint still helps, as the system whose
benchmarks we've been staring at has that feature.

> Some chunk of data in that seek is free, afterwords it is surely not...

You can do a basic model of the drive to get a ballpark estimate on these
things like I threw out, but trying to break down every little bit gets
hairy.  In most estimation cases you see, where 128kB is the amount being
read, the actual read time is so small compared to the rest of the numbers
that it just gets ignored.

I was actually being optimistic about how much cache can get filled by
seeks.  If the disk is spinning at 15000RPM, that's 4ms to do a full
rotation.  That means that on average you'll also wait 2ms just to get the
heads lined up to read that one sector on top of the 4ms seek to get in
the area; now we're at 6ms before you've read anything, topping seeks out
at under 167/second.  That number--average seek time plus half a
rotation--is what a lot of people call the IOPS for the drive.  There,
typically the time spent actually reading data once you've gone through
all that doesn't factor in.  IOPS is not very well defined--some people
*do* include the reading time once you're there--which is one reason I
don't like to use it.  There's a nice chart showing some typical computations here at
http://www.dbasupport.com/oracle/ora10g/disk_IO_02.shtml if anybody wants
to see how this works for other classes of disk.  The other reason I don't
like focusing too much on IOPS (some people act like it's the only
measurement that matters) is that it tells you nothing about the
sequential read rate, and you have to consider both at once to get a clear
picture--particularly when there are adjustments that impact those two
oppositely, like read-ahead.

As far as the internal transfer speed of the heads to the drive's cache
once it's lined up, those are creeping up toward the 200MB/s range for the
kind of faster drives the rest of these stats come from.  So the default
of 128kB is going to take 0.6ms, while a full 1MB might take 5ms.  You're
absolutely right to question how hard that will degrade seek performance;
these slightly more accurate numbers suggest that might be as bad as going
from 6.6ms to 11ms per seek, or from 150 IOPS to 91 IOPS.  It also points
out how outrageously large the really big read-ahead numbers are once
you're seeking instead of sequentially reading.

One thing that's hard to know is how much read-ahead the drive was going
to do on its own anyway, no matter what you told it, as part of its
caching algorithm.

> I suppose I should learn more about pgbench.

Most people use it as just a simple benchmark that includes a mixed
read/update/insert workload.  But that's internally done using a little
command substitution "language" that lets you easily write things like
"generate a random number between 1 and 1M, read the record from this
table, and then update this associated record" that scale based on how big
the data set you've given it is.  You can write your own scripts in that
form too.  And if you specify several scripts like that at a time, it will
switch between them at random, and you can analyze the average execution
time broken down per type if you save the latency logs. Makes it real easy
to adjust the number of clients and the mix of things you have them do.
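
For example, a three-line script file like this is roughly what the built-in
-S test does (table and column names are the standard pgbench ones, the file
name is whatever you like):

\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts
SELECT abalance FROM accounts WHERE aid = :aid;

Run it with something like "pgbench -c 16 -t 10000 -l -f seeks.sql dbname",
add a second -f script that does updates, and you get the random mixing and
per-transaction latency logging described above.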

The main problem: it doesn't scale to large numbers of clients very well.
But it can easily simulate 50-100 banging away at a time which is usually
enough to rank filesystem concurrency capabilities, for example.  It's
certainly way easier to throw together a benchmark using it that is
similar to an abstract application than it is to try and model multi-user
database I/O using fio.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Effects of setting linux block device readahead size

From
James Mansion
Date:
Greg Smith wrote:
> Average seek time:      4ms
> Seeks/second:        250
> Data read/seek:        1MB    (read-ahead number goes here)
> Total read bandwidth:    250MB/s
>
Most spinning disks now are nearer to 100MB/s streaming.  You've talked
yourself into twice that, random access!

James


Re: Effects of setting linux block device readahead size

From
"Scott Marlowe"
Date:
On Wed, Sep 10, 2008 at 11:21 PM, James Mansion
<james@mansionfamily.plus.com> wrote:
> Greg Smith wrote:
>>
>> Average seek time:      4ms
>> Seeks/second:        250
>> Data read/seek:        1MB    (read-ahead number goes here)
>> Total read bandwidth:    250MB/s
>>
> Most spinning disks now are nearer to 100MB/s streaming.  You've talked
> yourself into twice that, random access!

The fastest cheetahs on this page hit 171MB/second:

http://www.seagate.com/www/en-us/products/servers/cheetah/

Are there any drives that have a faster sequential transfer rate out there?

Checked out Hitachi's global storage site and their fastest drive
seems just a tad slower.

Re: Effects of setting linux block device readahead size

From
Greg Smith
Date:
On Thu, 11 Sep 2008, James Mansion wrote:

> Most spinning disks now are nearer to 100MB/s streaming.  You've talked
> yourself into twice that, random access!

The point I was trying to make there is that even under impossibly optimal
circumstances, you'd be hard pressed to blow out the disk's read cache
with seek-dominated data even if you read a lot at each seek point.  That
idea didn't make it from my head into writing very well though.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Effects of setting linux block device readahead size

From
"Scott Carey"
Date:
Hmm, I would expect this tunable to potentially be rather file system dependent, and potentially RAID controller dependent.  The test was using ext2; perhaps the others automatically prefetch or read ahead?   Does it vary by RAID controller?

Well I went and found out, using ext3 and xfs.  I have about 120+ data points but here are a few interesting ones before I compile the rest and answer a few other questions of my own.

1:  readahead does not affect "pure" random I/O -- there seems to be a heuristic trigger -- a single process or file probably has to request a sequence of linear I/O of some size to trigger it.  I set it to over 64MB of read-ahead and random iops remained the same to prove this.
2:  File system matters more than you would expect.  XFS sequential transfers when readahead was tuned had TWICE the sequential throughput of ext3, both for a single reader and 8 concurrent readers on 8 different files.
3:  The RAID controller and its configuration make a pretty significant difference as well.

Hardware:
12 7200RPM SATA (Seagate) in raid 10 on 3Ware 9650 (only ext3)
12 7200RPM SATA ('nearline SAS' : Seagate ES.2) on PERC 6 in raid 10 (ext3, xfs)
I also have some results with PERC raid 10 with 4x 15K SAS, not reporting in this message though


Testing process:
All tests begin with
#sync; echo 3 > /proc/sys/vm/drop_caches;
followed by
#blockdev --setra XXX /dev/sdb
Even though FIO claims that it issues reads that don't go to cache, the read-ahead DOES go to the file system cache, and so one must drop the caches to get consistent results unless you disable the read-ahead.  Even if you are reading more than 2x the physical RAM, that first half of the test is distorted.  By flushing the cache first, my results became consistent to within about +-2%.
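
In other words, each data point is collected with a little loop along these lines (the readahead values and job file name are just examples):

for ra in 256 1024 12288 49152; do
    sync; echo 3 > /proc/sys/vm/drop_caches
    blockdev --setra $ra /dev/sdb
    fio seq-read8.fio
done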

Tests
-- fio, read 8 files concurrently, sequential read profile, one process per file:
[seq-read8]
rw=read
; this will be total of all individual files per process
size=8g
directory=/data/test
fadvise_hint=0
blocksize=8k
direct=0
ioengine=sync
iodepth=1
numjobs=8
; this is number of files total per process
nrfiles=1
runtime=1m

-- fio, read one large file sequentially with one process
[seq-read]
rw=read
; this will be total of all individual files per process
size=64g
directory=/data/test
fadvise_hint=0
blocksize=8k
direct=0
ioengine=sync
iodepth=1
numjobs=1
; this is number of files total per process
nrfiles=1
runtime=1m

-- 'dd' in a few ways:
Measure direct to partition / disk read rate at the start of the disk:
'dd if=/dev/sdb of=/dev/null ibs=24M obs=64K'
Measure direct to partition / disk read rate near the end of the disk:
'dd if=/dev/sdb1 of=/dev/null ibs=24M obs=64K skip=160K'
Measure direct read of the large file used by the FIO one sequential file test:
'dd if=/data/test/seq-read.1.0 of=/dev/null ibs=32K obs=32K'

The dd parameters for block sizes were chosen with much experimentation to get the best result.


Results:
I've got a lot of results; I'm only going to put a few of them here for now while I investigate a few other things (see the end of this message).
Preliminary summary:

PERC 6, ext3, full partition.
dd beginning of disk :  642MB/sec
dd end of disk: 432MB/sec
dd large file (readahead 49152): 312MB/sec
-- maximum expected sequential capabilities above?

fio: 8 concurrent readers and 1 concurrent reader results
readahead is in 512 byte blocks, sequential transfer rate in MiB/sec as reported by fio.

readahead  |  8 conc read rate  |  1 conc read rate
49152  |  311  |  314
16384  |  312  |  312
12288  |  304  |  309
 8192  |  292  |
 4096  |  264  |
 2048  |  211  |
 1024  |  162  |  302
  512  |  108  |
  256  |  81  | 300
    8  |  38  |

Conclusion: on this array, going up to a 12288 (6MB) readahead makes a huge impact on sequential read performance under concurrency.  That is 1MB per RAID slice (6 slices, 12 disks in RAID 10).  It has almost no impact at all on one sequential read alone; the OS or the RAID controller is dealing with that case just fine.

But, how much of the above effect is ext3?  How much is it the RAID card?  At the top end, the sequential rate for both concurrent and single sequential access is in line with what dd can get going through ext3.  But it is not even close to what you can get going right to the device and bypassing the file system.

Let's try a different RAID card first.  The disks aren't exactly the same, and there is no guarantee that the file is positioned near the beginning or end, but I've got another 12-disk RAID 10, using a 3Ware 9650 card.

Results, as above -- don't conclude this card is faster, the files may have just been closer to the front of the partition.
dd, beginning of disk: 522MB/sec
dd, end of disk array: 412MB/sec
dd, file read via file system (readahead 49152): 391MB/sec

readahead  |  8 conc read rate  |  1 conc read rate
49152  |  343  |  392
16384  |  349  |  379
12288  |  348  |  387
 8192  |  344  |
 6144  |      |  376
 4096  |  340  |
 2048  |  319  |
 1024  |  284  |  371
  512  |  239  |  376
  256  |  204  |  377
  128  |  169  |  386
    8  |  47  |  382

Conclusion: this RAID controller definitely behaves differently -- it is much less sensitive to the readahead.  Perhaps it has a larger stripe size?  Most likely this one is set up with a 256K stripe; I do not know the other one's, though the PERC 6 default of 64K is likely.
 

Ok, so the next question is how file systems play into this.
First, I ran a bunch of tests with xfs, and the results were rather odd.  That is when I realized that the platter speeds at the start and end of the arrays are significantly different, and xfs and ext3 make different decisions on where to put the files on an empty partition (xfs spreads them evenly; ext3 puts them closer together, but still somewhat randomly positioned).

So, I created a partition that was roughly 10% the size of the whole thing, at the beginning of the array.

Using the PERC 6 setup, this leads to:
dd, against partition: 660MB/sec max result, 450MB/sec min -- not a reliable test for some reason
dd, against file on the partition (ext3): 359MB/sec

ext3 (default settings):
readahead  |  8 conc read rate  |  1 conc read rate
49152  |  363  | 
12288  |  359  | 
  6144  |  319  | 
  1024  |  176  |
   256  |      |
 
Analysis:  I only have 8 concurrent read results here, as these are the most interesting based on the results from the whole disk tests above.  I also did not collect a lot of data points.
What is clear is that the partition at the front does make a difference: compared to the whole-partition results we have about 15% more throughput on the 8 concurrent read test, meaning that ext3 probably put the files near the middle of the drive geometry in the whole-disk case.
The 8 concurrent read test has the same "break point" at about 6MB read ahead buffer, which is also consistent.

And now, for XFS, a full result set and VERY surprising results.  I dare say, the benchmarks that led me to do these tests are not complete without XFS tests:

xfs (default settings):
readahead  |  8 conc read rate  |  1 conc read rate
98304  |  651  |  640
65536  |  636  |  609
49152  |  621  |  595
32768  |  602  |  565
24576  |  595  |  548
16384  |  560  |  518
12288  |  505  |  480
 8192  |  437  |  394
 6144  |  412  |  415 *
 4096  |  357  |  281 *
 3072  |  329  |  338
 2048  |  259  |  383
 1536  |  230  |  445
 1280  |  207  |  542
 1024  |  182  |  605  *
  896  |  167  |  523
  768  |  148  |  456
  512  |  119  |  354
  256  |   88   |  303
   64  |   60   | 171
    8  |  36  |  55

* These local maxima and minima for the sequential transfer were tested several times to validate them.  They may have something to do with me not tuning the inode layout for an array using the xfs stripe unit and stripe width parameters.

dd, on the file used in the single reader sequential read test:
660MB/sec.   One other result for the sequential transfer, using a gigantic 393216 (192MB) readahead:
672 MB/sec.

Analysis:
XFS gets significantly higher sequential (read) transfer rates than ext3.  It had higher write results but I've only done one of those.
Both ext3 and xfs can be tuned a bit more, mainly with noatime and some parameters so they know about the geometry of the raid array.


Other misc results:
 I used the deadline scheduler, it didn't impact the results here.
 I ran some tests to "feel out" the sequential transfer rate sensitivity to readahead for a 4x 15K RPM SAS raid setup -- it is less sensitive:
  ext3, 8 concurrent reads -- readahead = 256, 195MB/sec;  readahead = 3072, 200MB/sec;  readahead = 32768, 210MB/sec;  readahead = 64, 120MB/sec
On the 3ware setup, with ext3, postgres was installed and a select count(1) from a table reported between 300 and 320 MB/sec against tables larger than 5GB, and disk utilization was about 88%.  dd can get 390MB/sec with the settings used (readahead 12288).
With the readahead set back to the default, postgres gets about 220MB/sec at 100% disk util on similar tables.  I will be testing out xfs on this same data eventually, and expect it to provide significant gains there.
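
For anyone who wants to reproduce that kind of number, the procedure is roughly as follows -- database and table names are placeholders, and iostat -x 1 in another terminal shows the disk utilization:

sync; echo 3 > /proc/sys/vm/drop_caches
time psql -d testdb -c "select count(1) from big_table"
# MB/sec is roughly the table size (pg_relation_size) divided by the elapsed time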

Remaining questions:
Readahead does NOT activate for pure random requests, which is a good thing.   The question is, when does it activate?  I'll have to write some custom fio tests to find out.  I suspect it activates either when the OS detects some number X of sequential requests on the same file (or from the same process), or after sequential access of at least Y bytes.  I'll report results once I know, so I can construct some worst-case scenarios for using a large readahead.
I will also measure its effect when mixed random access and streaming reads occur.


On Wed, Sep 10, 2008 at 7:49 AM, Greg Smith <gsmith@gregsmith.com> wrote:
On Tue, 9 Sep 2008, Mark Wong wrote:

I've started to display the effects of changing the Linux block device
readahead buffer on sequential read performance using fio.

Ah ha, told you that was your missing tunable.  I'd really like to see the whole table of one disk numbers re-run when you get a chance.  The reversed ratio there on ext2 (59MB read/92MB write) was what tipped me off that something wasn't quite right initially, and until that's fixed it's hard to analyze the rest.

Based on your initial data, I'd say that the two useful read-ahead settings for this system are 1024KB (conservative but a big improvement) and 8192KB (point of diminishing returns).  The one-disk table you've got (labeled with what the default read-ahead is) and new tables at those two values would really flesh out what each disk is capable of.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD



Re: Effects of setting linux block device readahead size

From
James Mansion
Date:
Greg Smith wrote:
> The point I was trying to make there is that even under impossibly
> optimal circumstances, you'd be hard pressed to blow out the disk's
> read cache with seek-dominated data even if you read a lot at each
> seek point.  That idea didn't make it from my head into writing very
> well though.
>
Isn't there a bigger danger in blowing out the cache on the controller
and causing premature pageout of its dirty pages?

If you could get the readahead to work on the drive and not return data
to the controller, that might be dandy, but I'm sceptical.

James


Re: Effects of setting linux block device readahead size

From
"Scott Carey"
Date:
Drives have their own read-ahead in the firmware.  Many can keep track of 2 or 4 concurrent file accesses.  A few can keep track of more.  This also plays in with the NCQ or SCSI command queuing implementation.

Consumer drives will often read-ahead much more than server drives optimized for i/o per second.
The difference in read-ahead sensitivity between the two setups I tested may be due to one setup using nearline-SAS (SATA, tuned for I/Os per second using SAS firmware) and the other using consumer SATA.
For example, here is one "nearline SAS" style server tuned drive versus a consumer tuned one:
http://www.storagereview.com/php/benchmark/suite_v4.php?typeID=10&testbedID=4&osID=6&raidconfigID=1&numDrives=1&devID_0=354&devID_1=348&devCnt=2

The Linux readahead setting is _definitely_ in the kernel, definitely uses and fills the page cache, and from what I can gather, simply issues extra I/O's to the hardware beyond the last one requested by an app in certain situations.  It does not make your I/O request larger, it just queues an extra I/O following your request.
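
These are two views of the same kernel setting, by the way -- for example (sdb is just the device I happen to be testing):

blockdev --getra /dev/sdb                 # in 512-byte sectors
cat /sys/block/sdb/queue/read_ahead_kb    # same value, in KB (sectors / 2)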

On Thu, Sep 11, 2008 at 12:54 PM, James Mansion <james@mansionfamily.plus.com> wrote:
Greg Smith wrote:
The point I was trying to make there is that even under impossibly optimal circumstances, you'd be hard pressed to blow out the disk's read cache with seek-dominated data even if you read a lot at each seek point.  That idea didn't make it from my head into writing very well though.

Isn't there a bigger danger in blowing out the cache on the controller and causing premature pageout of its dirty pages?

If you could get the readahead to work on the drive and not return data to the controller, that might be dandy, but I'm sceptical.

James




Re: Effects of setting linux block device readahead size

From
"Scott Carey"
Date:
Sorry, I forgot to mention the Linux kernel version I'm using, etc:

2.6.18-92.1.10.el5 #1 SMP x86_64
CentOS 5.2.

The "adaptive" read-ahead, as well as other enhancements in the kernel, are taking place or coming soon in the most recent stuff.  Some distributions offer the adaptive read-ahead as an add-on (Debian, for example).  This is an area where much can be improved in Linux http://kerneltrap.org/node/6642
http://kernelnewbies.org/Linux_2_6_23#head-102af265937262a7a21766ae58fddc1a29a5d8d7

I obviously did not test how the new read-ahead stuff impacts these sorts of tests.

On Thu, Sep 11, 2008 at 12:07 PM, Scott Carey <scott@richrelevance.com> wrote:
Hmm, I would expect this tunable to potentially be rather file system dependent, and potentially RAID controller dependent.  The test was using ext2; perhaps the others automatically prefetch or read ahead?   Does it vary by RAID controller?

Well I went and found out, using ext3 and xfs.  I have about 120+ data points but here are a few interesting ones before I compile the rest and answer a few other questions of my own.

1:  readahead does not affect "pure" random I/O -- there seems to be a heuristic trigger -- a single process or file probably has to request a sequence of linear I/O of some size to trigger it.  I set it to over 64MB of read-ahead and random iops remained the same to prove this.
2:  File system matters more than you would expect.  XFS sequential transfers when readahead was tuned had TWICE the sequential throughput of ext3, both for a single reader and 8 concurrent readers on 8 different files.
3:  The RAID controller and its configuration make a pretty significant difference as well.

Hardware:
12 7200RPM SATA (Seagate) in raid 10 on 3Ware 9650 (only ext3)
12 7200RPM SATA ('nearline SAS' : Seagate ES.2) on PERC 6 in raid 10 (ext3, xfs)
I also have some results with PERC raid 10 with 4x 15K SAS, not reporting in this message though

  . . . {snip}

Re: Effects of setting linux block device readahead size

From
david@lang.hm
Date:
On Thu, 11 Sep 2008, Scott Carey wrote:

> Drives have their own read-ahead in the firmware.  Many can keep track of 2
> or 4 concurrent file accesses.  A few can keep track of more.  This also
> plays in with the NCQ or SCSI command queuing implementation.
>
> Consumer drives will often read-ahead much more than server drives optimized
> for i/o per second.
> The difference in read-ahead sensitivity between the two setups I tested may
> be due to one setup using nearline-SAS (SATA, tuned for io-per sec using SAS
> firmware) and the other used consumer SATA.
> For example, here is one "nearline SAS" style server tuned drive versus a
> consumer tuned one:
>
http://www.storagereview.com/php/benchmark/suite_v4.php?typeID=10&testbedID=4&osID=6&raidconfigID=1&numDrives=1&devID_0=354&devID_1=348&devCnt=2
>
> The Linux readahead setting is _definitely_ in the kernel, definitely uses
> and fills the page cache, and from what I can gather, simply issues extra
> I/O's to the hardware beyond the last one requested by an app in certain
> situations.  It does not make your I/O request larger, it just queues an
> extra I/O following your request.

that extra I/O will be merged with your request by the I/O scheduler code
so that by the time it gets to the drive it will be a single request.

but even if it didn't, most modern drives read the entire cylinder into
their buffer so any additional requests to the drive will be satisfied
from this buffer and not have to wait for the disk itself.

David Lang

> On Thu, Sep 11, 2008 at 12:54 PM, James Mansion <
> james@mansionfamily.plus.com> wrote:
>
>> Greg Smith wrote:
>>
>>> The point I was trying to make there is that even under impossibly optimal
>>> circumstances, you'd be hard pressed to blow out the disk's read cache with
>>> seek-dominated data even if you read a lot at each seek point.  That idea
>>> didn't make it from my head into writing very well though.
>>>
>>>  Isn't there a bigger danger in blowing out the cache on the controller
>> and causing premature pageout of its dirty pages?
>>
>> If you could get the readahead to work on the drive and not return data to
>> the controller, that might be dandy, but I'm sceptical.
>>
>> James
>>
>>
>>
>>
>

Re: Effects of setting linux block device readahead size

From
"Scott Marlowe"
Date:
On Thu, Sep 11, 2008 at 3:36 PM,  <david@lang.hm> wrote:
> On Thu, 11 Sep 2008, Scott Carey wrote:
>
>> Drives have their own read-ahead in the firmware.  Many can keep track of
>> 2
>> or 4 concurrent file accesses.  A few can keep track of more.  This also
>> plays in with the NCQ or SCSI command queuing implementation.
>>
>> Consumer drives will often read-ahead much more than server drives
>> optimized
>> for i/o per second.
>> The difference in read-ahead sensitivity between the two setups I tested
>> may
>> be due to one setup using nearline-SAS (SATA, tuned for io-per sec using
>> SAS
>> firmware) and the other used consumer SATA.
>> For example, here is one "nearline SAS" style server tuned drive versus a
>> consumer tuned one:
>>
>>
http://www.storagereview.com/php/benchmark/suite_v4.php?typeID=10&testbedID=4&osID=6&raidconfigID=1&numDrives=1&devID_0=354&devID_1=348&devCnt=2
>>
>> The Linux readahead setting is _definitely_ in the kernel, definitely uses
>> and fills the page cache, and from what I can gather, simply issues extra
>> I/O's to the hardware beyond the last one requested by an app in certain
>> situations.  It does not make your I/O request larger, it just queues an
>> extra I/O following your request.
>
> that extra I/O will be merged with your request by the I/O scheduler code so
> that by the time it gets to the drive it will be a single request.
>
> by even if it didn't, most modern drives read the entire cylinder into their
> buffer so any additional requests to the drive will be satisfied from this
> buffer and not have to wait for the disk itself.

Generally speaking I agree, but I would still make a separate logical
partition for pg_xlog so that if the OS fills up the /var/log dir or
something, it doesn't impact the db.

Re: Effects of setting linux block device readahead size

From
david@lang.hm
Date:
On Thu, 11 Sep 2008, Scott Marlowe wrote:

> On Thu, Sep 11, 2008 at 3:36 PM,  <david@lang.hm> wrote:
>> by even if it didn't, most modern drives read the entire cylinder into their
>> buffer so any additional requests to the drive will be satisfied from this
>> buffer and not have to wait for the disk itself.
>
> Generally speaking I agree, but I would still make a separate logical
> partition for pg_xlog so that if the OS fills up the /var/log dir or
> something, it doesn't impact the db.

this is a completely different discussion :-)

while I agree with you in theory, in practice I've seen multiple
partitions cause far more problems than they have prevented (due to the
partitions ending up not being large enough and having to be resized after
they fill up, etc) so I tend to go in the direction of a few large
partitions.

the only reason I do multiple partitions (besides when the hardware or
performance considerations require it) is when I can identify that there
is some data that I would not want to touch on an OS upgrade. I try to make
it so that an OS upgrade can wipe the OS partitions if necessary.

David Lang


Re: Effects of setting linux block device readahead size

From
Alan Hodgson
Date:
On Thursday 11 September 2008, david@lang.hm wrote:
> while I agree with you in theory, in practice I've seen multiple
> partitions cause far more problems than they have prevented (due to the
> partitions ending up not being large enough and having to be resized
> after they fill up, etc) so I tend to go in the direction of a few large
> partitions.

I used to feel this way until LVM became usable. LVM plus online resizable
filesystems really makes multiple partitions manageable.


--
Alan

Re: Effects of setting linux block device readahead size

From
david@lang.hm
Date:
On Thu, 11 Sep 2008, Alan Hodgson wrote:

> On Thursday 11 September 2008, david@lang.hm wrote:
>> while I agree with you in theory, in practice I've seen multiple
>> partitions cause far more problems than they have prevented (due to the
>> partitions ending up not being large enough and having to be resized
>> after they fill up, etc) so I tend to go in the direction of a few large
>> partitions.
>
> I used to feel this way until LVM became usable. LVM plus online resizable
> filesystems really makes multiple partitions manageable.

won't the fragmentation of your filesystem across the different LVM
segments hurt you?

David Lang

Re: Effects of setting linux block device readahead size

From
"Scott Carey"
Date:
I also thought that LVM is unsafe for WAL logs and file system journals with disk write cache enabled -- it doesn't flush the disk write caches correctly and doesn't support write barriers.

As pointed out here:
http://groups.google.com/group/pgsql.performance/browse_thread/thread/9dc43991c1887129
by Greg Smith
http://lwn.net/Articles/283161/



On Thu, Sep 11, 2008 at 3:41 PM, Alan Hodgson <ahodgson@simkin.ca> wrote:
On Thursday 11 September 2008, david@lang.hm wrote:
> while I agree with you in theory, in practice I've seen multiple
> partitions cause far more problems than they have prevented (due to the
> partitions ending up not being large enough and having to be resized
> after they fill up, etc) so I tend to go in the direction of a few large
> partitions.

I used to feel this way until LVM became usable. LVM plus online resizable
filesystems really makes multiple partitions manageable.


--
Alan


Re: Effects of setting linux block device readahead size

From
"Scott Marlowe"
Date:
On Thu, Sep 11, 2008 at 4:33 PM,  <david@lang.hm> wrote:
> On Thu, 11 Sep 2008, Scott Marlowe wrote:
>
>> On Thu, Sep 11, 2008 at 3:36 PM,  <david@lang.hm> wrote:
>>>
>>> by even if it didn't, most modern drives read the entire cylinder into
>>> their
>>> buffer so any additional requests to the drive will be satisfied from
>>> this
>>> buffer and not have to wait for the disk itself.
>>
>> Generally speaking I agree, but I would still make a separate logical
>> partition for pg_xlog so that if the OS fills up the /var/log dir or
>> something, it doesn't impact the db.
>
> this is a completely different discussion :-)
>
> while I agree with you in theory, in practice I've seen multiple partitions
> cause far more problems than they have prevented (due to the partitions
> ending up not being large enough and having to be resized after they fill
> up, etc) so I tend to go in the direction of a few large partitions.

I've never had that problem.  I've always made them big enough.  I
can't imagine building a server where /var/log shared space with my
db.  It's not like every root level dir gets its own partition, but
seriously, logs should never go anywhere that another application is
writing to.

> the only reason I do multiple partitions (besides when the hardware or
> performance considerations require it) is when I can identify that there is
> some data that I would not want to touch on a OS upgrade. I try to make it
> so that an OS upgrade can wipe the OS partitions if nessasary.

It's quite handy to have /home on a separate partition, I agree.  But
on most servers /home should be empty.  A few others like /opt or
/usr/local I tend to make separate partitions for as well, for the
reasons you mention.

Re: Effects of setting linux block device readahead size

From
Greg Smith
Date:
On Thu, 11 Sep 2008, Alan Hodgson wrote:

> LVM plus online resizable filesystems really makes multiple partitions
> manageable.

I've seen so many reports blaming Linux's LVM for performance issues that
its manageability benefits don't seem too compelling.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Effects of setting linux block device readahead size

From
James Mansion
Date:
Scott Carey wrote:
> Consumer drives will often read-ahead much more than server drives
> optimized for i/o per second.
...
> The Linux readahead setting is _definitely_ in the kernel, definitely
> uses and fills the page cache, and from what I can gather, simply
> issues extra I/O's to the hardware beyond the last one requested by an
> app in certain situations.  It does not make your I/O request larger,
> it just queues an extra I/O following your request.
So ... fiddling with settings in Linux is going to force read-ahead, but
the read-ahead data will hit the controller cache and the system buffers.

And the drives use their caches for cylinder caching implicitly (maybe
the SATA drives appear to preread more because the storage density per
cylinder is higher?).

But is there any way for an OS or application to (portably) ask SATA,
SAS or SCSI drives to read ahead more (or less) than their default and
NOT return the data to the controller?

I've never heard of such a thing, but I'm no expert in the command sets
for any of this stuff.

James

>
> On Thu, Sep 11, 2008 at 12:54 PM, James Mansion
> <james@mansionfamily.plus.com <mailto:james@mansionfamily.plus.com>>
> wrote:
>
>     Greg Smith wrote:
>
>         The point I was trying to make there is that even under
>         impossibly optimal circumstances, you'd be hard pressed to
>         blow out the disk's read cache with seek-dominated data even
>         if you read a lot at each seek point.  That idea didn't make
>         it from my head into writing very well though.
>
>     Isn't there a bigger danger in blowing out the cache on the
>     controller and causing premature pageout of its dirty pages?
>
>     If you could get the readahead to work on the drive and not return
>     data to the controller, that might be dandy, but I'm sceptical.
>
>     James
>
>
>
>
>


Re: Effects of setting linux block device readahead size

From
david@lang.hm
Date:
On Fri, 12 Sep 2008, James Mansion wrote:

> Scott Carey wrote:
>> Consumer drives will often read-ahead much more than server drives
>> optimized for i/o per second.
> ...
>> The Linux readahead setting is _definitely_ in the kernel, definitely uses
>> and fills the page cache, and from what I can gather, simply issues extra
>> I/O's to the hardware beyond the last one requested by an app in certain
>> situations.  It does not make your I/O request larger, it just queues an
>> extra I/O following your request.
> So ... fiddling with settings in Linux is going to force read-ahead, but the
> read-ahead data will hit the controller cache and the system buffers.
>
> And the drives use their caches for cyclinder caching implicitly (maybe the
> SATA drives appear to preread more because the storage density per cylinder
> is higher?)..
>
> But is there any way for an OS or application to (portably) ask SATA, SAS or
> SCSI drives to read ahead more (or less) than their default and NOT return
> the data to the controller?
>
> I've never heard of such a thing, but I'm no expert in the command sets for
> any of this stuff.

I'm pretty sure that's not possible.  The OS isn't supposed to even know
the internals of the drive.

David Lang

> James
>
>>
>> On Thu, Sep 11, 2008 at 12:54 PM, James Mansion
>> <james@mansionfamily.plus.com <mailto:james@mansionfamily.plus.com>> wrote:
>>
>>     Greg Smith wrote:
>>
>>         The point I was trying to make there is that even under
>>         impossibly optimal circumstances, you'd be hard pressed to
>>         blow out the disk's read cache with seek-dominated data even
>>         if you read a lot at each seek point.  That idea didn't make
>>         it from my head into writing very well though.
>>
>>     Isn't there a bigger danger in blowing out the cache on the
>>     controller and causing premature pageout of its dirty pages?
>>
>>     If you could get the readahead to work on the drive and not return
>>     data to the controller, that might be dandy, but I'm sceptical.
>>
>>     James
>>
>>
>>
>>
>>
>
>
>

Re: Effects of setting linux block device readahead size

From
Matthew Wakeling
Date:
On Thu, 11 Sep 2008, Scott Carey wrote:
> Preliminary summary:
>
> readahead  |  8 conc read rate  |  1 conc read rate
> 49152  |  311  |  314
> 16384  |  312  |  312
> 12288  |  304  |  309
>  8192  |  292  |
>  4096  |  264  |
>  2048  |  211  |
>  1024  |  162  |  302
>   512  |  108  |
>   256  |  81  | 300
>     8  |  38  |

What io scheduler are you using? The anticipatory scheduler is meant to
prevent this slowdown with multiple concurrent reads.

Matthew


--
And the lexer will say "Oh look, there's a null string. Oooh, there's
another. And another.", and will fall over spectacularly when it realises
there are actually rather a lot.
         - Computer Science Lecturer (edited)

Re: Effects of setting linux block device readahead size

From
"Scott Carey"
Date:
Good question.  I'm in the process of completing more exhaustive tests with the various disk i/o schedulers.

Basic findings so far: it depends on what type of concurrency is going on.  Deadline has the best performance over a range of readahead values compared to cfq or anticipatory with concurrent sequential reads on xfs.  However, mixing random and sequential reads puts cfq ahead with low readahead values and deadline ahead with large readahead values (I have not tried anticipatory here yet).  Your preference for prioritizing streaming over random will significantly affect which one you want and at what readahead value -- cfq does a better job of balancing the two consistently, while deadline swings strongly toward streaming as the readahead value gets larger and toward random when it is low.  Deadline and CFQ are similar with concurrent random reads.  I have not gotten to any write tests or concurrent read/write tests.

I expect the anticipatory scheduler to perform worse with mixed loads -- anything asking a raid array that can do 1000 iops to wait for 7 ms and do nothing just in case a read in the same area might occur is a bad idea for aggregate concurrent throughput.  It is a scheduler that assumes the underlying hardware is essentially one spindle -- which is why it is so good in a standard PC or laptop.   But, I could be wrong.
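
For reference, schedulers can be switched per device on the fly between runs, along these lines (sdb is just an example device):

cat /sys/block/sdb/queue/scheduler             # available schedulers, current one in brackets
echo deadline > /sys/block/sdb/queue/scheduler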

On Mon, Sep 15, 2008 at 9:18 AM, Matthew Wakeling <matthew@flymine.org> wrote:
On Thu, 11 Sep 2008, Scott Carey wrote:
Preliminary summary:


readahead  |  8 conc read rate  |  1 conc read rate
49152  |  311  |  314
16384  |  312  |  312
12288  |  304  |  309
 8192  |  292  |
 4096  |  264  |
 2048  |  211  |
 1024  |  162  |  302
  512  |  108  |
  256  |  81  | 300
    8  |  38  |

What io scheduler are you using? The anticipatory scheduler is meant to prevent this slowdown with multiple concurrent reads.

Matthew


--
And the lexer will say "Oh look, there's a null string. Oooh, there's another. And another.", and will fall over spectacularly when it realises
there are actually rather a lot.
       - Computer Science Lecturer (edited)
