Re: Effects of setting linux block device readahead size - Mailing list pgsql-performance

From Greg Smith
Subject Re: Effects of setting linux block device readahead size
Msg-id Pine.GSO.4.64.0809101313070.4714@westnet.com
In response to Re: Effects of setting linux block device readahead size  ("Scott Carey" <scott@richrelevance.com>)
Responses Re: Effects of setting linux block device readahead size
Re: Effects of setting linux block device readahead size
List pgsql-performance
On Wed, 10 Sep 2008, Scott Carey wrote:

> How does that readahead tunable affect random reads or mixed random /
> sequential situations?

It still helps as long as you don't make the parameter giant.  The read
cache in a typical hard drive nowadays is 8-32MB.  If you're seeking a
lot, you still might as well read the next 1MB or so after the block
requested once you've gone to the trouble of moving the disk somewhere.
Seek-bound workloads will only waste a relatively small amount of the
disk's read cache that way--the slow seek rate itself keeps that from
polluting the buffer cache too fast with those reads--while sequential
ones benefit enormously.

If you look at Mark's tests, you can see approximately where the readahead
is filling the disk's internal buffers, because what happens then is the
sequential read performance improvement levels off.  That looks near 8MB
for the array he's tested, but I'd like to see a single disk to better
feel that out.  Basically, once you know that, you back off from there as
much as you can without killing sequential performance completely and that
point should still support a mixed workload.
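
To make that back-off step concrete, here is a minimal Python sketch.  The
numbers in it are invented for illustration and are not Mark's results; the
function name and the 10% tolerance are assumptions too.  It picks the
smallest read-ahead whose sequential throughput stays within the tolerance
of the best result measured.

```python
def smallest_adequate_readahead(results, tolerance=0.10):
    """Return the smallest read-ahead (KB) that keeps sequential
    throughput within `tolerance` of the best measured value.

    results: dict mapping read-ahead size in KB -> sequential read MB/s
    """
    best = max(results.values())
    floor = best * (1.0 - tolerance)
    # Every setting at or above the floor is "good enough"; take the smallest.
    return min(ra for ra, mbps in results.items() if mbps >= floor)


# Hypothetical measurements (read-ahead KB -> MB/s), made up for this example.
SAMPLE_RESULTS = {
    128: 80,
    256: 110,
    512: 150,
    1024: 185,
    2048: 195,
    4096: 199,
    8192: 200,
    16384: 200,
}

if __name__ == "__main__":
    print(smallest_adequate_readahead(SAMPLE_RESULTS))
```

With made-up data shaped like this, the curve flattens at 8MB but you can
back off to 1MB while giving up less than 10% of sequential throughput.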

Disks are fairly well understood physical components, and if you think in
those terms you can build a gross model easily enough:

Average seek time:       4ms
Seeks/second:            250
Data read/seek:          1MB    (read-ahead number goes here)
Total read bandwidth:    250MB/s

Since that's around what a typical interface can support, that's why I
suggest a 1MB read-ahead shouldn't hurt even seek-only workloads, and it's
pretty close to optimal for sequential as well here (big improvement from
the default Linux RA of 256 sectors = 128K).  If you know your work is
biased heavily toward sequential scans, you might pick the 8MB read-ahead
instead.  That value ("blockdev --setra 16384" -> 8MB) has actually been
the standard "start here" setting 3ware suggests on Linux for a while now:
http://www.3ware.com/kb/Article.aspx?id=11050
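
The gross model above and the sector arithmetic are simple enough to check
in a few lines of Python.  This is only a sketch of the same
back-of-envelope math: it ignores transfer time, just like the table does,
and the function names are made up for this example.  The only outside fact
it relies on is that blockdev --setra counts 512-byte sectors.

```python
SECTOR_BYTES = 512  # blockdev --setra is expressed in 512-byte sectors


def seek_bound_bandwidth_mb(avg_seek_ms, readahead_mb):
    """MB/s read if every seek is followed by one read-ahead-sized read.

    Same simplification as the table: transfer time is ignored.
    """
    seeks_per_second = 1000.0 / avg_seek_ms
    return seeks_per_second * readahead_mb


def readahead_bytes(sectors):
    """Convert a --setra value (sectors) to bytes."""
    return sectors * SECTOR_BYTES


def setra_sectors(size_bytes):
    """Convert a read-ahead size in bytes to a --setra value (sectors)."""
    return size_bytes // SECTOR_BYTES


if __name__ == "__main__":
    print(seek_bound_bandwidth_mb(4, 1))      # 4ms seeks, 1MB read-ahead
    print(readahead_bytes(256) // 1024)       # Linux default, in KB
    print(readahead_bytes(16384) // 1024**2)  # 3ware suggestion, in MB
    print(setra_sectors(1024**2))             # --setra value for 1MB
```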

> I would be very interested in a mixed fio profile with a "background writer"
> doing moderate, paced random and sequential writes combined with concurrent
> sequential reads and random reads.

Trying to make disk benchmarks really complicated is a path that leads to
a lot of wasted time.  I once made a gigantic design plan for a disk
benchmarking tool that worked like the PostgreSQL buffer management
system.  I threw it away after confirming I could do better with carefully
scripted pgbench tests.

If you want to benchmark something that looks like a database workload,
benchmark a database workload.  That will always be better than guessing
synthetically at how such a workload behaves.  The "seeks/second" number
bonnie++ spits out is good enough in most cases for figuring out whether
you've detuned seeks badly.

"pgbench -S" run against a giant database gives results that look a lot
like seeks/second, and if you mix multiple custom -f tests together it
will round-robin between them at random...
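
As a sketch of what mixing custom tests looks like, the Python below just
assembles a pgbench command line with several -f scripts.  The script
names, database name, and helper function are hypothetical; the -n, -c, -t,
and -f switches are standard pgbench options.

```python
import shlex


def pgbench_command(dbname, scripts, clients=8, transactions=1000):
    """Build an argv list for a pgbench run that mixes custom scripts.

    -n skips the pre-run vacuum of the standard pgbench tables,
    -c sets the client count, -t the transactions per client,
    and each -f adds one custom script file to the mix.
    """
    cmd = ["pgbench", "-n", "-c", str(clients), "-t", str(transactions)]
    for script in scripts:
        cmd += ["-f", script]
    cmd.append(dbname)
    return cmd


if __name__ == "__main__":
    # Hypothetical script files and database name, for illustration only.
    argv = pgbench_command("bench", ["seq_scan.sql", "random_read.sql"])
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Building the list rather than a single string keeps it safe to hand to
subprocess.run() without shell quoting problems.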

It's really helpful to measure these various disk subsystem parameters
individually.  Knowing the sequential read/write, seeks/second, and commit
rate for a disk setup is mainly valuable for making sure you're getting the
full performance expected from what you've got.  Take this example, where
something was obviously off in the single disk results because reads
were significantly slower than writes.  That's not supposed to happen, so
you know something basic is wrong before you even get into RAID and such.
Beyond confirming whether or not you're getting approximately what you
should be out of the basic hardware, disk benchmarks are much less useful
than application ones.

With all that, I think I just gave away what the next conference paper
I've been working on is about.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
