Thread: How is random_page_cost=4 ok?

How is random_page_cost=4 ok?

From
Gregory Stark
Date:
I'm kind of curious where the value of 4 for random_page_cost came from. 
IIRC, Tom, you mentioned it came from tests you ran -- were those raw i/o
tests or Postgres cost estimates compared to execution times?

The reason I'm wondering about this is that it seems out of line with raw i/o
numbers. Typical values for consumer drives are a sustained throughput of about
60MB/s (i.e. about .2ms per 8k) and a seek latency of 4ms. That gives a ratio of 20.

Server-class drives have an even higher ratio, since a 15kRPM drive can have a
sustained bandwidth of 110-170 MB/s (48us-75us per 8k) and an average seek latency of
2ms, giving a ratio of 27-42. And of course that doesn't include the effects of
a RAID array, which magnifies that ratio.
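
To make the arithmetic explicit, here's a quick sketch of how those
ratios fall out of the quoted specs (illustrative spec-sheet figures,
not measurements):

    # Rough ratio of a random 8k page read to a sequential one, using the
    # drive specs quoted above.
    PAGE_BYTES = 8192

    def raw_ratio(seq_mb_per_s, seek_ms):
        """Seek latency divided by the time to stream one 8k page."""
        transfer_ms = PAGE_BYTES / (seq_mb_per_s * 1_000_000) * 1000
        return seek_ms / transfer_ms

    print(raw_ratio(60, 4))     # consumer drive: ~29 (calling the page time
                                #   .2ms, as above, gives the rounder 20)
    print(raw_ratio(110, 2))    # 15kRPM drive, slow zone: ~27
    print(raw_ratio(170, 2))    # 15kRPM drive, fast zone: ~42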

I'm concerned that if we start recommending such large random_page_cost values
as these, it'll produce plans that are very different from what people have
grown used to. And the value of 4 seems to work well in practice.

I suspect the root of all this is that random_page_cost is encoding within it
the effects of caching. If that's true shouldn't we find a way to model
caching using effective_cache_size instead so that people can set
random_page_cost realistically based on their hardware?

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!


Re: How is random_page_cost=4 ok?

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> I'm kind of curious where the value of 4 for random_page_cost came from. 
> IIRC, Tom, you mentioned it came from tests you ran -- were those raw i/o
> tests or Postgres cost estimates compared to execution times?

It was based on actual query execution times, but that was with 1990s
hardware.  It doesn't surprise me at all that modern drives would show a
larger ratio --- seems like transfer rate has improved more than seek
times.  And it's also the case that we've squeezed a lot of overhead
out of Postgres' tuple processing code since then, so that might be
another way in which seqscans have gotten cheaper.

> I'm concerned that if we start recommending such large random_page_costs as
> these it'll produce plans that are very different from what people have grown
> used to. And the value of 4 seems to work well in practice.

Yeah, it seems like raising random_page_cost is not something we ever
recommend in practice.  I suspect what we'd really need here to make any
progress is a more detailed cost model, not just fooling with the
parameters of the existing one.

> I suspect the root of all this is that random_page_cost is encoding within it
> the effects of caching. If that's true shouldn't we find a way to model
> caching using effective_cache_size instead so that people can set
> random_page_cost realistically based on their hardware?

We do model caching using effective_cache_size.  One thing we definitely
lack is any understanding of the effects of caching across multiple
queries.  I'm not sure what other first-order effects are missing from
the model ...
        regards, tom lane


Re: How is random_page_cost=4 ok?

From
Michael Renner
Date:
Gregory Stark wrote:

> The reason I'm wondering about this is that it seems out of line with raw i/o
> numbers. Typical values for consumer drives are a sustained throughput of about
> 60MB/s (i.e. about .2ms per 8k) and a seek latency of 4ms. That gives a ratio of 20.
> 
> Server-class drives have an even higher ratio, since a 15kRPM drive can have a
> sustained bandwidth of 110-170 MB/s (48us-75us per 8k) and an average seek latency of
> 2ms, giving a ratio of 27-42. And of course that doesn't include the effects of
> a RAID array, which magnifies that ratio.

Hi Gregory,

I think your numbers are a bit off:

For "Consumer drives" (7.200 RPM SATA 3.5"), seek times are much worse, 
in the area of 8-9ms (see [1]), but sustained sequential read numbers 
are noticeable higher, around 80-90MB/sec.

For "Server Drives" 3-4ms are more realistic ([2], [3]) for average 
seeks and the 110-170MB/sec are highly exaggerated.


Unfortunately I have only 2.5" SAS 10k drives and no FreeBSD here, 
otherwise I could provide some real world numbers; the diskinfo tool in 
[3] looks really nice (and makes me crave FreeBSD).

best regards,
Michael

[1] http://h18004.www1.hp.com/products/quickspecs/13021_div/13021_div.html

[2] http://h18004.www1.hp.com/products/quickspecs/12244_div/12244_div.html

[3] http://blog.insidesystems.net/articles/2007/04/09/unscientific-15k-v-10k-sas-drive-comparison


Re: How is random_page_cost=4 ok?

From
Gregory Stark
Date:
Michael Renner <michael.renner@amd.co.at> writes:

> I think your numbers are a bit off:
>
> For "Consumer drives" (7.200 RPM SATA 3.5"), seek times are much worse, in the
> area of 8-9ms (see [1]), but sustained sequential read numbers are noticeable
> higher, around 80-90MB/sec.

I took the seek latency from the data sheet for a Barracuda 7200.9, which is
several generations old but still a current model. Rotational latency alone
would have a worst case of 8.3ms, and half that is precisely the 4.16ms they
quote, so I suspect that's where the number comes from. Not especially helpful,
perhaps.

They don't quote sustained bandwidth for consumer drives but 50-60MB/s are the
numbers I remembered -- admittedly from more than a couple years ago. I didn't
realize 7200 RPM drives had reached such speeds yet.

But with your numbers things look even weirder. With a 90MB/s sequential speed
(91us) and 9ms seek latency that would be a random_page_cost of nearly 100!
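
For what it's worth, both numbers fall straight out of the drive
geometry; the same sort of back-of-the-envelope sketch:

    # Rotational latency at 7200 RPM, and the raw ratio with the revised numbers.
    full_rotation_ms = 60_000 / 7200            # 8.33ms worst case
    print(full_rotation_ms / 2)                 # 4.17ms -- the "4.16" on the data sheet

    page_ms = 8192 / (90 * 1_000_000) * 1000    # ~0.091ms per 8k page at 90MB/s
    print(9.0 / page_ms)                        # ~99 -- "nearly 100"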

> For "Server Drives" 3-4ms are more realistic ([2], [3]) for average seeks and
> the 110-170MB/sec are highly exaggerated.

In that case both of those numbers come straight from Seagate's data sheet for
their top-of-the-line data centre drives:

http://www.seagate.com/docs/pdf/datasheet/disc/ds_cheetah_15k_6.pdf

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication support!


Re: How is random_page_cost=4 ok?

From
Gregory Stark
Date:
Gregory Stark <stark@enterprisedb.com> writes:

>> For "Server Drives" 3-4ms are more realistic ([2], [3]) for average seeks and
>> the 110-170MB/sec are highly exaggerated.
>
> In that case both of those numbers come straight from Seagate's data sheet for
> their top-of-the-line data centre drives:
>
> http://www.seagate.com/docs/pdf/datasheet/disc/ds_cheetah_15k_6.pdf

Oh, but I just noticed they separately quote latency and read/write seek time.
The average read seek time is 3.4ms. That gives a random_page_cost of 45-71.
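
Same back-of-the-envelope arithmetic as before, just with the data-sheet
read seek time plugged in:

    # Average read seek time over the per-page transfer time at the two
    # ends of the quoted 110-170MB/s range.
    seek_ms = 3.4
    for mb_per_s in (110, 170):
        page_ms = 8192 / (mb_per_s * 1_000_000) * 1000
        print(seek_ms / page_ms)    # ~45.6 and ~70.6, i.e. the 45-71 above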

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!


Re: How is random_page_cost=4 ok?

From
Michael Renner
Date:
Gregory Stark wrote:

> But with your numbers things look even weirder. With a 90MB/s sequential speed
> (91us) and 9ms seek latency that would be a random_page_cost of nearly 100!

Looks good :). If you actually want to base something on real-world 
numbers, I'd suggest collecting them beforehand from existing setups. I 
was introduced to IOmeter [1] at an HP performance course; it's a nice 
GUI tool that lets you define workloads to your liking and run them 
against given block devices, but unfortunately it's Windows-only. fio 
[2] and IOzone [3] should do the same for the Unix world, without the 
"nice" and "GUI" parts ;).


As for improving the model - in what situations would we benefit from a 
more accurate model here?


Is it correct that this is only relevant for large (if not huge) tables 
which border on (or don't fit in) effective_cache_size (and, 
correspondingly, the OS page cache)?

And we need the cost to decide between a sequential scan, an index scan 
(ORDER BY, small expected result set), and a bitmap index scan?


Speaking of bitmap index/heap scans - are those counted against 
seq_page_cost or random_page_cost?


regards,
michael

[1] http://www.iometer.org/
[2] http://freshmeat.net/projects/fio/
[3] http://www.iozone.org/


Re: How is random_page_cost=4 ok?

From
Greg Smith
Date:
On Fri, 10 Oct 2008, Gregory Stark wrote:

> They don't quote sustained bandwidth for consumer drives but 50-60MB/s are the
> numbers I remembered -- admittedly from more than a couple years ago. I didn't
> realize 7200 RPM drives had reached such speeds yet.

The cheap ($42!) 7200RPM SATA disks I bought a stack of for my home server 
hit 110MB/s sequential at the outer edge; at the other end throughput is 
still 60-70MB/s.  The smaller capacities of Seagate's 7200.11 average 
about 100MB/s nowadays.  But by the time you seek to a location (8-9ms) 
and line the heads up (half a rotation at 7200RPM averages 4ms), you can 
easily end up at a measured access time of 12-13ms or higher on random 
reads with those.  So the true random/sequential ratio reaches crazy 
numbers.
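
Roughly, with round numbers (a sketch, not a benchmark):

    # Measured random access time ~= average seek + half a rotation, compared
    # against streaming one 8k page sequentially at ~100MB/s.
    half_rotation_ms = 60_000 / 7200 / 2            # ~4.2ms at 7200RPM
    page_ms = 8192 / (100 * 1_000_000) * 1000       # ~0.08ms per 8k page
    for seek_ms in (8, 9):
        access_ms = seek_ms + half_rotation_ms      # ~12-13ms total
        print(access_ms / page_ms)                  # ~150-160: the "crazy numbers"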

I don't think random_page_cost actually corresponds with any real number 
anymore.  I just treat it as an uncalibrated knob you can turn and 
benchmark the results at.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: How is random_page_cost=4 ok?

From
"Kevin Grittner"
Date:
>>> Greg Smith <gsmith@gregsmith.com> wrote: 
> I don't think random_page_cost actually corresponds with any real
> number anymore.  I just treat it as an uncalibrated knob you can turn
> and benchmark the results at.
Same here.  We have always found best performance in our production
environments with this set to somewhere from the same as seq_page_cost
to twice seq_page_cost -- depending on how much of the database is
cached.  As we get toward more heavily cached databases we also reduce
seq_page_cost.  So our (seq_page_cost, random_page_cost) settings range
from (0.1, 0.1) to (1, 2).  These have really become abstractions with
legacy names.
If I had to suggest how someone might choose a starting setting, I would say
that seq_page_cost should be the proportion of sequential scans likely
to need to go to the disk, and random_page_cost should be two times the
proportion of heap data which doesn't fit in cache space.  Add 0.1 to
both numbers and then truncate to one decimal place.  This, of
course, assumes a battery-backed caching RAID controller, a reasonable
RAID for the data set, and one of the more typical types of usage
patterns.
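
In other words, something like this minimal sketch of the rule of thumb
(the two input fractions are made-up examples; you'd have to estimate
them for your own workload):

    import math

    def starting_costs(frac_seqscans_hitting_disk, frac_heap_not_cached):
        """Add 0.1 to each figure and truncate to one decimal place."""
        trunc = lambda x: math.floor(x * 10) / 10
        return (trunc(frac_seqscans_hitting_disk + 0.1),
                trunc(2 * frac_heap_not_cached + 0.1))

    print(starting_costs(0.0, 0.0))     # fully cached: (0.1, 0.1)
    print(starting_costs(0.9, 0.95))    # mostly uncached: (1.0, 2.0)
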
-Kevin


Re: How is random_page_cost=4 ok?

From
Tom Lane
Date:
Greg Smith <gsmith@gregsmith.com> writes:
> ...  So the true random/sequential ratio 
> reaches crazy numbers.

Bear in mind that seq_page_cost and random_page_cost are intended to
represent the time to read *and process* a page, so there's some CPU
component involved there, and this limits the ratio that could be
reached in practice.
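
A toy illustration of that point (the per-page CPU figures are invented
purely for illustration, not measured):

    # Adding a per-page CPU component to both sides caps how large the
    # effective random/sequential ratio can get.
    def effective_ratio(seek_ms, transfer_ms, cpu_ms):
        return (seek_ms + cpu_ms) / (transfer_ms + cpu_ms)

    print(effective_ratio(9.0, 0.09, 0.0))    # ~100: raw disk-only ratio
    print(effective_ratio(9.0, 0.09, 0.1))    # ~48:  same disk, 0.1ms CPU per page
    print(effective_ratio(9.0, 0.09, 0.5))    # ~16:  heavier per-page processing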

In particular, if the OS lays out successive file pages in a way that
provides zero latency between logically adjacent blocks, I'd bet a good
bit that a Postgres seqscan would miss the read timing every time, and
degrade to handling about one block per disk rotation.  Those 100MB/s
numbers are just mirages as far as seqscan speed goes.
        regards, tom lane


Re: How is random_page_cost=4 ok?

From
Ron Mayer
Date:
Tom Lane wrote:
> In particular, if the OS lays out successive file pages in a way that
> provides zero latency between logically adjacent blocks, I'd bet a good
> bit that a Postgres seqscan would miss the read timing every time, and
> degrade to handling about one block per disk rotation.

Unless the OS does some readahead when it sees something like a seq scan?



Re: How is random_page_cost=4 ok?

From
Josh Berkus
Date:
> I don't think random_page_cost actually corresponds with any real number 
> anymore.  I just treat it as an uncalibrated knob you can turn and 
> benchmark the results at.

And, frankly, not a useful knob.  You get much more useful results out 
of effective_cache_size and cpu_* costs than you get out of messing with 
random_page_cost, unless you're running on SSD or something which would 
justify a lower RPC, or if you're compensating for our poor n-distinct 
estimation for very large tables.

--Josh



Re: How is random_page_cost=4 ok?

From
Gregory Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

>> I don't think random_page_cost actually corresponds with any real number
>> anymore.  I just treat it as an uncalibrated knob you can turn and benchmark
>> the results at.
>
> And, frankly, not a useful knob.  You get much more useful results out of
> effective_cache_size and cpu_* costs than you get out of messing with
> random_page_cost, unless you're running on SSD or something which would justify
> a lower RPC, or if you're compensating for our poor n-distinct estimation for
> very large tables.

Uh, that doesn't make much sense. effective_cache_size is currently only used
to estimate intra-query caching effects. It doesn't account for steady-state
cache hit rates.

And "our poor n-distinct estimation" is a problem which manifests by having
inconsistent estimates for number of tuples. It could be high one day and low
the next, so I don't see how biasing in any specific direction could be
helpful. In any case adjusting random_page_cost would be missing the target by
a wide margin since it's not going to fix the tuple count estimate itself in
any way and the rest of the plan will be predicated on that estimate, not just
the estimated cost of the scan.

Adjusting the cpu_* costs together amounts to the same thing as adjusting
seq_page_cost and random_page_cost together, since the numbers are all relative
to each other and that's the whole set. I.e., doubling all the cpu_* costs is
the same as halving the two disk costs.
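
A toy illustration of why only the ratios matter (the page counts and the
single lumped cpu cost are made up, and the real planner has several
cpu_* parameters, but the linearity is the point):

    # Plan costs are linear in the cost parameters, so scaling the whole
    # parameter set by a constant never changes which plan looks cheapest.
    def plan_cost(seq_pages, rand_pages, cpu_ops, seq_cost, rand_cost, cpu_cost):
        return seq_pages * seq_cost + rand_pages * rand_cost + cpu_ops * cpu_cost

    seqscan = (10_000, 0, 1_000_000)    # hypothetical seqscan plan
    idxscan = (0, 2_000, 100_000)       # hypothetical index scan plan

    # Doubled cpu cost, original page costs ...
    a = plan_cost(*seqscan, 1.0, 4.0, 0.02)
    b = plan_cost(*idxscan, 1.0, 4.0, 0.02)
    # ... versus original cpu cost, halved page costs: every total is exactly
    # half, so the comparison between the plans comes out the same.
    c = plan_cost(*seqscan, 0.5, 2.0, 0.01)
    d = plan_cost(*idxscan, 0.5, 2.0, 0.01)
    print(a < b, c < d)                 # same answer either way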

In any case your experience doesn't match mine. On a machine with a sizable
raid controller setting random_page_cost higher does generate, as expected,
plans with more bitmap heap scans which are in fact faster.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's PostGIS support!


Re: How is random_page_cost=4 ok?

From
Greg Smith
Date:
On Fri, 10 Oct 2008, Tom Lane wrote:

> In particular, if the OS lays out successive file pages in a way that
> provides zero latency between logically adjacent blocks, I'd bet a good
> bit that a Postgres seqscan would miss the read timing every time, and
> degrade to handling about one block per disk rotation.

The drives themselves, and possibly the OS and disk controller, are all 
running read-ahead algorithms to accelerate this case.  In fact, this is 
the *exact* case for the Linux read-ahead stuff that just went mainline 
recently: http://kerneltrap.org/node/6642

I was reading something the other day about how drives with bigger caches 
are starting to have firmware tuned to just start reading from wherever 
the head happens to end up once the seek has found the right area, even 
if it's not what you asked for, in the hope that you'll want those nearby 
blocks soon, too.  If the drive has 32MB of cache in it and you're 
seeking around, you've got a pretty big working area relative to how fast 
you can fill that with requested data.

And then there's a patch that helps accelerate this process that I should 
get back to benchmarking again...

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD


Re: How is random_page_cost=4 ok?

From
"Nikolas Everett"
Date:


> In any case your experience doesn't match mine. On a machine with a sizable
> raid controller setting random_page_cost higher does generate, as expected,
> plans with more bitmap heap scans which are in fact faster.

We're running postgres backed by a NetApp 3020 via fiber and have had a lot of success setting random page cost very high (10).  Sequential reads are just that much faster.  I'm not sure if that's because we've configured something wrong or what, but that's a really useful knob for us.

Re: How is random_page_cost=4 ok?

From
Michael Renner
Date:
Greg Smith wrote:

> The drives themselves, and possibly the OS and disk controller, are all 
> running read-ahead algorithms to accelerate this case.  In fact, this is 
> the *exact* case for the Linux read-ahead stuff that just went mainline 
> recently: http://kerneltrap.org/node/6642

Apparently only the "simple" stuff hit mainline, see [1] and [2], not 
knowing how this turns out for pg-style loads, especially compared to 
the full-fledged patch.

Readahead is probably too much of a beast that no one dares to touch 
with a 3-foot-pole, unless given a large team with good standing in the 
kernel community and concerted regression testing in whatever 
environment Linux is used these days...


michael


[1] http://lwn.net/Articles/235164/
[2] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=history;f=mm/readahead.c


Re: How is random_page_cost=4 ok?

From
Bruce Momjian
Date:
Nikolas Everett wrote:
> >
> > In any case your experience doesn't match mine. On a machine with a sizable
> > raid controller setting random_page_cost higher does generate, as expected,
> > plans with more bitmap heap scans which are in fact faster.
> >
> 
> We're running postgres backed by a NetApp 3020 via fiber and have had a lot
> of success setting random page cost very high (10).  Sequential reads are
> just that much faster.  I'm not sure if that's because we've configured
> something wrong or what, but that's a really useful knob for us.

One other issue is that sequential I/O for data that is mostly random is
going to wipe more of the buffer cache than random access, so there
should perhaps be some additional cost associated with sequential access
to bias toward random access.  I'm not sure how our new code that prevents
large table scans from wiping the cache affects this.

--  Bruce Momjian  <bruce@momjian.us>        http://momjian.us EnterpriseDB
http://enterprisedb.com
 + If your life is a hard drive, Christ can be your backup. +


Re: How is random_page_cost=4 ok?

From
Decibel!
Date:
On Oct 10, 2008, at 7:41 PM, Nikolas Everett wrote:
> In any case your experience doesn't match mine. On a machine with a  
> sizable
> raid controller setting random_page_cost higher does generate, as  
> expected,
> plans with more bitmap heap scans which are in fact faster.
>
> We're running postgres backed by a NetApp 3020 via fiber and have  
> had a lot of success setting random page cost very high (10).   
> Sequential reads are just that much faster.  I'm not sure if that's
> because we've configured something wrong or what, but that's a
> really useful knob for us.


Is your workload OLTP or OLAP? Promoting seqscans in an OLTP  
environment seems to be a really bad idea to me...
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828