Thread: How is random_page_cost=4 ok?
I'm kind of curious where the value of 4 for random_page_cost came from. IIRC, Tom, you mentioned it came from tests you ran -- were those raw i/o tests or Postgres cost estimates compared to execution times?

The reason I'm wondering about this is it seems out of line with raw i/o numbers. Typical values for consumer drives are about a sustained throughput of 60MB/s (i.e. ~0.2ms per 8k) and seek latency of 4ms. That gives a ratio of 20.

Server-class drives have an even higher ratio, since a 15kRPM drive can have a sustained bandwidth of 110-170 MB/s (48us-75us per 8k) and an average seek latency of 2ms, giving a ratio of 27-42. And of course that doesn't include the effects of a RAID array, which magnifies that ratio.

I'm concerned that if we start recommending such large random_page_costs as these it'll produce plans that are very different from what people have grown used to. And the value of 4 seems to work well in practice.

I suspect the root of all this is that random_page_cost is encoding within it the effects of caching. If that's true, shouldn't we find a way to model caching using effective_cache_size instead, so that people can set random_page_cost realistically based on their hardware?

--
Gregory Stark
EnterpriseDB  http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
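(For reference, those ratios follow directly from the quoted per-page transfer times and seek latencies; a quick psql check using the same figures -- the numbers quoted above, not measured values:)

    -- seek latency divided by per-8k-page transfer time, figures as quoted above
    SELECT 4.0 / 0.2    AS consumer_ratio,     -- 4 ms seek vs ~0.2 ms/page => 20
           2.0 / 0.075  AS server_ratio_low,   -- 2 ms seek vs 75 us/page   => ~27
           2.0 / 0.048  AS server_ratio_high;  -- 2 ms seek vs 48 us/page   => ~42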
Gregory Stark <stark@enterprisedb.com> writes:
> I'm kind of curious where the value of 4 for random_page_cost came from.
> IIRC, Tom, you mentioned it came from tests you ran -- were those raw i/o
> tests or Postgres cost estimates compared to execution times?

It was based on actual query execution times, but that was with 1990s hardware. It doesn't surprise me at all that modern drives would show a larger ratio --- seems like transfer rate has improved more than seek times. And it's also the case that we've squeezed a lot of overhead out of Postgres' tuple processing code since then, so that might be another way in which seqscans have gotten cheaper.

> I'm concerned that if we start recommending such large random_page_costs as
> these it'll produce plans that are very different from what people have grown
> used to. And the value of 4 seems to work well in practice.

Yeah, it seems like raising random_page_cost is not something we ever recommend in practice. I suspect what we'd really need here to make any progress is a more detailed cost model, not just fooling with the parameters of the existing one.

> I suspect the root of all this is that random_page_cost is encoding within it
> the effects of caching. If that's true, shouldn't we find a way to model
> caching using effective_cache_size instead, so that people can set
> random_page_cost realistically based on their hardware?

We do model caching using effective_cache_size. One thing we definitely lack is any understanding of the effects of caching across multiple queries. I'm not sure what other first-order effects are missing from the model ...

			regards, tom lane
Gregory Stark wrote:
> The reason I'm wondering about this is it seems out of line with raw i/o
> numbers. Typical values for consumer drives are about a sustained throughput
> of 60MB/s (i.e. ~0.2ms per 8k) and seek latency of 4ms. That gives a ratio of 20.
>
> Server-class drives have an even higher ratio, since a 15kRPM drive can have a
> sustained bandwidth of 110-170 MB/s (48us-75us per 8k) and an average seek
> latency of 2ms, giving a ratio of 27-42. And of course that doesn't include the
> effects of a RAID array, which magnifies that ratio.

Hi Gregory,

I think your numbers are a bit off:

For "consumer drives" (7,200 RPM SATA 3.5"), seek times are much worse, in the area of 8-9ms (see [1]), but sustained sequential read numbers are noticeably higher, around 80-90MB/sec.

For "server drives", 3-4ms is more realistic ([2], [3]) for average seeks, and the 110-170MB/sec is highly exaggerated.

Unfortunately I have only 2.5" SAS 10k drives and no FreeBSD here, otherwise I could provide some real-world numbers; the diskinfo tool in [3] looks really nice (and makes me crave FreeBSD).

best regards,
Michael

[1] http://h18004.www1.hp.com/products/quickspecs/13021_div/13021_div.html
[2] http://h18004.www1.hp.com/products/quickspecs/12244_div/12244_div.html
[3] http://blog.insidesystems.net/articles/2007/04/09/unscientific-15k-v-10k-sas-drive-comparison
Michael Renner <michael.renner@amd.co.at> writes:
> I think your numbers are a bit off:
>
> For "consumer drives" (7,200 RPM SATA 3.5"), seek times are much worse, in the
> area of 8-9ms (see [1]), but sustained sequential read numbers are noticeably
> higher, around 80-90MB/sec.

I took the seek latency from the data sheet for a Barracuda 7200.9, which is several generations old but still a current model. Just the rotational latency would have a worst case of 8.3ms, and half that is precisely the 4.16ms they quote, so I suspect that's where the number comes from. Not especially helpful perhaps.

They don't quote sustained bandwidth for consumer drives, but 50-60MB/s are the numbers I remembered -- admittedly from more than a couple of years ago. I didn't realize 7200 RPM drives had reached such speeds yet.

But with your numbers things look even weirder. With a 90MB/s sequential speed (91us per 8k page) and 9ms seek latency, that would be a random_page_cost of nearly 100!

> For "server drives", 3-4ms is more realistic ([2], [3]) for average seeks, and
> the 110-170MB/sec is highly exaggerated.

In that case both of those numbers come straight from Seagate's data sheet for their top-of-the-line data centre drives:

http://www.seagate.com/docs/pdf/datasheet/disc/ds_cheetah_15k_6.pdf

--
Gregory Stark
EnterpriseDB  http://www.enterprisedb.com
Ask me about EnterpriseDB's Slony Replication support!
Gregory Stark <stark@enterprisedb.com> writes:
>> For "server drives", 3-4ms is more realistic ([2], [3]) for average seeks, and
>> the 110-170MB/sec is highly exaggerated.
>
> In that case both of those numbers come straight from Seagate's data sheet for
> their top-of-the-line data centre drives:
>
> http://www.seagate.com/docs/pdf/datasheet/disc/ds_cheetah_15k_6.pdf

Oh, but I just noticed they separately quote latency and read/write seek time. The average read seek time is 3.4ms. That gives a random_page_cost of 45-71.

--
Gregory Stark
EnterpriseDB  http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
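(The same arithmetic with the revised figures from the last two messages -- the Cheetah's 3.4ms average read seek against its quoted per-page transfer times, plus the 9ms / ~91us consumer-drive case:)

    SELECT 3.4 / 0.075  AS cheetah_ratio_low,   -- 3.4 ms seek vs 75 us/page => ~45
           3.4 / 0.048  AS cheetah_ratio_high,  -- 3.4 ms seek vs 48 us/page => ~71
           9.0 / 0.091  AS consumer_ratio;      -- 9 ms seek vs ~91 us/page  => ~99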
Gregory Stark wrote:
> But with your numbers things look even weirder. With a 90MB/s sequential speed
> (91us per 8k page) and 9ms seek latency, that would be a random_page_cost of
> nearly 100!

Looks good :). If you actually want to base something on real-world numbers, I'd suggest that we collect them beforehand from existing setups. I was introduced to IOmeter [1] at an HP performance course; it's a nice GUI tool which allows you to define workloads to your liking and test them against given block devices. Unfortunately it's Windows only. fio [2] and IOzone [3] should do the same for the Unix world, without the "nice" and "GUI" parts ;).

For improving the model - in what situations would we benefit from a more accurate model here? Is it correct that this is only relevant for large (if not huge) tables which border on (or don't fit in) effective_cache_size (and, respectively, the OS page cache)? And we need the cost to decide between a sequential scan, an index scan (ORDER BY, small expected result set) and a bitmap index scan?

Speaking of bitmap index/heap scans - are those counted against seq_page_cost or random_page_cost?

regards,
michael

[1] http://www.iometer.org/
[2] http://freshmeat.net/projects/fio/
[3] http://www.iozone.org/
On Fri, 10 Oct 2008, Gregory Stark wrote:

> They don't quote sustained bandwidth for consumer drives, but 50-60MB/s are the
> numbers I remembered -- admittedly from more than a couple of years ago. I
> didn't realize 7200 RPM drives had reached such speeds yet.

The cheap ($42!) 7200RPM SATA disks I bought a stack of for my home server hit a sequential 110MB/s at the beginning edge; at the other end throughput is still 60-70MB/s. The smaller capacities of Seagate's 7200.11 average about 100MB/s nowadays. But by the time you seek to a location (8-9ms) and line the heads up (half a rotation at 7200RPM averages 4ms), you can easily end up at 12-13ms or higher measured access time on random reads with those. So the true random/sequential ratio reaches crazy numbers.

I don't think random_page_cost actually corresponds with any real number anymore. I just treat it as an uncalibrated knob you can turn and benchmark the results at.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
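(Treating it as exactly that kind of knob might look like the sketch below: override the setting per session and compare plans and timings. The table and query here are only placeholders, not anything from this thread:)

    BEGIN;
    SET LOCAL random_page_cost = 4;    -- the default under discussion
    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
    SET LOCAL random_page_cost = 20;   -- something closer to a raw drive ratio
    EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
    ROLLBACK;                          -- SET LOCAL changes vanish with the transaction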
>>> Greg Smith <gsmith@gregsmith.com> wrote:
> I don't think random_page_cost actually corresponds with any real number
> anymore. I just treat it as an uncalibrated knob you can turn and
> benchmark the results at.

Same here. We have always found best performance in our production environments with this set to somewhere from the same as seq_page_cost to twice seq_page_cost -- depending on how much of the database is cached. As we get toward more heavily cached databases we also reduce seq_page_cost, so we range from (0.1, 0.1) to (1, 2). These have really become abstractions with legacy names.

If I had to suggest how someone should choose a starting setting, I would say that seq_page_cost should be the proportion of sequential scans likely to need to go to the disk, and random_page_cost should be two times the proportion of heap data which doesn't fit in cache space. Add 0.1 to both numbers and then truncate to one decimal position. This, of course, assumes a battery-backed caching RAID controller, a reasonable RAID for the data set, and one of the more typical types of usage patterns.

-Kevin
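(A worked example of that rule of thumb, with made-up proportions: say 30% of sequential scans actually hit disk and 60% of the heap doesn't fit in cache:)

    -- seq_page_cost    = 0.30     + 0.1 -> 0.4
    -- random_page_cost = 2 * 0.60 + 0.1 -> 1.3
    SET seq_page_cost = 0.4;
    SET random_page_cost = 1.3;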
Greg Smith <gsmith@gregsmith.com> writes:
> ... So the true random/sequential ratio reaches crazy numbers.

Bear in mind that seq_page_cost and random_page_cost are intended to represent the time to read *and process* a page, so there's some CPU component involved there, and this limits the ratio that could be reached in practice.

In particular, if the OS lays out successive file pages in a way that provides zero latency between logically adjacent blocks, I'd bet a good bit that a Postgres seqscan would miss the read timing every time, and degrade to handling about one block per disk rotation. Those 100MB/s numbers are just mirages as far as seqscan speed goes.

			regards, tom lane
Tom Lane wrote:
> In particular, if the OS lays out successive file pages in a way that
> provides zero latency between logically adjacent blocks, I'd bet a good
> bit that a Postgres seqscan would miss the read timing every time, and
> degrade to handling about one block per disk rotation.

Unless the OS does some readahead when it sees something like a seq scan?
> I don't think random_page_cost actually corresponds with any real number
> anymore. I just treat it as an uncalibrated knob you can turn and
> benchmark the results at.

And, frankly, not a useful knob. You get much more useful results out of effective_cache_size and the cpu_* costs than you get out of messing with random_page_cost, unless you're running on SSD or something else which would justify a lower RPC, or you're compensating for our poor n-distinct estimation for very large tables.

--Josh
Josh Berkus <josh@agliodbs.com> writes:
>> I don't think random_page_cost actually corresponds with any real number
>> anymore. I just treat it as an uncalibrated knob you can turn and benchmark
>> the results at.
>
> And, frankly, not a useful knob. You get much more useful results out of
> effective_cache_size and the cpu_* costs than you get out of messing with
> random_page_cost, unless you're running on SSD or something else which would
> justify a lower RPC, or you're compensating for our poor n-distinct estimation
> for very large tables.

Uh, that doesn't make much sense. effective_cache_size is only used currently to estimate intra-query caching effects. It doesn't compensate for steady-state cache hit rates.

And "our poor n-distinct estimation" is a problem which manifests by having inconsistent estimates for the number of tuples. It could be high one day and low the next, so I don't see how biasing in any specific direction could be helpful. In any case adjusting random_page_cost would be missing the target by a wide margin, since it's not going to fix the tuple count estimate itself in any way, and the rest of the plan will be predicated on that estimate, not just the estimated cost of the scan.

Adjusting the cpu_* costs together amounts to the same thing as adjusting seq_page_cost and random_page_cost together, since the numbers are all relative to each other and that's the whole set. I.e., doubling all the cpu_* costs is the same as halving the two disk costs.

In any case your experience doesn't match mine. On a machine with a sizable RAID controller, setting random_page_cost higher does generate, as expected, plans with more bitmap heap scans, which are in fact faster.

--
Gregory Stark
EnterpriseDB  http://www.enterprisedb.com
Ask me about EnterpriseDB's PostGIS support!
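(That scaling point is easy to see from the shipped defaults, random_page_cost = 4.0 and cpu_tuple_cost = 0.01, since only the ratios between the parameters matter to the planner:)

    SELECT 4.0 / 0.01  AS default_page_to_tuple_ratio,   -- 400
           4.0 / 0.02  AS ratio_with_cpu_costs_doubled,  -- 200
           2.0 / 0.01  AS ratio_with_page_costs_halved;  -- also 200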
On Fri, 10 Oct 2008, Tom Lane wrote:

> In particular, if the OS lays out successive file pages in a way that
> provides zero latency between logically adjacent blocks, I'd bet a good
> bit that a Postgres seqscan would miss the read timing every time, and
> degrade to handling about one block per disk rotation.

The drives themselves, and possibly the OS and disk controller, are all running read-ahead algorithms to accelerate this case. In fact, this *exact* case is one of the targets of the Linux read-ahead stuff that just went mainline recently: http://kerneltrap.org/node/6642

I was reading something the other day about how drives with bigger caches are starting to have firmware tuned to just start reading from wherever the head happens to end up once the seek has found the right area, even if it's not what you asked for, in hopes that you'll want those nearby blocks soon, too. If the drive has 32MB of cache in it and you're seeking around, you've got a pretty big working area relative to how fast you can fill that with requested data.

And then there's a patch that helps accelerate this process I should get back to benchmarking again...

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
> In any case your experience doesn't match mine. On a machine with a sizable
> raid controller setting random_page_cost higher does generate, as expected,
> plans with more bitmap heap scans which are in fact faster.
We're running Postgres backed by a NetApp 3020 via fiber and have had a lot of success setting random_page_cost very high (10). Sequential reads are just that much faster. I'm not sure if that's because we've configured something wrong or what, but that's a really useful knob for us.
Greg Smith wrote:
> The drives themselves, and possibly the OS and disk controller, are all
> running read-ahead algorithms to accelerate this case. In fact, this
> *exact* case is one of the targets of the Linux read-ahead stuff that just
> went mainline recently: http://kerneltrap.org/node/6642

Apparently only the "simple" stuff hit mainline (see [1] and [2]); I don't know how this turns out for pg-style loads, especially compared to the full-fledged patch. Readahead is probably such a beast that no one dares to touch it with a three-foot pole, unless given a large team with good standing in the kernel community and concerted regression testing in whatever environments Linux is used in these days...

michael

[1] http://lwn.net/Articles/235164/
[2] http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=history;f=mm/readahead.c
Nikolas Everett wrote:
> > In any case your experience doesn't match mine. On a machine with a sizable
> > raid controller setting random_page_cost higher does generate, as expected,
> > plans with more bitmap heap scans which are in fact faster.
>
> We're running Postgres backed by a NetApp 3020 via fiber and have had a lot
> of success setting random_page_cost very high (10). Sequential reads are
> just that much faster. I'm not sure if that's because we've configured
> something wrong or what, but that's a really useful knob for us.

One other issue is that sequential I/O for data that is mostly random is going to wipe more of the buffer cache than random access, so there should perhaps be some additional cost associated with sequential access to bias toward random access. Not sure how our new code that prevents large table scans from wiping the cache affects this.

--
Bruce Momjian <bruce@momjian.us>  http://momjian.us
EnterpriseDB  http://enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
On Oct 10, 2008, at 7:41 PM, Nikolas Everett wrote:

>> In any case your experience doesn't match mine. On a machine with a sizable
>> raid controller setting random_page_cost higher does generate, as expected,
>> plans with more bitmap heap scans which are in fact faster.
>
> We're running Postgres backed by a NetApp 3020 via fiber and have had a lot
> of success setting random_page_cost very high (10). Sequential reads are
> just that much faster. I'm not sure if that's because we've configured
> something wrong or what, but that's a really useful knob for us.

Is your workload OLTP or OLAP? Promoting seqscans in an OLTP environment seems to be a really bad idea to me...

--
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828