Thread: Some notes on optimizer cost estimates
I have been spending some time measuring actual runtimes for various
sequential-scan and index-scan query plans, and have learned that the
current Postgres optimizer's cost estimation equations are not very
close to reality at all.

Presently we estimate the cost of a sequential scan as

	Nblocks + CPU_PAGE_WEIGHT * Ntuples

--- that is, the unit of cost is the time to read one disk page, and we
have a "fudge factor" that relates CPU time per tuple to disk time per
page.  (The default CPU_PAGE_WEIGHT is 0.033, which is probably too high
for modern hardware --- 0.01 seems like it might be a better default, at
least for simple queries.)  OK, it's a simplistic model, but not too
unreasonable so far.

The cost of an index scan is measured in these same terms as

	Nblocks + CPU_PAGE_WEIGHT * Ntuples + CPU_INDEX_PAGE_WEIGHT * Nindextuples

Here Ntuples is the number of tuples selected by the index qual
condition (typically, it's less than the total table size used in
sequential-scan estimation).  CPU_INDEX_PAGE_WEIGHT essentially
estimates the cost of scanning an index tuple; by default it's 0.017, or
half CPU_PAGE_WEIGHT.  Nblocks is estimated as the index size plus an
appropriate fraction of the main table size.

There are two big problems with this:

1. Since main-table tuples are visited in index order, we'll be hopping
around from page to page in the table.  The current cost estimation
method essentially assumes that the buffer cache plus OS disk cache will
be 100% efficient --- we will never have to read the same page of the
main table twice in a scan, due to having discarded it between
references.  This of course is unreasonably optimistic.  Worst case is
that we'd fetch a main-table page for each selected tuple, but in most
cases that'd be unreasonably pessimistic.

2. The cost of a disk page fetch is estimated at 1.0 unit for both
sequential and index scans.  In reality, sequential access is *much*
cheaper than the quasi-random accesses performed by an index scan.
This is partly a matter of physical disk seeks, and partly a matter of
benefitting (or not) from any read-ahead logic the OS may employ.

As best I can measure on my hardware, the cost of a nonsequential disk
read should be estimated at 4 to 5 times the cost of a sequential one
--- I'm getting numbers like 2.2 msec per disk page for sequential
scans, and as much as 11 msec per page for index scans.  I don't know,
however, if this ratio is similar enough on other platforms to be useful
for cost estimating.  We could make it a parameter like we do for
CPU_PAGE_WEIGHT ... but you know and I know that no one ever bothers to
adjust those numbers in the field ...

The other effect that needs to be modeled, and currently is not, is the
"hit rate" of the buffer cache.  Presumably, this is 100% for tables
smaller than the cache and drops off as the table size increases --- but
I have no particular thoughts on the form of the dependency.  Does
anyone have ideas here?  The problem is complicated by the fact that we
don't really know how big the cache is; we know the number of buffers
Postgres has, but we have no idea how big a disk cache the kernel is
keeping.  As near as I can tell, finding a hit in the kernel disk cache
is not a lot more expensive than having the page sitting in Postgres'
own buffers --- certainly it's much, much cheaper than a disk read.

BTW, if you want to do some measurements of your own, try turning on
PGOPTIONS="-d 2 -te".  This will dump a lot of interesting numbers into
the postmaster log, if your platform supports getrusage().

			regards, tom lane
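For concreteness, here is a rough standalone sketch of the arithmetic
being described --- the existing equations plus the two proposed
corrections, a nonsequential-read multiplier and a cache hit-rate term.
The parameter values, the hit-rate interpolation, and all of the names
are illustrative assumptions, not the actual optimizer code:

#include <stdio.h>

/* Illustrative parameters --- not the real optimizer symbols. */
#define CPU_PAGE_WEIGHT        0.033   /* CPU cost per heap tuple, in page-read units */
#define CPU_INDEX_PAGE_WEIGHT  0.017   /* CPU cost per index tuple */
#define NONSEQ_PAGE_COST       4.5     /* one nonsequential read ~ 4-5 sequential ones */

/* Crude placeholder for the open question: assume a 100% hit rate while
 * the table fits in cache, falling off as cache/table beyond that. */
static double cache_hit_rate(double table_pages, double cache_pages)
{
    return (table_pages <= cache_pages) ? 1.0 : cache_pages / table_pages;
}

/* Sequential scan: every page read once, plus per-tuple CPU cost. */
static double cost_seqscan(double table_pages, double table_tuples)
{
    return table_pages + CPU_PAGE_WEIGHT * table_tuples;
}

/* Index scan: index pages, plus one heap page fetch per selected tuple
 * (the pessimistic end of problem #1), charged at the nonsequential rate
 * and discounted by the assumed cache hit rate, plus per-tuple CPU costs. */
static double cost_indexscan(double index_pages, double index_tuples,
                             double sel_tuples, double table_pages,
                             double cache_pages)
{
    double miss = 1.0 - cache_hit_rate(table_pages, cache_pages);
    double heap_io = sel_tuples * miss * NONSEQ_PAGE_COST;

    return index_pages + heap_io
           + CPU_PAGE_WEIGHT * sel_tuples
           + CPU_INDEX_PAGE_WEIGHT * index_tuples;
}

int main(void)
{
    /* A made-up 1-gigabyte table: 131072 pages of 8K, 10 tuples per page,
     * an index scan selecting 1% of it, and 64 MB of presumed cache. */
    double table_pages = 131072, table_tuples = 1310720;
    double sel = table_tuples * 0.01;

    printf("seqscan cost:   %.0f\n", cost_seqscan(table_pages, table_tuples));
    printf("indexscan cost: %.0f\n",
           cost_indexscan(1024, sel, sel, table_pages, 8192));
    return 0;
}

Fed a made-up gigabyte-sized table, the sketch mostly shows where the
two new terms would enter the equations; the interesting open question
is still what shape cache_hit_rate() ought to have.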
Tom Lane wrote:

> As best I can measure on my hardware, the cost of a nonsequential
> disk read should be estimated at 4 to 5 times the cost of a sequential
> one --- I'm getting numbers like 2.2 msec per disk page for sequential
> scans, and as much as 11 msec per page for index scans.  I don't
> know, however, if this ratio is similar enough on other platforms
> to be useful for cost estimating.  We could make it a parameter like
> we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
> ever bothers to adjust those numbers in the field ...

Would it be possible to place those parameters as run-time settings and
then write a utility that can ship with the distribution to determine
those values?  Kind of a self-tuning utility?

Mike Mascari
Mike Mascari <mascarm@mascari.com> writes:
>> to be useful for cost estimating.  We could make it a parameter like
>> we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
>> ever bothers to adjust those numbers in the field ...
> Would it be possible to place those parameters as run-time
> settings and then write a utility that can ship with the
> distribution to determine those values?  Kind of a self-tuning
> utility?

Maybe.  I'm not sure the average user would want to run it --- to get
believable numbers, you have to be using a table considerably bigger
than the kernel's disk cache, which means it takes a while.  (I've been
testing with a gigabyte-sized table ... one of the index scan runs took
thirty hours :-( ... fortunately I have this machine to myself, or there
would have been some howls about the load.)  But it'd be nice to have
comparable numbers for different platforms.  What I was really hoping
was that someone on the list would be aware of existing research I could
borrow from.

			regards, tom lane
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Tom Lane
>
> I have been spending some time measuring actual runtimes for various
> sequential-scan and index-scan query plans, and have learned that the
> current Postgres optimizer's cost estimation equations are not very
> close to reality at all.

Thanks for your good analysis.  I have also said that the current cost
estimation for index scans is too low, but I have had no concrete
numerical values.

I've wondered why we couldn't analyze a database without vacuum.  We
couldn't run vacuum light-heartedly because it acquires an exclusive
lock for the target table.  In addition, vacuum errors occur with the
analyze option in most cases AFAIK.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
"Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > I've wondered why we cound't analyze database without vacuum. > We couldn't run vacuum light-heartedly because it acquires an > exclusive lock for the target table. There is probably no real good reason, except backwards compatibility, why the ANALYZE function (obtaining pg_statistic data) is part of VACUUM at all --- it could just as easily be a separate command that would only use read access on the database. Bruce is thinking about restructuring VACUUM, so maybe now is a good time to think about splitting out the ANALYZE code too. > In addition,vacuum error occurs with analyze option in most > cases AFAIK. Still, with current sources? What's the error message? I fixed a problem with pg_statistic tuples getting too big... regards, tom lane
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes: > > I've wondered why we cound't analyze database without vacuum. > > We couldn't run vacuum light-heartedly because it acquires an > > exclusive lock for the target table. > > There is probably no real good reason, except backwards compatibility, > why the ANALYZE function (obtaining pg_statistic data) is part of > VACUUM at all --- it could just as easily be a separate command that > would only use read access on the database. Bruce is thinking about > restructuring VACUUM, so maybe now is a good time to think about > splitting out the ANALYZE code too. I put it in vacuum because at the time I didn't know how to do such things and vacuum already scanned the table. I just linked on the the scan. Seemed like a good idea at the time. It is nice that ANALYZE is done during vacuum. I can't imagine why you would want to do an analyze without adding a vacuum to it. I guess that's why I made them the same command. If I made them separate commands, both would have to scan the table, though the analyze could do it without the exclusive lock, which would be good. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
>
> > In addition, vacuum errors occur with the analyze option in most
> > cases AFAIK.
>
> Still, with current sources?  What's the error message?  I fixed
> a problem with pg_statistic tuples getting too big...

Sorry, my English is poor.  When I have seen vacuum bug reports, they
have mostly involved the 'analyze' option.  'Analyze' is harmful for the
safety of vacuum.  I'm thinking that 'analyze' is not preferable,
especially for index recreation in vacuum.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
> -----Original Message-----
> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
>
> > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > > I've wondered why we couldn't analyze a database without vacuum.
> > > We couldn't run vacuum light-heartedly because it acquires an
> > > exclusive lock for the target table.
> >
> > There is probably no real good reason, except backwards compatibility,
> > why the ANALYZE function (obtaining pg_statistic data) is part of
> > VACUUM at all --- it could just as easily be a separate command that
> > would only use read access on the database.  Bruce is thinking about
> > restructuring VACUUM, so maybe now is a good time to think about
> > splitting out the ANALYZE code too.
>
> I put it in vacuum because at the time I didn't know how to do such
> things and vacuum already scanned the table.  I just linked onto the
> scan.  Seemed like a good idea at the time.
>
> It is nice that ANALYZE is done during vacuum.  I can't imagine why you
> would want to do an analyze without adding a vacuum to it.  I guess
> that's why I made them the same command.
>
> If I made them separate commands, both would have to scan the table,
> though the analyze could do it without the exclusive lock, which would
> be good.

The functionality of VACUUM and ANALYZE is quite different.  I would
prefer not to charge VACUUM with more analysis work than it does now ---
probably a looong lock, and more aborts ....  Various kinds of analysis
would become possible by splitting out ANALYZE.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> It is nice that ANALYZE is done during vacuum.  I can't imagine why you
> would want to do an analyze without adding a vacuum to it.  I guess
> that's why I made them the same command.

Well, the main bad thing about ANALYZE being part of VACUUM is that it
adds to the length of time that VACUUM is holding an exclusive lock on
the table.  I think it'd make more sense for it to be a separate
command.

I have also been thinking about how to make ANALYZE produce a more
reliable estimate of the most common value.  The three-element list that
it keeps now is a good low-cost hack, but it really doesn't produce a
trustworthy answer unless the MCV is pretty darn common (since it will
never pick up on the MCV at all until there are at least two occurrences
in three adjacent tuples).  The only idea I've come up with is to use a
larger list, which would be slower and take more memory.  I think that'd
be OK in a separate command, but I hesitate to do it inside VACUUM ---
VACUUM has its own considerable memory requirements, and there's still
the issue of not holding down an exclusive lock longer than you have to.

			regards, tom lane
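For what it's worth, one possible shape for "a larger list" is a
K-candidate counter scheme in the Misra-Gries style: a matching value
bumps its counter, a non-matching value either takes over an empty slot
or decays every counter, so memory stays bounded at K slots no matter
how large the table is.  The following is only a sketch with made-up
names and integer data, not the existing pg_statistic code:

#include <stdio.h>

/* Keep K candidate values with counters; when a non-candidate arrives
 * and no slot is free, decrement every counter (Misra-Gries style). */
#define K 10

typedef struct { int value; int count; } Candidate;

static Candidate cand[K];

static void scan_value(int v)
{
    int i, minpos = 0;

    for (i = 0; i < K; i++)
    {
        if (cand[i].count > 0 && cand[i].value == v)
        {
            cand[i].count++;            /* existing candidate: bump it */
            return;
        }
        if (cand[i].count < cand[minpos].count)
            minpos = i;                 /* remember the weakest slot */
    }
    if (cand[minpos].count == 0)
    {
        cand[minpos].value = v;         /* take over an empty slot */
        cand[minpos].count = 1;
    }
    else
    {
        for (i = 0; i < K; i++)         /* no room: decay everyone */
            cand[i].count--;
    }
}

int main(void)
{
    /* A stream where 7 is the most common value but never occurs twice
     * within any three adjacent tuples --- the case the current
     * three-element list misses. */
    int data[] = {1, 2, 7, 3, 4, 7, 5, 6, 7, 8, 9, 7, 1, 3, 7, 2, 4, 7};
    int i, best = 0;

    for (i = 0; i < (int) (sizeof(data) / sizeof(data[0])); i++)
        scan_value(data[i]);

    for (i = 1; i < K; i++)
        if (cand[i].count > cand[best].count)
            best = i;
    printf("candidate MCV: %d (counter %d)\n",
           cand[best].value, cand[best].count);
    return 0;
}

The sample stream is exactly the adjacency case mentioned above: the
true most common value never shows up twice in a row, yet it still ends
up holding the dominant counter, and the memory cost is fixed at K slots
regardless of table size.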
Then <tgl@sss.pgh.pa.us> spoke up and said:
> As best I can measure on my hardware, the cost of a nonsequential
> disk read should be estimated at 4 to 5 times the cost of a sequential
> one --- I'm getting numbers like 2.2 msec per disk page for sequential
> scans, and as much as 11 msec per page for index scans.  I don't
> know, however, if this ratio is similar enough on other platforms
> to be useful for cost estimating.  We could make it a parameter like
> we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
> ever bothers to adjust those numbers in the field ...

Here's a thought: there are tools (bonnie, ioscan) whose job is
determining details of disk performance.  Do we want to look at creating
a small tool/script of our own that would (optionally) determine the
correct parameters for the system it is installed on and update the
appropriate parameters?

--
=====================================================================
| JAVA must have been developed in the wilds of West Virginia.      |
| After all, why else would it support only single inheritance??    |
=====================================================================
| Finger geek@cmu.edu for my public key.                            |
=====================================================================
On Thu, 20 Jan 2000, Bruce Momjian wrote:

> > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > > I've wondered why we couldn't analyze a database without vacuum.
> > > We couldn't run vacuum light-heartedly because it acquires an
> > > exclusive lock for the target table.
> >
> > There is probably no real good reason, except backwards compatibility,
> > why the ANALYZE function (obtaining pg_statistic data) is part of
> > VACUUM at all --- it could just as easily be a separate command that
> > would only use read access on the database.  Bruce is thinking about
> > restructuring VACUUM, so maybe now is a good time to think about
> > splitting out the ANALYZE code too.
>
> I put it in vacuum because at the time I didn't know how to do such
> things and vacuum already scanned the table.  I just linked onto the
> scan.  Seemed like a good idea at the time.
>
> It is nice that ANALYZE is done during vacuum.  I can't imagine why you
> would want to do an analyze without adding a vacuum to it.  I guess
> that's why I made them the same command.

Hrmmm... how about making ANALYZE a separate function, while a VACUUM
does an ANALYZE implicitly?  Then again, whatever happened with the work
that was being done to make VACUUM either non-locking *or*, at least,
lock only the table being vacuum'd?

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
>to be useful for cost estimating.  We could make it a parameter like
>we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
>ever bothers to adjust those numbers in the field ...

Apologies if this is addressed later in this thread.

Couldn't we test some of these parameters inside configure and set them
there?

Signature failed Preliminary Design Review.
Feasibility of a new signature is currently being evaluated.
h.b.hotz@jpl.nasa.gov, or hbhotz@oxy.edu
> >to be useful for cost estimating.  We could make it a parameter like
> >we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
> >ever bothers to adjust those numbers in the field ...
>
> Apologies if this is addressed later in this thread.
>
> Couldn't we test some of these parameters inside configure and set them
> there?

That would be cool.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
"Henry B. Hotz" <hotz@jpl.nasa.gov> writes: >> to be useful for cost estimating. We could make it a parameter like >> we do for CPU_PAGE_WEIGHT ... but you know and I know that no one >> ever bothers to adjust those numbers in the field ... > Couldn't we test some of these parameters inside configure and set > them there? If we could figure out a reasonably cheap way of estimating these numbers, it'd be worth setting up custom values at installation time. I don't know how to do that --- AFAICS, getting trustworthy numbers by measurement would require hundreds of meg of temporary disk space and probably hours of runtime. (A smaller test would be completely corrupted by kernel disk caching effects.) But perhaps someone has an idea how to do it. regards, tom lane
At 9:25 AM -0800 1/24/00, Tom Lane wrote:
>"Henry B. Hotz" <hotz@jpl.nasa.gov> writes:
>>> to be useful for cost estimating.  We could make it a parameter like
>>> we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
>>> ever bothers to adjust those numbers in the field ...
>
>> Couldn't we test some of these parameters inside configure and set
>> them there?
>
>If we could figure out a reasonably cheap way of estimating these
>numbers, it'd be worth setting up custom values at installation time.
>
>I don't know how to do that --- AFAICS, getting trustworthy numbers by
>measurement would require hundreds of meg of temporary disk space and
>probably hours of runtime.  (A smaller test would be completely
>corrupted by kernel disk caching effects.)

Getting a rough estimate of CPU speed is trivial.  Getting a rough
estimate of sequential disk access shouldn't be too hard, though you
would need to make sure it didn't give the wrong answer if you ran
configure twice in a row or something.  Getting a rough estimate of disk
access for a single non-sequential disk page also shouldn't be too hard,
with the same caveats.  Measuring sequential vs. random reads probably
takes a large dataset, as you say.

I suspect that there is a range of important parameters and we can only
easily measure some of them.  If we do so and use canned values (ratios,
where possible) for the others, then we're probably still ahead.  I'll
leave the details for the people who actually have time to do some of
this stuff.

Signature failed Preliminary Design Review.
Feasibility of a new signature is currently being evaluated.
h.b.hotz@jpl.nasa.gov, or hbhotz@oxy.edu
Then <hotz@jpl.nasa.gov> spoke up and said:
> >I don't know how to do that --- AFAICS, getting trustworthy numbers by
> >measurement would require hundreds of meg of temporary disk space and
> >probably hours of runtime.  (A smaller test would be completely
> >corrupted by kernel disk caching effects.)
>
> Getting a rough estimate of CPU speed is trivial.  Getting a rough estimate
> of sequential disk access shouldn't be too hard, though you would need to
> make sure it didn't give the wrong answer if you ran configure twice in a
> row or something.  Getting a rough estimate of disk access for a single
> non-sequential disk page also shouldn't be too hard, with the same caveats.
> Measuring sequential vs. random reads probably takes a large dataset, as you
> say.

A point for consideration: this need not be a configure test.  Any
commercial database usage carries with it the expectation of a
non-trivial effort at tuning.  This being the case, it might make sense
to bring in some foresight here.  As PostgreSQL matures, people are
going to be using it on non-homogeneous systems (e.g. a mixture of 3600,
7200, and 10k rpm disks).  Our cost estimates should therefore vary
somewhat as tables start living on different disks (yet another reason
why symlinks are not the answer).

Right now, I would kill to have a tool that I could run over a couple of
hours and many gigabytes of disk space that would give me indications of
how to tune my Oracle database.  We may want to think about, in the
future, adding the ability to tune specific tables by keeping query
statistics and analyzing them.  Even post-processing debug output would
help, although turning debugging on adds some non-trivial overhead.

--
=====================================================================
| JAVA must have been developed in the wilds of West Virginia.      |
| After all, why else would it support only single inheritance??    |
=====================================================================
| Finger geek@cmu.edu for my public key.                            |
=====================================================================
"Henry B. Hotz" <hotz@jpl.nasa.gov> writes: >> I don't know how to do that --- AFAICS, getting trustworthy numbers by >> measurement would require hundreds of meg of temporary disk space and >> probably hours of runtime. (A smaller test would be completely >> corrupted by kernel disk caching effects.) > Getting a rough estimate of CPU speed is trivial. Getting a rough estimate > of sequential disk access shouldn't be too hard, though you would need to > make sure it didn't give the wrong answer if you ran configure twice in a > row or something. Getting a rough estimate of disk access for a single > non-sequential disk page also shouldn't be too hard with the same caviats. In practice this would be happening at initdb time, not configure time, since it'd be a lot easier to do it in C code than in a shell script. But that's a detail. I'm still not clear on how you can wave away the issue of kernel disk caching --- if you don't use a test file that's larger than the disk cache, ISTM you risk getting a number that's entirely devoid of any physical I/O at all. regards, tom lane
At 01:13 PM 1/24/00 -0500, Tom Lane wrote:

>In practice this would be happening at initdb time, not configure time,
>since it'd be a lot easier to do it in C code than in a shell script.
>But that's a detail.  I'm still not clear on how you can wave away the
>issue of kernel disk caching --- if you don't use a test file that's
>larger than the disk cache, ISTM you risk getting a number that's
>entirely devoid of any physical I/O at all.

And even the $100 6.4 GB Ultra DMA drive I bought last week has 2 MB of
cache.  hdparm shows me getting 19 MB/second transfers even though it
adjusts for the file system cache.  It's only a 5400 RPM disk, and I'm
certain the on-disk cache is impacting this number.

- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest
  Rare Bird Alert Service and other goodies at
  http://donb.photo.net.
Don Baccus wrote:
>
> At 01:13 PM 1/24/00 -0500, Tom Lane wrote:
>
> >In practice this would be happening at initdb time, not configure time,

or perhaps it can be collected when running the regression tests.

> >since it'd be a lot easier to do it in C code than in a shell script.
> >But that's a detail.  I'm still not clear on how you can wave away the
> >issue of kernel disk caching --- if you don't use a test file that's
> >larger than the disk cache, ISTM you risk getting a number that's
> >entirely devoid of any physical I/O at all.
>
> And even the $100 6.4 GB Ultra DMA drive I bought last week has 2 MB of
> cache.  hdparm shows me getting 19 MB/second transfers even though it
> adjusts for the file system cache.  It's only a 5400 RPM disk, and I'm
> certain the on-disk cache is impacting this number.

But they also claim internal bit rates of more than 250 Mbit/s, which
stay the same for both 5400 and 7200 RPM disks (the latter even have the
same seek time), so that may actually be true for sequential reads if
they have all their read paths at optimal efficiency and readahead and
cache do their job.

-------------
Hannu
On 2000-01-24, Henry B. Hotz mentioned:

> >to be useful for cost estimating.  We could make it a parameter like
> >we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
> >ever bothers to adjust those numbers in the field ...
> Couldn't we test some of these parameters inside configure and set them there?

Nope.  configure is for testing the build environment, not the runtime
environment.  But the same thing could be achieved with some
runtime-analyzing program that writes its results into some sort of
configuration file.

--
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden
At 3:16 PM -0800 1/25/00, Peter Eisentraut wrote:
>On 2000-01-24, Henry B. Hotz mentioned:
>> Couldn't we test some of these parameters inside configure and set them
>>there?
>
>Nope.  configure is for testing the build environment, not the runtime
>environment.  But the same thing could be achieved with some
>runtime-analyzing program that writes its results into some sort of
>configuration file.

And then we could make configure run the program for us, and we could
automatically put the results into some header file. . .

In other words, I'm not that much of a purist.  As Tom Lane said, this
is not an important point.  What's important is how much we can
determine, and how easily.  It may actually make more sense to do this
tuning after installation.  It only affects optimizer choices in
situations that aren't obvious.

Signature failed Preliminary Design Review.
Feasibility of a new signature is currently being evaluated.
h.b.hotz@jpl.nasa.gov, or hbhotz@oxy.edu