Thread: Some notes on optimizer cost estimates

Some notes on optimizer cost estimates

From: Tom Lane

I have been spending some time measuring actual runtimes for various
sequential-scan and index-scan query plans, and have learned that the
current Postgres optimizer's cost estimation equations are not very
close to reality at all.

Presently we estimate the cost of a sequential scan as
Nblocks + CPU_PAGE_WEIGHT * Ntuples

--- that is, the unit of cost is the time to read one disk page,
and we have a "fudge factor" that relates CPU time per tuple to
disk time per page.  (The default CPU_PAGE_WEIGHT is 0.033, which
is probably too high for modern hardware --- 0.01 seems like it
might be a better default, at least for simple queries.)  OK,
it's a simplistic model, but not too unreasonable so far.

The cost of an index scan is measured in these same terms as
Nblocks + CPU_PAGE_WEIGHT * Ntuples +  CPU_INDEX_PAGE_WEIGHT * Nindextuples

Here Ntuples is the number of tuples selected by the index qual
condition (typically, it's less than the total table size used in
sequential-scan estimation).  CPU_INDEX_PAGE_WEIGHT essentially
estimates the cost of scanning an index tuple; by default it's 0.017 or
half CPU_PAGE_WEIGHT.  Nblocks is estimated as the index size plus an
appropriate fraction of the main table size.
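
For concreteness, the two estimates amount to something like the
following --- just a sketch in C with the default weights plugged in,
not the actual optimizer code:

    #define CPU_PAGE_WEIGHT       0.033
    #define CPU_INDEX_PAGE_WEIGHT 0.017

    /* Cost of a sequential scan, in units of one disk page read. */
    static double
    seqscan_cost(double nblocks, double ntuples)
    {
        return nblocks + CPU_PAGE_WEIGHT * ntuples;
    }

    /*
     * Cost of an index scan.  Here nblocks is the index size plus the
     * expected fraction of the main table touched, and ntuples is the
     * number of tuples passing the index qual.
     */
    static double
    indexscan_cost(double nblocks, double ntuples, double nindextuples)
    {
        return nblocks
            + CPU_PAGE_WEIGHT * ntuples
            + CPU_INDEX_PAGE_WEIGHT * nindextuples;
    }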

There are two big problems with this:

1. Since main-table tuples are visited in index order, we'll be hopping
around from page to page in the table.  The current cost estimation
method essentially assumes that the buffer cache plus OS disk cache will
be 100% efficient --- we will never have to read the same page of the
main table twice in a scan, due to having discarded it between
references.  This of course is unreasonably optimistic.  Worst case
is that we'd fetch a main-table page for each selected tuple, but in
most cases that'd be unreasonably pessimistic.
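
To put numbers on that range, the two bounds work out as in this
little sketch (the table size and selectivity are made-up examples):

    #include <stdio.h>

    /*
     * Optimistic bound (what the current code effectively assumes):
     * no page is read twice, so main-table fetches can't exceed the
     * number of pages in the table.  Pessimistic bound: the cache
     * never helps, so every selected tuple costs its own page fetch.
     */
    int
    main(void)
    {
        double      table_pages = 10000.0;
        double      selected_tuples = 50000.0;
        double      best = (selected_tuples < table_pages) ?
                            selected_tuples : table_pages;
        double      worst = selected_tuples;

        printf("page fetches: best case %.0f, worst case %.0f\n",
               best, worst);
        return 0;
    }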

2. The cost of a disk page fetch is estimated at 1.0 unit for both
sequential and index scans.  In reality, sequential access is *much*
cheaper than the quasi-random accesses performed by an index scan.
This is partly a matter of physical disk seeks, and partly a matter
of benefitting (or not) from any read-ahead logic the OS may employ.

As best I can measure on my hardware, the cost of a nonsequential
disk read should be estimated at 4 to 5 times the cost of a sequential
one --- I'm getting numbers like 2.2 msec per disk page for sequential
scans, and as much as 11 msec per page for index scans.  I don't
know, however, if this ratio is similar enough on other platforms
to be useful for cost estimating.  We could make it a parameter like
we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
ever bothers to adjust those numbers in the field ...
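
(If anyone wants a rough standalone check of that ratio outside
Postgres, something like the sketch below would do.  It is not what I
actually ran --- my numbers come from timing real query plans --- and
the file you point it at must be considerably larger than the kernel's
disk cache, or you will mostly be timing cache hits.)

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <unistd.h>

    #define PAGE_SIZE 8192          /* Postgres disk page */
    #define NREADS    10000

    static double
    msec_between(struct timeval *t0, struct timeval *t1)
    {
        return (t1->tv_sec - t0->tv_sec) * 1000.0 +
               (t1->tv_usec - t0->tv_usec) / 1000.0;
    }

    int
    main(int argc, char **argv)
    {
        char        buf[PAGE_SIZE];
        struct timeval t0, t1;
        off_t       filepages;
        int         fd, i;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
        {
            fprintf(stderr, "usage: %s big-test-file\n", argv[0]);
            return 1;
        }
        filepages = lseek(fd, 0, SEEK_END) / PAGE_SIZE;

        /* sequential pass */
        lseek(fd, 0, SEEK_SET);
        gettimeofday(&t0, NULL);
        for (i = 0; i < NREADS; i++)
            read(fd, buf, PAGE_SIZE);
        gettimeofday(&t1, NULL);
        printf("sequential:    %.2f msec/page\n",
               msec_between(&t0, &t1) / NREADS);

        /* nonsequential pass: seek to a random page before each read */
        gettimeofday(&t0, NULL);
        for (i = 0; i < NREADS; i++)
        {
            lseek(fd, (off_t) (random() % filepages) * PAGE_SIZE, SEEK_SET);
            read(fd, buf, PAGE_SIZE);
        }
        gettimeofday(&t1, NULL);
        printf("nonsequential: %.2f msec/page\n",
               msec_between(&t0, &t1) / NREADS);

        return 0;
    }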

The other effect that needs to be modeled, and currently is not, is the
"hit rate" of buffer cache.  Presumably, this is 100% for tables smaller
than the cache and drops off as the table size increases --- but I have
no particular thoughts on the form of the dependency.  Does anyone have
ideas here?  The problem is complicated by the fact that we don't really
know how big the cache is; we know the number of buffers Postgres has,
but we have no idea how big a disk cache the kernel is keeping.  As near
as I can tell, finding a hit in the kernel disk cache is not a lot more
expensive than having the page sitting in Postgres' own buffers ---
certainly it's much much cheaper than a disk read.
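
Just to have a straw man on the table: the simplest form I can think
of is 100% up to some assumed effective cache size and falling off
inversely with table size beyond that.  The shape, and the idea of a
user-supplied effective cache size, are pure guesswork on my part:

    /*
     * Straw-man cache hit rate: 1.0 while the table fits in the assumed
     * effective cache, then proportional to cache_pages / table_pages.
     * effective_cache_pages would have to be a user-settable parameter,
     * since we have no way to see how big a cache the kernel is keeping.
     */
    static double
    cache_hit_rate(double table_pages, double effective_cache_pages)
    {
        if (table_pages <= effective_cache_pages)
            return 1.0;
        return effective_cache_pages / table_pages;
    }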

BTW, if you want to do some measurements of your own, try turning on
PGOPTIONS="-d 2 -te".  This will dump a lot of interesting numbers
into the postmaster log, if your platform supports getrusage().
        regards, tom lane


Re: [HACKERS] Some notes on optimizer cost estimates

From: Mike Mascari

Tom Lane wrote:

> As best I can measure on my hardware, the cost of a nonsequential
> disk read should be estimated at 4 to 5 times the cost of a sequential
> one --- I'm getting numbers like 2.2 msec per disk page for sequential
> scans, and as much as 11 msec per page for index scans.  I don't
> know, however, if this ratio is similar enough on other platforms
> to be useful for cost estimating.  We could make it a parameter like
> we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
> ever bothers to adjust those numbers in the field ...

Would it be possible to place those parameters as run-time
settings and then write a utility that can ship with the
distribution to determine those values? Kind of a self-tuning
utility? 

Mike Mascari


Re: [HACKERS] Some notes on optimizer cost estimates

From: Tom Lane

Mike Mascari <mascarm@mascari.com> writes:
>> to be useful for cost estimating.  We could make it a parameter like
>> we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
>> ever bothers to adjust those numbers in the field ...

> Would it be possible to place those parameters as run-time
> settings and then write a utility that can ship with the
> distribution to determine those values? Kind of a self-tuning
> utility? 

Maybe.  I'm not sure the average user would want to run it ---
to get believable numbers, you have to be using a table considerably
bigger than the kernel's disk cache, which means it takes a while.
(I've been testing with a gigabyte-sized table ... one of the index
scan runs took thirty hours :-( ... fortunately I have this machine
to myself, or there would have been some howls about the load.)

But it'd be nice to have comparable numbers for different platforms.
What I was really hoping was that someone on the list would be aware
of existing research I could borrow from.
        regards, tom lane


RE: [HACKERS] Some notes on optimizer cost estimates

From: "Hiroshi Inoue"

> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org
> [mailto:owner-pgsql-hackers@postgreSQL.org] On Behalf Of Tom Lane
> 
> I have been spending some time measuring actual runtimes for various
> sequential-scan and index-scan query plans, and have learned that the
> current Postgres optimizer's cost estimation equations are not very
> close to reality at all.
> 

Thanks for your good analysis.

I have also said that the current cost estimation for index scans is
too low, but I had no concrete numbers.

I've wondered why we couldn't analyze a database without vacuum.
We couldn't run vacuum light-heartedly because it acquires an
exclusive lock for the target table. 
In addition, vacuum errors occur with the analyze option in most
cases AFAIK.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


Re: [HACKERS] Some notes on optimizer cost estimates

From: Tom Lane

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> I've wondered why we couldn't analyze a database without vacuum.
> We couldn't run vacuum light-heartedly because it acquires an
> exclusive lock for the target table. 

There is probably no real good reason, except backwards compatibility,
why the ANALYZE function (obtaining pg_statistic data) is part of
VACUUM at all --- it could just as easily be a separate command that
would only use read access on the database.  Bruce is thinking about
restructuring VACUUM, so maybe now is a good time to think about
splitting out the ANALYZE code too.

> In addition, vacuum errors occur with the analyze option in most
> cases AFAIK.

Still, with current sources?  What's the error message?  I fixed
a problem with pg_statistic tuples getting too big...
        regards, tom lane


Re: [HACKERS] Some notes on optimizer cost estimates

From: Bruce Momjian

> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > I've wondered why we couldn't analyze a database without vacuum.
> > We couldn't run vacuum light-heartedly because it acquires an
> > exclusive lock for the target table. 
> 
> There is probably no real good reason, except backwards compatibility,
> why the ANALYZE function (obtaining pg_statistic data) is part of
> VACUUM at all --- it could just as easily be a separate command that
> would only use read access on the database.  Bruce is thinking about
> restructuring VACUUM, so maybe now is a good time to think about
> splitting out the ANALYZE code too.

I put it in vacuum because at the time I didn't know how to do such
things and vacuum already scanned the table.  I just linked onto the
scan.  Seemed like a good idea at the time.

It is nice that ANALYZE is done during vacuum.  I can't imagine why you
would want to do an analyze without adding a vacuum to it.  I guess
that's why I made them the same command.

If I made them separate commands, both would have to scan the table,
though the analyze could do it without the exclusive lock, which would
be good.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


RE: [HACKERS] Some notes on optimizer cost estimates

From: "Hiroshi Inoue"

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> 
> > In addition, vacuum errors occur with the analyze option in most
> > cases AFAIK.
> 
> Still, with current sources?  What's the error message?  I fixed
> a problem with pg_statistic tuples getting too big...
>

Sorry, my English is poor.
When I saw vacuum bug reports, they mostly involved the 'analyze' option.
'Analyze' is harmful to the safety of vacuum.

I'm thinking that 'analyze' is not preferable, especially for index
recreation in vacuum.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp 


RE: [HACKERS] Some notes on optimizer cost estimates

From: "Hiroshi Inoue"

> -----Original Message-----
> From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]
> 
> > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > > I've wondered why we couldn't analyze a database without vacuum.
> > > We couldn't run vacuum light-heartedly because it acquires an
> > > exclusive lock for the target table. 
> > 
> > There is probably no real good reason, except backwards compatibility,
> > why the ANALYZE function (obtaining pg_statistic data) is part of
> > VACUUM at all --- it could just as easily be a separate command that
> > would only use read access on the database.  Bruce is thinking about
> > restructuring VACUUM, so maybe now is a good time to think about
> > splitting out the ANALYZE code too.
> 
> I put it in vacuum because at the time I didn't know how to do such
> things and vacuum already scanned the table.  I just linked onto the
> scan.  Seemed like a good idea at the time.
> 
> It is nice that ANALYZE is done during vacuum.  I can't imagine why you
> would want to do an analyze without adding a vacuum to it.  I guess
> that's why I made them the same command.
> 
> If I made them separate commands, both would have to scan the table,
> though the analyze could do it without the exclusive lock, which would
> be good.
>

The functionality of VACUUM and ANALYZE is quite different.
I would prefer not to charge VACUUM with any more analysis work than
it does now.  Probably a looong lock, more aborts ....
Various kinds of analysis would become possible by splitting out ANALYZE.
Regards.

Hiroshi Inoue
Inoue@tpf.co.jp


Re: [HACKERS] Some notes on optimizer cost estimates

From: Tom Lane

Bruce Momjian <pgman@candle.pha.pa.us> writes:
> It is nice that ANALYZE is done during vacuum.  I can't imagine why you
> would want to do an analyze without adding a vacuum to it.  I guess
> that's why I made them the same command.

Well, the main bad thing about ANALYZE being part of VACUUM is that
it adds to the length of time that VACUUM is holding an exclusive
lock on the table.  I think it'd make more sense for it to be a
separate command.

I have also been thinking about how to make ANALYZE produce a more
reliable estimate of the most common value.  The three-element list
that it keeps now is a good low-cost hack, but it really doesn't
produce a trustworthy answer unless the MCV is pretty darn common (since
it will never pick up on the MCV at all until there are at least
two occurrences in three adjacent tuples).  The only idea I've come
up with is to use a larger list, which would be slower and take
more memory.  I think that'd be OK in a separate command, but I
hesitate to do it inside VACUUM --- VACUUM has its own considerable
memory requirements, and there's still the issue of not holding down
an exclusive lock longer than you have to.
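
To be concrete about "a larger list", what I have in mind is a bounded
list of candidate values, roughly along these lines --- just a
strawman, with an eviction rule I haven't thought hard about, and with
the datum simplified to a string:

    #include <string.h>

    typedef struct
    {
        const char *value;          /* candidate datum (caller keeps it alive) */
        long        count;
    } McvSlot;

    /*
     * Fold one scanned value into the candidate list.  If it is already
     * present, bump its count; if there is a free slot, claim it;
     * otherwise decrement every slot and free any that reach zero, so a
     * genuinely common value eventually displaces the noise.
     */
    static void
    mcv_consider(McvSlot *slots, int nslots, const char *value)
    {
        int         i,
                    freeslot = -1;

        for (i = 0; i < nslots; i++)
        {
            if (slots[i].value && strcmp(slots[i].value, value) == 0)
            {
                slots[i].count++;
                return;
            }
            if (slots[i].value == NULL && freeslot < 0)
                freeslot = i;
        }
        if (freeslot >= 0)
        {
            slots[freeslot].value = value;
            slots[freeslot].count = 1;
            return;
        }
        for (i = 0; i < nslots; i++)
        {
            if (--slots[i].count <= 0)
                slots[i].value = NULL;
        }
    }

The bigger nslots is, the rarer an MCV it can pick up reliably, at the
price of more comparisons per tuple and more memory.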
        regards, tom lane


Re: [HACKERS] Some notes on optimizer cost estimates

From: Brian E Gallew

Then <tgl@sss.pgh.pa.us> spoke up and said:
> As best I can measure on my hardware, the cost of a nonsequential
> disk read should be estimated at 4 to 5 times the cost of a sequential
> one --- I'm getting numbers like 2.2 msec per disk page for sequential
> scans, and as much as 11 msec per page for index scans.  I don't
> know, however, if this ratio is similar enough on other platforms
> to be useful for cost estimating.  We could make it a parameter like
> we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
> ever bothers to adjust those numbers in the field ...

Here's a thought: there are tools (bonnie, ioscan) whose job is
determining details of disk performance.  Do we want to look at
creating a small tool/script of our own that would (optionally)
determine the correct parameters for the system it is installed on and
update the appropriate parameters?

-- 
=====================================================================
| JAVA must have been developed in the wilds of West Virginia.      |
| After all, why else would it support only single inheritance??    |
=====================================================================
| Finger geek@cmu.edu for my public key.                            |
=====================================================================

Re: [HACKERS] Some notes on optimizer cost estimates

From: The Hermit Hacker

On Thu, 20 Jan 2000, Bruce Momjian wrote:

> > "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > > I've wondered why we couldn't analyze a database without vacuum.
> > > We couldn't run vacuum light-heartedly because it acquires an
> > > exclusive lock for the target table. 
> > 
> > There is probably no real good reason, except backwards compatibility,
> > why the ANALYZE function (obtaining pg_statistic data) is part of
> > VACUUM at all --- it could just as easily be a separate command that
> > would only use read access on the database.  Bruce is thinking about
> > restructuring VACUUM, so maybe now is a good time to think about
> > splitting out the ANALYZE code too.
> 
> I put it in vacuum because at the time I didn't know how to do such
> things and vacuum already scanned the table.  I just linked onto the
> scan.  Seemed like a good idea at the time.
> 
> It is nice that ANALYZE is done during vacuum.  I can't imagine why you
> would want to do an analyze without adding a vacuum to it.  I guess
> that's why I made them the same command.

Hrmmm...how about making ANALYZE a separate function, while a VACUUM does
an ANALYZE implicitly?

Then again, whatever happened with the work that was being done to make
VACUUM either non-locking *or*, at least, lock only the table being
vacuum'd?

Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org 
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org 



Re: [HACKERS] Some notes on optimizer cost estimates

From: "Henry B. Hotz"

>to be useful for cost estimating.  We could make it a parameter like
>we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
>ever bothers to adjust those numbers in the field ...

Apologies if this is addressed later in this thread.

Couldn't we test some of these parameters inside configure and set them there?

Signature failed Preliminary Design Review.
Feasibility of a new signature is currently being evaluated.
h.b.hotz@jpl.nasa.gov, or hbhotz@oxy.edu


Re: [HACKERS] Some notes on optimizer cost estimates

From: Bruce Momjian

> >to be useful for cost estimating.  We could make it a parameter like
> >we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
> >ever bothers to adjust those numbers in the field ...
> 
> Apologies if this is addressed later in this thread.
> 
> Couldn't we test some of these parameters inside configure and set them there?

That would be cool.

--
  Bruce Momjian                        |  http://www.op.net/~candle
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: [HACKERS] Some notes on optimizer cost estimates

From: Tom Lane

"Henry B. Hotz" <hotz@jpl.nasa.gov> writes:
>> to be useful for cost estimating.  We could make it a parameter like
>> we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
>> ever bothers to adjust those numbers in the field ...

> Couldn't we test some of these parameters inside configure and set
> them there?

If we could figure out a reasonably cheap way of estimating these
numbers, it'd be worth setting up custom values at installation time.

I don't know how to do that --- AFAICS, getting trustworthy numbers by
measurement would require hundreds of meg of temporary disk space and
probably hours of runtime.  (A smaller test would be completely
corrupted by kernel disk caching effects.)

But perhaps someone has an idea how to do it.
        regards, tom lane


Re: [HACKERS] Some notes on optimizer cost estimates

From: "Henry B. Hotz"

At 9:25 AM -0800 1/24/00, Tom Lane wrote:
>"Henry B. Hotz" <hotz@jpl.nasa.gov> writes:
>>> to be useful for cost estimating.  We could make it a parameter like
>>> we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
>>> ever bothers to adjust those numbers in the field ...
>
>> Couldn't we test some of these parameters inside configure and set
>> them there?
>
>If we could figure out a reasonably cheap way of estimating these
>numbers, it'd be worth setting up custom values at installation time.
>
>I don't know how to do that --- AFAICS, getting trustworthy numbers by
>measurement would require hundreds of meg of temporary disk space and
>probably hours of runtime.  (A smaller test would be completely
>corrupted by kernel disk caching effects.)

Getting a rough estimate of CPU speed is trivial.  Getting a rough estimate
of sequential disk access shouldn't be too hard, though you would need to
make sure it didn't give the wrong answer if you ran configure twice in a
row or something. Getting a rough estimate of disk access for a single
non-sequential disk page also shouldn't be too hard with the same caveats.
Measuring sequential vs. random reads probably takes a large dataset as you
say.

I suspect that there is a range of important parameters and we can only
easily measure some of them.  If we do so and use canned values (ratios,
where possible) for the others then we're probably still ahead.
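
As a sketch of what I mean by the easy part: something like the toy
below gives a very rough CPU-time-per-tuple figure, which could then
be divided by a measured (or canned) sequential page-read time to get
a CPU_PAGE_WEIGHT-style ratio.  The loop body is only a stand-in, of
course, not real executor work, and the 2.2 msec figure is just Tom's
number plugged in as the canned value.

    #include <stdio.h>
    #include <sys/time.h>

    #define NTUPLES 1000000L

    /* Stand-in for per-tuple CPU work; real numbers would have to come
     * from timing actual executor code, not this toy comparison loop. */
    static long
    toy_tuple_pass(const int *vals, long n)
    {
        long        i, matches = 0;

        for (i = 0; i < n; i++)
            if (vals[i] > 42)
                matches++;
        return matches;
    }

    int
    main(void)
    {
        static int  vals[NTUPLES];
        struct timeval t0, t1;
        long        matches;
        double      cpu_msec_per_tuple;
        double      canned_seq_msec_per_page = 2.2;

        gettimeofday(&t0, NULL);
        matches = toy_tuple_pass(vals, NTUPLES);
        gettimeofday(&t1, NULL);

        cpu_msec_per_tuple =
            ((t1.tv_sec - t0.tv_sec) * 1000.0 +
             (t1.tv_usec - t0.tv_usec) / 1000.0) / NTUPLES;

        printf("matches %ld, CPU weight ~ %.4f\n",
               matches, cpu_msec_per_tuple / canned_seq_msec_per_page);
        return 0;
    }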

I'll leave the details for the people who actually have time to do some of
this stuff.

Signature failed Preliminary Design Review.
Feasibility of a new signature is currently being evaluated.
h.b.hotz@jpl.nasa.gov, or hbhotz@oxy.edu


Re: [HACKERS] Some notes on optimizer cost estimates

From: Brian E Gallew

Then <hotz@jpl.nasa.gov> spoke up and said:
> >I don't know how to do that --- AFAICS, getting trustworthy numbers by
> >measurement would require hundreds of meg of temporary disk space and
> >probably hours of runtime.  (A smaller test would be completely
> >corrupted by kernel disk caching effects.)
> 
> Getting a rough estimate of CPU speed is trivial.  Getting a rough estimate
> of sequential disk access shouldn't be too hard, though you would need to
> make sure it didn't give the wrong answer if you ran configure twice in a
> row or something. Getting a rough estimate of disk access for a single
> non-sequential disk page also shouldn't be too hard with the same caveats.
> Measuring sequential vs. random reads probably takes a large dataset as you
> say.

A point for consideration:  this need not be a configure test.  Any
commercial database usage carries with it the expectation of a
non-trivial effort at tuning.  This being the case, it might make
sense to bring in some foresight here.  As PostgreSQL matures, people
are going to be using it on non-homogeneous systems (e.g. a mixture of
3600, 7200, and 10k rpm disks).  Our cost estimates should therefore
vary somewhat as tables start living on different disks (yet another
reason why symlinks are not the answer).

Right now, I would kill to have a tool that I could run over a couple
hours and many gigabytes of disk space that would give me indications
of how to tune my Oracle database.  We may want to think about, in the
future, adding in the ability to tune specific tables by keeping query
statistics and analyzing them.  Even post-processing debug output
would help, although turning debugging on adds some non-trivial
overhead. 

-- 
=====================================================================
| JAVA must have been developed in the wilds of West Virginia.      |
| After all, why else would it support only single inheritance??    |
=====================================================================
| Finger geek@cmu.edu for my public key.                            |
=====================================================================

Re: [HACKERS] Some notes on optimizer cost estimates

From: Tom Lane

"Henry B. Hotz" <hotz@jpl.nasa.gov> writes:
>> I don't know how to do that --- AFAICS, getting trustworthy numbers by
>> measurement would require hundreds of meg of temporary disk space and
>> probably hours of runtime.  (A smaller test would be completely
>> corrupted by kernel disk caching effects.)

> Getting a rough estimate of CPU speed is trivial.  Getting a rough estimate
> of sequential disk access shouldn't be too hard, though you would need to
> make sure it didn't give the wrong answer if you ran configure twice in a
> row or something. Getting a rough estimate of disk access for a single
> non-sequential disk page also shouldn't be too hard with the same caviats.

In practice this would be happening at initdb time, not configure time,
since it'd be a lot easier to do it in C code than in a shell script.
But that's a detail.  I'm still not clear on how you can wave away the
issue of kernel disk caching --- if you don't use a test file that's
larger than the disk cache, ISTM you risk getting a number that's
entirely devoid of any physical I/O at all.
        regards, tom lane


Re: [HACKERS] Some notes on optimizer cost estimates

From: Don Baccus

At 01:13 PM 1/24/00 -0500, Tom Lane wrote:

>In practice this would be happening at initdb time, not configure time,
>since it'd be a lot easier to do it in C code than in a shell script.
>But that's a detail.  I'm still not clear on how you can wave away the
>issue of kernel disk caching --- if you don't use a test file that's
>larger than the disk cache, ISTM you risk getting a number that's
>entirely devoid of any physical I/O at all.

And even the $100 6.4 GB Ultra DMA drive I bought last week has
2MB of cache.  hdparm shows me getting 19 MB/second transfers
even though it adjusts for the file system cache.  It's only a
5400 RPM disk and I'm certain the on-disk cache is impacting
this number.



- Don Baccus, Portland OR <dhogaza@pacifier.com>
  Nature photos, on-line guides, Pacific Northwest Rare Bird Alert Service
  and other goodies at http://donb.photo.net.


Re: [HACKERS] Some notes on optimizer cost estimates

From: Hannu Krosing

Don Baccus wrote:
> 
> At 01:13 PM 1/24/00 -0500, Tom Lane wrote:
> 
> >In practice this would be happening at initdb time, not configure time,

or perhaps it can be collected when running regression tests.

> >since it'd be a lot easier to do it in C code than in a shell script.
> >But that's a detail.  I'm still not clear on how you can wave away the
> >issue of kernel disk caching --- if you don't use a test file that's
> >larger than the disk cache, ISTM you risk getting a number that's
> >entirely devoid of any physical I/O at all.
> 
> And even the $100 6.4 GB Ultra DMA drive I bought last week has
> 2MB of cache.  hdparm shows me getting 19 mB/second transfers
> even though it adjusts for the file system cache.  It's only a
> 5400 RPM disk and I'm certain the on-disk cache is impacting
> this number.

But they also claim internal bitrates of more than 250 Mbits/s, which stay
the same for both 5400 and 7200 RPM disks (the latter even have the same
seek time), so that may actually be true for sequential reads, if they have
all their read paths at optimal efficiency and read-ahead and cache do
their job.

-------------
Hannu


Re: [HACKERS] Some notes on optimizer cost estimates

From: Peter Eisentraut

On 2000-01-24, Henry B. Hotz mentioned:

> >to be useful for cost estimating.  We could make it a parameter like
> >we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
> >ever bothers to adjust those numbers in the field ...

> Couldn't we test some of these parameters inside configure and set them there?

Nope. configure is for testing the build environment, not the runtime
environment. But the same thing could be achieved with some
runtime-analyzing program that writes its results into some sort of
configuration file.
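
For illustration, such a program might end up writing something as
simple as this; the file name and parameter names are invented for the
example, and the values are placeholders for whatever it measures:

    #include <stdio.h>

    int
    main(void)
    {
        /* placeholder values; a real tool would measure these, as
         * discussed elsewhere in this thread */
        double      nonseq_page_weight = 4.5;   /* random vs. sequential read cost */
        double      cpu_page_weight = 0.01;
        FILE       *cf = fopen("pg_cost_params.conf", "w");

        if (cf == NULL)
            return 1;
        fprintf(cf, "NONSEQ_PAGE_WEIGHT = %g\n", nonseq_page_weight);
        fprintf(cf, "CPU_PAGE_WEIGHT = %g\n", cpu_page_weight);
        fclose(cf);
        return 0;
    }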


-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden



Re: [HACKERS] Some notes on optimizer cost estimates

From: "Henry B. Hotz"

At 3:16 PM -0800 1/25/00, Peter Eisentraut wrote:
>On 2000-01-24, Henry B. Hotz mentioned:
>> Couldn't we test some of these parameters inside configure and set them
>>there?
>
>Nope. configure is for testing the build environment, not the runtime
>environment. But the same thing could be achieved with some
>runtime-analyzing program that writes its results into some sort of
>configuration file.

And then we could make configure run the program for us, and we could
automatically put the results into some header file...  In other words
I'm not that much of a purist.

As Tom Lane said this is not an important point.  What's important is how
much can we determine how easily.  It may actually make more sense to do
this tuning after installation.  It only affects optimizer choices in
situations that aren't obvious.

Signature failed Preliminary Design Review.
Feasibility of a new signature is currently being evaluated.
h.b.hotz@jpl.nasa.gov, or hbhotz@oxy.edu