Some notes on optimizer cost estimates

I have been spending some time measuring actual runtimes for various
sequential-scan and index-scan query plans, and have learned that the
current Postgres optimizer's cost estimation equations are not very
close to reality at all.

Presently we estimate the cost of a sequential scan as
Nblocks + CPU_PAGE_WEIGHT * Ntuples

--- that is, the unit of cost is the time to read one disk page,
and we have a "fudge factor" that relates CPU time per tuple to
disk time per page.  (The default CPU_PAGE_WEIGHT is 0.033, which
is probably too high for modern hardware --- 0.01 seems like it
might be a better default, at least for simple queries.)  OK,
it's a simplistic model, but not too unreasonable so far.
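
To make the arithmetic concrete, here is a toy C sketch of that formula
(the function name and the sample numbers are mine, not the actual
optimizer code):

    #include <stdio.h>

    #define CPU_PAGE_WEIGHT 0.033       /* current default fudge factor */

    /* one cost unit == time to read one page in a sequential scan */
    static double
    cost_seqscan_sketch(double nblocks, double ntuples)
    {
        return nblocks + CPU_PAGE_WEIGHT * ntuples;
    }

    int
    main(void)
    {
        /* hypothetical 10000-page table holding 1 million tuples */
        printf("%g\n", cost_seqscan_sketch(10000.0, 1000000.0)); /* 43000 */
        return 0;
    }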

The cost of an index scan is measured in these same terms as
Nblocks + CPU_PAGE_WEIGHT * Ntuples + CPU_INDEX_PAGE_WEIGHT * Nindextuples

Here Ntuples is the number of tuples selected by the index qual
condition (typically, it's less than the total table size used in
sequential-scan estimation).  CPU_INDEX_PAGE_WEIGHT essentially
estimates the cost of scanning an index tuple; by default it's 0.017 or
half CPU_PAGE_WEIGHT.  Nblocks is estimated as the index size plus an
appropriate fraction of the main table size.
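
The matching C sketch for the index-scan estimate (again, my names and
sample numbers, not the real code):

    #include <stdio.h>

    #define CPU_PAGE_WEIGHT        0.033
    #define CPU_INDEX_PAGE_WEIGHT  0.017   /* about half of CPU_PAGE_WEIGHT */

    static double
    cost_indexscan_sketch(double nblocks, double ntuples,
                          double nindextuples)
    {
        /* nblocks = index pages plus an estimated fraction of heap pages */
        return nblocks
            + CPU_PAGE_WEIGHT * ntuples
            + CPU_INDEX_PAGE_WEIGHT * nindextuples;
    }

    int
    main(void)
    {
        /* hypothetical: 1300 pages touched, 5000 tuples via 5000 entries */
        printf("%g\n", cost_indexscan_sketch(1300.0, 5000.0, 5000.0));
        return 0;   /* prints 1550 */
    }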

There are two big problems with this:

1. Since main-table tuples are visited in index order, we'll be hopping
around from page to page in the table.  The current cost estimation
method essentially assumes that the buffer cache plus the OS disk cache
will be 100% efficient: we will never have to read the same page of the
main table twice in a scan, because no page is ever assumed to have been
discarded between references.  This is, of course, unreasonably
optimistic.  The worst case is that we'd fetch a main-table page for each
selected tuple, but in most cases that would be unreasonably pessimistic.
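
One textbook way to interpolate between those two extremes is Cardenas'
approximation for the expected number of distinct pages touched when k
tuples are drawn at random from an N-page table.  This is not what the
current code does; it's only an illustration of the shape a better model
might take:

    #include <math.h>
    #include <stdio.h>

    /* expected distinct pages touched if each of ntuples_selected tuples
       lands independently on one of npages pages (Cardenas' formula) */
    static double
    expected_page_fetches(double npages, double ntuples_selected)
    {
        if (npages <= 0.0)
            return 0.0;
        return npages * (1.0 - pow(1.0 - 1.0 / npages, ntuples_selected));
    }

    int
    main(void)
    {
        /* 10000-page table, 5000 selected tuples: roughly 3900 pages,
           between the best case (one read per page) and the worst case
           (one read per selected tuple) */
        printf("%g\n", expected_page_fetches(10000.0, 5000.0));
        return 0;
    }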

2. The cost of a disk page fetch is estimated at 1.0 unit for both
sequential and index scans.  In reality, sequential access is *much*
cheaper than the quasi-random accesses performed by an index scan.
This is partly a matter of physical disk seeks, and partly a matter
of benefiting (or not) from any read-ahead logic the OS may employ.

As best I can measure on my hardware, the cost of a nonsequential
disk read should be estimated at 4 to 5 times the cost of a sequential
one --- I'm getting numbers like 2.2 msec per disk page for sequential
scans, and as much as 11 msec per page for index scans.  I don't
know, however, if this ratio is similar enough on other platforms
to be useful for cost estimating.  We could make it a parameter like
we do for CPU_PAGE_WEIGHT ... but you know and I know that no one
ever bothers to adjust those numbers in the field ...
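
If we did add such a knob, folding it into the index-scan formula could
look something like the sketch below (RANDOM_PAGE_COST is my name for the
hypothetical parameter, set from the 4-to-5 ratio measured above; whether
index pages should also be charged at the random rate is an open choice):

    #include <stdio.h>

    #define CPU_PAGE_WEIGHT        0.033
    #define CPU_INDEX_PAGE_WEIGHT  0.017
    #define RANDOM_PAGE_COST       4.0  /* nonsequential read ~ 4x sequential */

    static double
    cost_indexscan_with_ratio(double nblocks, double ntuples,
                              double nindextuples)
    {
        /* charge every page fetched by the index scan at the random rate */
        return RANDOM_PAGE_COST * nblocks
            + CPU_PAGE_WEIGHT * ntuples
            + CPU_INDEX_PAGE_WEIGHT * nindextuples;
    }

    int
    main(void)
    {
        printf("%g\n", cost_indexscan_with_ratio(1300.0, 5000.0, 5000.0));
        return 0;   /* prints 5450, versus 1550 with no random penalty */
    }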

The other effect that needs to be modeled, and currently is not, is the
"hit rate" of buffer cache.  Presumably, this is 100% for tables smaller
than the cache and drops off as the table size increases --- but I have
no particular thoughts on the form of the dependency.  Does anyone have
ideas here?  The problem is complicated by the fact that we don't really
know how big the cache is; we know the number of buffers Postgres has,
but we have no idea how big a disk cache the kernel is keeping.  As near
as I can tell, finding a hit in the kernel disk cache is not a lot more
expensive than having the page sitting in Postgres' own buffers ---
certainly it's much much cheaper than a disk read.
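
For concreteness, one crude strawman for the hit-rate dependency: assume
an "effective cache size" (Postgres buffers plus a guess at the kernel
cache) and take the hit rate to be the fraction of the table that fits in
it.  I'm not claiming this is the right curve, only that something of
this shape is what's missing:

    #include <stdio.h>

    static double
    guessed_cache_hit_rate(double table_pages, double effective_cache_pages)
    {
        if (table_pages <= effective_cache_pages)
            return 1.0;   /* whole table fits: assume repeat reads are free */
        /* otherwise fall off as the table outgrows the cache */
        return effective_cache_pages / table_pages;
    }

    int
    main(void)
    {
        /* hypothetical: 8000 cache pages vs. a 20000-page table => 0.4 */
        printf("%g\n", guessed_cache_hit_rate(20000.0, 8000.0));
        return 0;
    }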

BTW, if you want to do some measurements of your own, try setting
PGOPTIONS="-d 2 -te".  This will dump a lot of interesting numbers
into the postmaster log, if your platform supports getrusage().
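For example (assuming the usual libpq pass-through of PGOPTIONS),
launching the client with something like PGOPTIONS="-d 2 -te" psql mydb
should get the per-query rusage statistics written to the log.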
        regards, tom lane

