Thread: question about index cost estimates

question about index cost estimates

From: Jeff Hoffmann

i know tom lane is probably sick of me, but i'm trying to figure out how
the cost estimators work for an index scan.  i'm logging a lot of the
calculations that go into the cost estimate for some sample queries to
see what factors are most important in a cost estimate.  it seems to me
that in the queries that i'm running, a vast majority of the cost comes
from reading the tuples from the relation.  i think the cost of
nonsequential access seems reasonable (it's about 2), but the
calculation of pages fetched from the relation seems high.  i'm not sure
what the actual number should be, but some queries have it at 2-4x the number
of pages in the relation, which seemed high, so i started looking at
that.  i don't understand why the function that is used would be a good
model of what is actually happening.  here's the comment & the function:
/*
 * Estimate number of main-table tuples and pages fetched.
 *
 * If the number of tuples is much smaller than the number of pages in
 * the relation, each tuple will cost a separate nonsequential fetch.
 * If it is comparable or larger, then probably we will be able to
 * avoid some fetches.  We use a growth rate of log(#tuples/#pages + 1)
 * --- probably totally bogus, but intuitively it gives the right
 * shape of curve at least.
 */

pages_fetched = ceil(baserel->pages *
                     log(tuples_fetched / baserel->pages + 1.0));
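
just to see the shape of that curve, here's a throwaway program i can
compile to evaluate the formula for a made-up 1000-page table (the
tuples-per-page number is invented, it's only there to turn a
selectivity into a tuple count):

#include <math.h>
#include <stdio.h>

int
main(void)
{
    double  pages = 1000.0;         /* made-up relation size in pages */
    double  tuples_per_page = 50.0; /* also made up */
    double  fracs[] = {0.001, 0.01, 0.05, 0.10, 0.25, 0.50, 1.0};
    int     i;

    for (i = 0; i < 7; i++)
    {
        double  tuples_fetched = fracs[i] * pages * tuples_per_page;
        double  pages_fetched = ceil(pages *
                                     log(tuples_fetched / pages + 1.0));

        printf("select %5.1f%%: tuples=%7.0f  pages_fetched=%6.0f\n",
               fracs[i] * 100.0, tuples_fetched, pages_fetched);
    }
    return 0;
}

even at a few percent selectivity the estimate is already past the 1000
pages actually in the table, which is roughly what i'm seeing in my logs.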

i'm at a total loss to explain how this works.  for all i know, it's
correct and index scans really are that costly.  it just seems to me that
there should be some notion of tuple size figured in to know how many
tuples fit in a page.  can somebody explain it to me?

thanks,

jeff


Re: question about index cost estimates

From: Tom Lane

Jeff Hoffmann <jeff@propertykey.com> writes:
> calculation of pages fetched from the relation seems high.  i'm not sure
> what the actual number should be, but some queries have it at 2-4x the number
> of pages in the relation, which seemed high, so i started looking at
> that.

If the table is bigger than the disk cache, and we are hitting pages
more than once, then we are likely to have to fetch some pages more than
once.  The model is assuming that the tuples we want are scattered more
or less randomly over the whole table.  If #tuples to be fetched
approaches or exceeds #pages in table, then clearly we are going to be
touching some pages more than once because we want more than one tuple
from them.  The $64 question is did the page fall out of buffer cache
between references?  Worst case (if your table vastly exceeds your
available buffer space), practically every tuple fetch will require a
separate physical read, and then the number of page fetches is
essentially the number of tuples returned --- which of course can be
a lot more than the number of pages in the table.
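
To put numbers on that, here's a throwaway sketch (not anything from the
source tree) that prints the two reference points --- the number of pages
in the table, and the number of tuples fetched, which is the worst case
--- next to what the log formula produces:

#include <math.h>
#include <stdio.h>

static void
show(double pages, double tuples_fetched)
{
    double  model = ceil(pages * log(tuples_fetched / pages + 1.0));

    printf("pages=%6.0f  tuples=%6.0f  worst case=%6.0f  log model=%6.0f\n",
           pages, tuples_fetched, tuples_fetched, model);
}

int
main(void)
{
    show(1000.0, 100.0);        /* few tuples: about one fetch apiece */
    show(1000.0, 2000.0);       /* #tuples comparable to #pages */
    show(1000.0, 50000.0);      /* #tuples much larger than #pages */
    return 0;
}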

Right now these considerations are divided between cost_index() and
cost_nonsequential_access() in a way that might well be wrong.
I've been intending to dig into the literature and try to find a better
cost model but haven't gotten to it yet.

> it just seems to me that there should be some notion of tuple size
> figured in to know how many tuples fit in a page.

It seemed to me that the critical ratios are #tuples fetched vs #pages
in table and table size vs. cache size.  I could be wrong though...
        regards, tom lane


Re: question about index cost estimates

From: Jeff Hoffmann

Tom Lane wrote:
> If the table is bigger than the disk cache, and we are hitting pages
> more than once, then we are likely to have to fetch some pages more than
> once.  The model is assuming that the tuples we want are scattered more
> or less randomly over the whole table.  If #tuples to be fetched
> approaches or exceeds #pages in table, then clearly we are going to be
> touching some pages more than once because we want more than one tuple
> from them.

i understand that, but still, if you have a tuple that's, say, 144 bytes,
you're going to have a higher chance of revisiting the same page than if
the tuple is 4k.  it may be something that works itself out of the
equation somewhere. 
> The $64 question is did the page fall out of buffer cache
> between references?  Worst case (if your table vastly exceeds your
> available buffer space), practically every tuple fetch will require a
> separate physical read, and then the number of page fetches is
> essentially the number of tuples returned --- which of course can be
> a lot more than the number of pages in the table.

i understand that, too.  i understand the need for a curve like that,
maybe my question should be the slope of the curve.  it turns out that
playing around with my queries, it seems like when i was selecting about
10% of the records, the page fetch estimate is pretty accurate, although
when the selectivity falls to significantly below that, the page fetch
estimate stays quite a bit higher than actual.  

part of what i'm doing is looking at the executor stats for each query. 
can you explain what the shared blocks, local blocks, and direct blocks
mean?  i'm assuming that the shared blocks refer to the reads to the
database files & indexes.  what do the others mean?  are the shared
blocks shared amongst all of the backends or does each have its own
pool?  i'm getting a 90%+ buffer hit rate on index scans, but i'm
assuming that's because i'm the only one doing anything on this machine
right now and that would go down with more processes.

> 
> Right now these considerations are divided between cost_index() and
> cost_nonsequential_access() in a way that might well be wrong.
> I've been intending to dig into the literature and try to find a better
> cost model but haven't gotten to it yet.

well, there have been a lot of things that i've looked at and thought
"that doesn't sound right", like for example the random page access
cost, which defaults to 4.0.  i thought that was too high, but i ran
bonnie on my hard drive and it looks like the actual value is around
3.9.  so now i'm a believer.  somebody must have put some thought into
it somewhere.  i'm still taking everything with a grain of salt until i can
explain it, though.

jeff


Re: question about index cost estimates

From: Tom Lane

Jeff Hoffmann <jeff@propertykey.com> writes:
> i understand that, but still, if you have a tuple that's, say, 144 bytes,
> you're going to have a higher chance of revisiting the same page than if
> the tuple is 4k.

Are you?  If you are pulling ten tuples from a thousand-page relation,
seems to me the odds of revisiting any one page are somewhere around
0.01 (too lazy to do the exact combinatorics right now).  Doesn't really
matter what the average tuple size is --- or if you prefer, we already
accounted for that because we are looking at target tuples divided by
total pages rather than target tuples divided by total tuples.
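
(Doing the combinatorics under that uniform-scatter assumption isn't
actually hard: the expected number of distinct pages touched when T
tuples are drawn independently from a P-page table is
P * (1 - (1 - 1/P)^T), so the expected number of repeat visits is T
minus that --- well under one page for this example.  A quick sketch,
not anything from the planner:)

#include <math.h>
#include <stdio.h>

/* Expected number of distinct pages touched when T tuples are drawn
 * uniformly and independently from a P-page table. */
double
expected_distinct_pages(double P, double T)
{
    return P * (1.0 - pow(1.0 - 1.0 / P, T));
}

int
main(void)
{
    double  P = 1000.0;
    double  T = 10.0;
    double  distinct = expected_distinct_pages(P, T);

    printf("tuples fetched:          %.0f\n", T);
    printf("expected distinct pages: %.3f\n", distinct);
    printf("expected repeat visits:  %.3f\n", T - distinct);
    return 0;
}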

> maybe my question should be the slope of the curve.  it turns out that
> playing around with my queries, it seems like when i was selecting about
> 10% of the records, the page fetch estimate is pretty accurate, although
> when the selectivity falls to significantly below that, the page fetch
> estimate stays quite a bit higher than actual.  

Hmm.  I would think that the behavior for very low selectivity ought to
be pretty close to the model: it's hard to see how it can be anything
but one page fetch per selected tuple, unless you are seeing strong
clustering effects.  I do *not* claim that the slope of the curve is
necessarily right for higher selectivities though... that needs to be
looked at.

> part of what i'm doing is looking at the executor stats for each query. 
> can you explain what the shared blocks, local blocks, and direct blocks
> mean?

The filesystem blocks in/out are from the kernel getrusage() call.
On my box, at least, these seem to mean physical I/O operations
initiated on behalf of the process --- AFAICT, touching a page that
is already in kernel disk buffers does not increment the filesystem
blocks count.
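
(If you want to watch those counters yourself, they're just the block
I/O fields of the rusage struct --- a minimal sketch, assuming
ru_inblock/ru_oublock are what your kernel fills in:)

#include <stdio.h>
#include <sys/time.h>
#include <sys/resource.h>

int
main(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) == 0)
    {
        /* block input/output operations charged to this process */
        printf("filesystem blocks in:  %ld\n", ru.ru_inblock);
        printf("filesystem blocks out: %ld\n", ru.ru_oublock);
    }
    return 0;
}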

The other numbers are from PrintBufferUsage() in bufmgr.c.  It looks
like the shared block counts are the number of block read and write
requests made to the kernel by bufmgr.c, and the hit rate indicates
the fraction of buffer fetch requests made to bufmgr.c that were
satisfied in Postgres' own buffer area (ie, without a kernel request).

The local block counts are the same numbers for non-shared relations,
which basically means tables created in the current transaction.
They'll probably be zero in most cases of practical interest.

The "direct" counts seem to be broken at the moment --- I can't find
any code that increments them.  It looks like the intent was to count
block I/O operations on temporary files (sorttemp and suchlike).
This is not of interest for pure indexscans...

> are the shared blocks shared amongst all of the backends or does each
> have its own pool?

Shared among all --- that's what the shared memory block is (mostly)
for...

> i'm getting a 90%+ buffer hit rate on index scans,
> but i'm assuming that's because i'm the only one doing anything on
> this machine right now and that would go down with more processes.

It also suggests that your test table isn't much bigger than the buffer
cache ;-) --- or at least the part of it that you're touching isn't.

> i'm still taking everything with a grain of salt until i can
> explain it, though.

Good man.  I don't think anyone but me has looked at this stuff in a
long while, and it could use review.
        regards, tom lane


RE: question about index cost estimates

From: "Hiroshi Inoue"

> -----Original Message-----
> From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
> Behalf Of Jeff Hoffmann
>

[snip] 

> 
>  pages_fetched = ceil(baserel->pages * log(tuples_fetched /
> baserel->pages + 1.0));
>

Unfortunately I didn't understand this well either.

pages_fetched seems to be able to be greater than
baserel->pages. But if there's sufficiently large buffer
space pages_fetched would be <= baserel->pages.
Are there any assumptions about buffer space?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp 


Re: question about index cost estimates

From: Jeff Hoffmann

Tom Lane wrote:
> 
> Jeff Hoffmann <jeff@propertykey.com> writes:
> > i understand that, but still, if you have a tuple that's, say, 144 bytes,
> > you're going to have a higher chance of revisiting the same page than if
> > the tuple is 4k.
> 
> Are you?  If you are pulling ten tuples from a thousand-page relation,
> seems to me the odds of revisiting any one page are somewhere around
> 0.01 (too lazy to do the exact combinatorics right now). 

well, for something like 10 tuples out of say 50,000 - 100,000 records,
that probably wouldn't come into play and you'd almost definitely be
talking 1 page access per tuple.  i was thinking it would come into play
at higher selectivities.  actually, i don't know that i'm thinking
through the probabilities and formulas so much as just brainstorming
what things _might_ be coming into play.

> Doesn't really
> matter what the average tuple size is --- or if you prefer, we already
> accounted for that because we are looking at target tuples divided by
> total pages rather than target tuples divided by total tuples.

i thought about that after i sent my message.  i find that a lot of
times, i'll answer my own question roughly 15 seconds after sending the
message.  i'm still trying to figure out how that happens.

> > i'm still taking everything with a grain of salt until i can
> > explain it, though.
> 
> Good man.  I don't think anyone but me has looked at this stuff in a
> long while, and it could use review.

i can tell you that the more i look at things and start understanding
the logic, the more it seems that the numbers are fairly reasonable if
you don't or can't know a lot of the factors like caching or
clustering.   unfortunately those factors can make a lot of difference
in actual results.  i was looking at the cost estimate of an rtree index
scan, but it turns out that when you add in the page accesses for the
actual tuples, it's rare to see an index scan cost that matters to the
total cost for reasonably large tables.  because of that, i'm not sure
what the value of fixing the rtree index scan cost estimator would be,
other than making it semantically correct.  it seems in the big picture, we're
subject to the whims of the disk cache more than anything.

i think i'm starting to see what you said about fixing selectivity being
more important, although i'm still not convinced that making the
selectivity more accurate is going to come close to solving the
problem.  the selectivity problem is better defined, though, and
therefore easier to fix than the other problems in cost estimation.  i'm
assuming you're considering adding some sort of histogram to the stats
that vacuum collects.  is that something that you're seriously
considering or do you have another plan to introduce a useful concept of
the distribution of an attribute?  the concept of making a histogram on
a b-tree is pretty simple, but if you're going to do it, you should
probably be taking into account r-trees, where you'd need a 2-d
histogram making the job just a bit tougher.

jeff


Re: question about index cost estimates

From: Jeff Hoffmann

Hiroshi Inoue wrote:
> 
> > -----Original Message-----
> > From: pgsql-hackers-owner@hub.org [mailto:pgsql-hackers-owner@hub.org]On
> > Behalf Of Jeff Hoffmann
> >
> 
> [snip]
> 
> >
> >  pages_fetched = ceil(baserel->pages * log(tuples_fetched /
> > baserel->pages + 1.0));
> >
> 
> Unfortunately I didn't understand this well either.
> 
> pages_fetched seems to be able to be greater than
> baserel->pages. 

not only does it seem that way, you can expect it to happen fairly
frequently, even if you're pulling only 1-2% of the records with a
query.  if you don't believe it, check the actual performance of a few
queries.

> But if there's sufficiently large buffer
> space pages_fetched would be <= baserel->pages.
> Are there any assupmtions about buffer space ?
> 

the # of pages fetched would be the same; it'd just be cheaper to pull
them from the buffer instead of from disk.  that's what isn't being
taken into consideration properly in the estimate.

the real question is what assumptions can you make about buffer space? 
you don't know how many concurrent accesses there are (all sharing
buffer space).  i also don't think you can count on knowing the size of
the buffer space.  therefore, the buffer space is set to some constant
intermediate value & it is taken into account, at least in
cost_nonsequential_access().

the question is this: shouldn't you be able to make an educated guess at
this by dividing the total buffer space allocated by the backend by the
number of postmaster processes running at the time?  or don't you know
those things?

jeff


Re: question about index cost estimates

From: Tom Lane

Jeff Hoffmann <jeff@propertykey.com> writes:
> it seems in the big picture, we're subject to the whims of the disk
> cache more than anything.

Ain't that the truth.  Maybe we ought to think about whether there's any
way to find out how big the kernel's disk cache is.  Even if it only
worked on some platforms, we'd be no worse off on the other ones...

> I'm assuming you're considering adding some sort of histogram to the
> stats that vacuum collects.  is that something that you're seriously
> considering or do you have another plan to introduce a useful concept
> of the distribution of an attribute?  the concept of making a
> histogram on a b-tree is pretty simple, but if you're going to do it,
> you should probably be taking into account r-trees, where you'd need a
> 2-d histogram making the job just a bit tougher.

I was considering a histogram for b-trees, but I have to admit I hadn't
thought about r-trees.  Seems like a 2-D histogram would be too bulky
to be feasible.  Could we get any mileage out of two 1-D histograms
(ie, examine the two coordinates independently)?  Or is that too
simplistic to be worth bothering with?
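
What I have in mind is roughly the following (a sketch only --- none of
this exists in the code): estimate each axis against its own equi-width
histogram, then multiply the two selectivities, which of course bakes in
the assumption that the coordinates are independent:

#include <stdio.h>

#define NBUCKETS 10

/* Hypothetical equi-width histogram over [lo, hi]; counts[i] is the
 * number of sample values falling into bucket i. */
typedef struct
{
    double  lo;
    double  hi;
    double  counts[NBUCKETS];
    double  total;
} Histogram;

/* Estimated fraction of values falling in [a, b], counting partially
 * overlapped buckets fractionally. */
double
hist_selectivity(const Histogram *h, double a, double b)
{
    double  width = (h->hi - h->lo) / NBUCKETS;
    double  hits = 0.0;
    int     i;

    for (i = 0; i < NBUCKETS; i++)
    {
        double  blo = h->lo + i * width;
        double  bhi = blo + width;
        double  olo = (a > blo) ? a : blo;
        double  ohi = (b < bhi) ? b : bhi;

        if (ohi > olo)
            hits += h->counts[i] * (ohi - olo) / width;
    }
    return (h->total > 0) ? hits / h->total : 0.0;
}

/* Selectivity of a 2-D box query, treating the axes as independent. */
double
box_selectivity(const Histogram *hx, const Histogram *hy,
                double x1, double x2, double y1, double y2)
{
    return hist_selectivity(hx, x1, x2) * hist_selectivity(hy, y1, y2);
}

The multiplication step is where the simplistic part comes in, obviously.
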
        regards, tom lane


Re: question about index cost estimates

From: Tom Lane

Jeff Hoffmann <jeff@propertykey.com> writes:
> the question is this: shouldn't you be able to make an educated guess at
> this by dividing the total buffer space allocated by the backend by the
> number of postmaster processes running at the time?  or don't you know
> those things?

Two things here:

One, we could easily find out the number of active backends, and we
certainly know the number of shared disk buffers.  BUT: it'd be a
debugging nightmare if the planner's choices depended on the number
of other backends that were running at the instant of planning.  Even
though that'd theoretically be the right thing to do, I don't think
we want to go there.  (If you want an argument less dependent on
mere implementation convenience, consider that in many scenarios
the N backends will be accessing more or less the same set of tables.
So the assumption that each backend only gets the use of 1/N of the
shared buffer space is too pessimistic anyway.)

Two, the Postgres shared buffer cache is only the first-line cache.
We also have the Unix kernel's buffer cache underneath us, though
we must share it with whatever else is going on on the machine.
As far as I've been able to measure there is relatively little cost
difference between finding a page in the Postgres cache and finding
it in the kernel cache --- certainly a kernel call is still much
cheaper than an actual disk access.  So the most relevant number
seems to be the fraction of the kernel's buffer cache that's
effectively available to Postgres.  Right now we have no way at
all to measure that number, so we punt and treat it as a user-
settable parameter (I think I made the default setting 10Mb or so).
It'd be worthwhile looking into whether we can do better than
guessing about the kernel cache size.
        regards, tom lane


Re: question about index cost estimates

From: Tom Lane

"Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> pages_fetched seems to be able to be greater than
> baserel->pages. But if there's sufficiently large buffer
> space pages_fetched would be <= baserel->pages.
> Are there any assumptions about buffer space?

Right now cost_index doesn't try to account for that, because
it doesn't have any way of knowing the relevant buffer-space
parameter.  (As I said to Jeff, we have to consider kernel
buffer space not just the number of Postgres shared buffers.)

cost_nonsequential_access does have a dependence on (a totally
bogus estimate of) effective cache size, but it's a considerably
weaker dependence than you suggest above.  If we had a reliable
estimate of cache size I'd be inclined to restructure this code
quite a bit...
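
To give the flavor of what's in there (this is a paraphrase, not the
actual function --- the real curve and constants differ): the per-page
cost is pushed toward sequential cost when the relation is presumed to
fit in the effective cache, and toward random_page_cost when it is much
bigger than the cache:

#include <stdio.h>

/* NOT the actual cost_nonsequential_access() --- just a sketch of the
 * general idea: charge close to sequential cost (1.0) when the whole
 * relation is presumed cached, approaching random_page_cost when the
 * relation dwarfs the assumed effective cache.  This one interpolates
 * linearly; the real function's curve is different. */
double
nonsequential_page_cost(double relpages, double effective_cache_pages,
                        double random_page_cost)
{
    double  cached_fraction;

    if (relpages <= 0)
        return random_page_cost;

    cached_fraction = effective_cache_pages / relpages;
    if (cached_fraction > 1.0)
        cached_fraction = 1.0;

    return random_page_cost - (random_page_cost - 1.0) * cached_fraction;
}

int
main(void)
{
    /* e.g. 10000-page table, ~10MB assumed cache, default cost 4.0 */
    printf("%.2f\n", nonsequential_page_cost(10000.0, 1250.0, 4.0));
    return 0;
}
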
        regards, tom lane


RE: question about index cost estimates

From: "Hiroshi Inoue"

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> 
> "Hiroshi Inoue" <Inoue@tpf.co.jp> writes:
> > pages_fetched seems to be able to be greater than
> > baserel->pages. But if there's sufficiently large buffer
> > space pages_fetched would be <= baserel->pages.
> > Are there any assumptions about buffer space?
> 
> Right now cost_index doesn't try to account for that, because
> it doesn't have any way of knowing the relevant buffer-space
> parameter.  (As I said to Jeff, we have to consider kernel
> buffer space not just the number of Postgres shared buffers.)
> 
> cost_nonsequential_access does have a dependence on (a totally
> bogus estimate of) effective cache size, but it's a considerably
> weaker dependence than you suggest above.

Thanks.  I asked just to confirm, because I didn't understand whether
effective cache size was relevant to the calculation or not.

> If we had a reliable
> estimate of cache size I'd be inclined to restructure this code
> quite a bit...
>

Yes, I know that a reliable estimate is very significant, but
unfortunately I have no idea.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp 


Re: question about index cost estimates

From: Tom Lane

Chris Bitmead <chrisb@nimrod.itg.telstra.com.au> writes:
> On the postgres side, what if you actually made the cache _smaller_,
> only caching important stuff like system tables or indexes. You yourself
> said it doesn't matter if you get it from the cache or the kernel, so
> why not let the kernel do it and prevent double buffering?

In fact I don't think it's productive to use an extremely large -B
setting.  This is something that easily could be settled by experiment
--- do people actually see any significant improvement in performance
from increasing -B above, say, a few hundred?
        regards, tom lane


Re: question about index cost estimates

From: Jeff Hoffmann

Tom Lane wrote:

> I was considering a histogram for b-trees, but I have to admit I hadn't
> thought about r-trees.  Seems like a 2-D histogram would be too bulky
> to be feasible.  Could we get any mileage out of two 1-D histograms
> (ie, examine the two coordinates independently)?  Or is that too
> simplistic to be worth bothering with?
> 

i don't think it would do any good to look at them separately.  you
might as well just assume a uniform distribution if you're going to do
that.  

does anybody on the list know anything about fractals & wavelets?  i
know _nothing_ about them, but i know that you can use wavelets to
compress photos (i.e., a 2-d data source) and my understanding is that
you're essentially converting the image into a mathematical function. 
it's also my understanding that wavelets work well on multiple scales,
i.e., you can zoom in on a picture to get more detail or zoom out and
get less detail.  my thought is if you've got a histogram, would
something like this be useful?
like i said, i have no idea how it works or if it is of any interest or
if it's even practical, just throwing something out there.

jeff


Re: question about index cost estimates

From: Tom Lane

Jeff Hoffmann <jeff@propertykey.com> writes:
> does anybody on the list know anything about fractals & wavelets?

Now there's an interesting idea: regard the stats as a lossy compression
of the probability density of the original dataset.  Hmm ... this
doesn't do anything for the problem of computing the pdf cheaply to
begin with, but it might help with storing it compactly in pg_statistic.
        regards, tom lane


Re: question about index cost estimates

From: Jeff Hoffmann

Tom Lane wrote:
> 
> Jeff Hoffmann <jeff@propertykey.com> writes:
> > does anybody on the list know anything about fractals & wavelets?
> 
> Now there's an interesting idea: regard the stats as a lossy compression
> of the probability density of the original dataset.  Hmm ... this
> doesn't do anything for the problem of computing the pdf cheaply to
> begin with, but it might help with storing it compactly in pg_statistic.
> 
>                         regards, tom lane

yeah, that's exactly what i meant.  you can probably tell math &
statistics aren't my strongest points.  i'm trying to learn a little
about fractals because about all i knew before today was those pretty
little pictures.  just doing a quick search for their use with
databases, though, i found at least one paper on selectivity estimates &
fractals which i'm going to read.  just for your reference, the paper is
from VLDB and it's called "Estimating the Selectivity of Spatial Queries
Using the 'Correlation' Fractal Dimension".  here's the link:
http://www.vldb.org/conf/1995/P299.PDF.  i've only read the abstract,
but it sounds pretty promising.  i'm starting to wish that i knew what
was going on -- this is getting to be interesting to me.  i've been
using postgresql for years, but up until the last few days, i was trying
to treat it as a black box as much as possible -- i probably should have
tried getting involved a lot earlier...

jeff


Re: question about index cost estimates

From: Peter Eisentraut

Jeff Hoffmann writes:

> just for your reference, the paper is from VLDB and it's called
> "Estimating the Selectivity of Spatial Queries Using the 'Correlation'
> Fractal Dimension".  here's the link:
> http://www.vldb.org/conf/1995/P299.PDF.  i've only read the abstract,
> but it sounds pretty promising.

That paper was very interesting.  I think it might in fact contain the
answer for writing smart selectivity estimators for geometrical data;
now someone only needs to be brave enough to try to code it.  Btw., it
also contains a number of interesting references regarding R-trees and
selectivity estimation in general.

Whodda thunk there's genuine information to be found on the Web ... :)


-- 
Peter Eisentraut                  Sernanders väg 10:115
peter_e@gmx.net                   75262 Uppsala
http://yi.org/peter-e/            Sweden