Re: question about index cost estimates - Mailing list pgsql-hackers

From: Jeff Hoffmann
Subject: Re: question about index cost estimates
Msg-id: 39237277.F22C4C12@propertykey.com
In response to: question about index cost estimates (Jeff Hoffmann <jeff@propertykey.com>)
Responses: Re: question about index cost estimates (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
Tom Lane wrote:
> 
> Jeff Hoffmann <jeff@propertykey.com> writes:
> > i understand that, but still, if you have a tuple that's, say, 144 bytes,
> > you're going to have a higher chance of revisiting the same page than if
> > the tuple is 4k.
> 
> Are you?  If you are pulling ten tuples from a thousand-page relation,
> seems to me the odds of revisiting any one page are somewhere around
> 0.01 (too lazy to do the exact combinatorics right now). 

well, for something like 10 tuples out of, say, 50,000 - 100,000
records, that probably wouldn't come into play and you'd almost
definitely be talking 1 page access per tuple.  i was thinking it would
come into play at higher selectivities.  actually, i don't know if i'm
thinking through the probabilities and formulas required so much as
just brainstorming about what things _might_ be coming into play.
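
for concreteness, here's a quick python sketch of those combinatorics
(a deliberate simplification: it assumes each fetched tuple's page is
chosen independently and uniformly, which a real heap isn't):

    def p_any_revisit(k, pages):
        # probability that at least two of k fetched tuples land on
        # the same page -- the birthday-problem calculation
        p_all_distinct = 1.0
        for i in range(k):
            p_all_distinct *= (pages - i) / pages
        return 1.0 - p_all_distinct

    def expected_distinct_pages(k, pages):
        # expected number of distinct pages touched by k uniform fetches
        return pages * (1.0 - (1.0 - 1.0 / pages) ** k)

    print(p_any_revisit(10, 1000))            # ~0.044
    print(expected_distinct_pages(10, 1000))  # ~9.96 pages for 10 tuples

note that tuple size never shows up -- it only enters through the page
count, which is exactly the "tuples divided by total pages" point tom
makes below.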

> Doesn't really
> matter what the average tuple size is --- or if you prefer, we already
> accounted for that because we are looking at target tuples divided by
> total pages rather than target tuples divided by total tuples.

i thought about that after i sent my message.  i find that a lot of
times, i'll answer my own question roughly 15 seconds after sending the
message.  i'm still trying to figure out how that happens.

> > i'm still taking everything with a grain of salt until i can
> > explain it, though.
> 
> Good man.  I don't think anyone but me has looked at this stuff in a
> long while, and it could use review.

i can tell you that the more i look at things and start understanding
the logic, the more it seems that the numbers are fairly reasonable if
you don't or can't know a lot of the factors like caching or
clustering.  unfortunately those factors can make a lot of difference
in actual results.  i was looking at the cost estimate of an rtree
index scan, but it turns out that when you add in the page accesses for
the actual tuples, it's rare to see an index scan cost that matters to
the total cost for reasonably large tables.  because of that, i'm not
sure what the value of fixing the rtree index scan cost estimator would
be, other than making it semantically correct.  it seems that in the
big picture, we're subject to the whims of the disk cache more than
anything.
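
to put rough numbers on that, here's a toy python model (emphatically
not the real estimator, just illustrative arithmetic with made-up
table sizes):

    def toy_indexscan_cost(index_pages, heap_tuples, sel):
        # index component: fraction of index pages visited; heap
        # component: one page fetch per matching tuple (pessimistic --
        # no clustering, no caching)
        index_cost = sel * index_pages
        heap_cost = sel * heap_tuples
        return index_cost, heap_cost

    # hypothetical million-row table with a 3,000-page index
    print(toy_indexscan_cost(3000, 1000000, 0.01))  # (30.0, 10000.0)

even if the index-descent estimate were off by a factor of two, it's
rounding error next to the heap fetches.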

i think i'm starting to see what you said about fixing selectivity being
more important, although i'm still not convinced that making the
selectivity more accurate is going to come close to solving the
problem.  the selectivity problem is better defined, though, and
therefore easier to fix than the other problems in cost estimation.  i'm
assuming you're considering adding some sort of histogram to the stats
that vacuum collects.  is that something that you're seriously
considering or do you have another plan to introduce a useful concept of
the distribution of an attribute?  the concept of making a histogram
on a b-tree attribute is pretty simple, but if you're going to do it,
you should probably take r-trees into account too, where you'd need a
2-d histogram, making the job just a bit tougher.
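
to make that concrete, here's a minimal python sketch of an equi-depth
histogram plus a range-selectivity estimate from it (the function names
and bin count are invented for illustration; the r-tree analogue would
be a 2-d grid of cells counting bounding boxes per cell):

    import random

    def build_histogram(values, nbins=10):
        # pick bin boundaries so each bin holds ~1/nbins of the rows
        vs = sorted(values)
        return [vs[(i * (len(vs) - 1)) // nbins] for i in range(nbins + 1)]

    def estimate_selectivity(bounds, lo, hi):
        # estimated fraction of rows in [lo, hi], assuming rows are
        # spread evenly within each bin
        nbins = len(bounds) - 1
        frac = 0.0
        for i in range(nbins):
            b0, b1 = bounds[i], bounds[i + 1]
            if hi <= b0 or lo >= b1 or b1 == b0:
                continue  # no overlap, or a degenerate all-duplicate bin
            overlap = min(hi, b1) - max(lo, b0)
            frac += (overlap / (b1 - b0)) / nbins
        return frac

    data = [random.gauss(0.0, 1.0) for _ in range(10000)]
    bounds = build_histogram(data)
    print(estimate_selectivity(bounds, -1.0, 1.0))  # ~0.68 for normal data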

jeff

