Re: question about index cost estimates - Mailing list pgsql-hackers

From: Jeff Hoffmann
Subject: Re: question about index cost estimates
Msg-id: 39237277.F22C4C12@propertykey.com
In response to: question about index cost estimates (Jeff Hoffmann <jeff@propertykey.com>)
Responses: Re: question about index cost estimates (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
Tom Lane wrote:
> 
> Jeff Hoffmann <jeff@propertykey.com> writes:
> > i understand that, but still, if you have a tuple that's, say, 144 bytes,
> > you're going to have a higher chance of revisiting the same page than if
> > the tuple is 4k.
> 
> Are you?  If you are pulling ten tuples from a thousand-page relation,
> seems to me the odds of revisiting any one page are somewhere around
> 0.01 (too lazy to do the exact combinatorics right now). 

well, for something like 10 tuples out of, say, 50,000 - 100,000
records, that probably wouldn't come into play and you'd almost
definitely be talking 1 page access per tuple.  i was thinking it would
come into play at higher selectivities.  actually, i don't know if i'm
thinking through the probabilities and formulas required so much as
just brainstorming about what things _might_ be coming into play.
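
for concreteness, here's a quick python sketch of those combinatorics
(a deliberate simplification: it assumes each fetched tuple's page is
chosen independently and uniformly, which a real heap isn't):

    def p_any_revisit(k, pages):
        # probability that at least two of k fetched tuples land on
        # the same page -- the birthday-problem calculation
        p_all_distinct = 1.0
        for i in range(k):
            p_all_distinct *= (pages - i) / pages
        return 1.0 - p_all_distinct

    def expected_distinct_pages(k, pages):
        # expected number of distinct pages touched by k uniform fetches
        return pages * (1.0 - (1.0 - 1.0 / pages) ** k)

    print(p_any_revisit(10, 1000))            # ~0.044
    print(expected_distinct_pages(10, 1000))  # ~9.96 pages for 10 tuples

note that tuple size never shows up -- it only enters through the page
count, which is exactly the "tuples divided by total pages" point tom
makes below.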

> Doesn't really
> matter what the average tuple size is --- or if you prefer, we already
> accounted for that because we are looking at target tuples divided by
> total pages rather than target tuples divided by total tuples.

i thought about that after i sent my message.  i find that a lot of
times, i'll answer my own question roughly 15 seconds after sending the
message.  i'm still trying to figure out how that happens.

> > i'm still taking everything with a grain of salt until i can
> > explain it, though.
> 
> Good man.  I don't think anyone but me has looked at this stuff in a
> long while, and it could use review.

i can tell you that the more i look at things and start understanding
the logic, the more it seems that the numbers are fairly reasonable if
you don't or can't know a lot of the factors like caching or
clustering.  unfortunately those factors can make a lot of difference
in actual results.  i was looking at the cost estimate of an rtree
index scan, but it turns out that when you add in the page accesses for
the actual tuples, it's rare to see an index scan cost that matters to
the total cost for reasonably large tables.  because of that, i'm not
sure what the value of fixing the rtree index scan cost estimator would
be, other than making it semantically correct.  it seems that in the
big picture, we're subject to the whims of the disk cache more than
anything.
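
to put rough numbers on that, here's a toy python model (emphatically
not the real estimator, just illustrative arithmetic with made-up
table sizes):

    def toy_indexscan_cost(index_pages, heap_tuples, sel):
        # index component: fraction of index pages visited; heap
        # component: one page fetch per matching tuple (pessimistic --
        # no clustering, no caching)
        index_cost = sel * index_pages
        heap_cost = sel * heap_tuples
        return index_cost, heap_cost

    # hypothetical million-row table with a 3,000-page index
    print(toy_indexscan_cost(3000, 1000000, 0.01))  # (30.0, 10000.0)

even if the index-descent estimate were off by a factor of two, it's
rounding error next to the heap fetches.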

i think i'm starting to see what you said about fixing selectivity being
more important, although i'm still not convinced that making the
selectivity more accurate is going to come close to solving the
problem.  the selectivity problem is better defined, though, and
therefore easier to fix than the other problems in cost estimation.  i'm
assuming you're considering adding some sort of histogram to the stats
that vacuum collects.  is that something that you're seriously
considering or do you have another plan to introduce a useful concept of
the distribution of an attribute?  the concept of making a histogram
on a b-tree attribute is pretty simple, but if you're going to do it,
you should probably take r-trees into account too, where you'd need a
2-d histogram, making the job just a bit tougher.
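
to make that concrete, here's a minimal python sketch of an equi-depth
histogram plus a range-selectivity estimate from it (the function names
and bin count are invented for illustration; the r-tree analogue would
be a 2-d grid of cells counting bounding boxes per cell):

    import random

    def build_histogram(values, nbins=10):
        # pick bin boundaries so each bin holds ~1/nbins of the rows
        vs = sorted(values)
        return [vs[(i * (len(vs) - 1)) // nbins] for i in range(nbins + 1)]

    def estimate_selectivity(bounds, lo, hi):
        # estimated fraction of rows in [lo, hi], assuming rows are
        # spread evenly within each bin
        nbins = len(bounds) - 1
        frac = 0.0
        for i in range(nbins):
            b0, b1 = bounds[i], bounds[i + 1]
            if hi <= b0 or lo >= b1 or b1 == b0:
                continue  # no overlap, or a degenerate all-duplicate bin
            overlap = min(hi, b1) - max(lo, b0)
            frac += (overlap / (b1 - b0)) / nbins
        return frac

    data = [random.gauss(0.0, 1.0) for _ in range(10000)]
    bounds = build_histogram(data)
    print(estimate_selectivity(bounds, -1.0, 1.0))  # ~0.68 for normal data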

jeff

