Re: More thoughts about planner's cost estimates - Mailing list pgsql-hackers

From Josh Berkus
Subject Re: More thoughts about planner's cost estimates
Date
Msg-id 200606011350.26364.josh@agliodbs.com
Whole thread Raw
In response to Re: More thoughts about planner's cost estimates  (Greg Stark <gsstark@mit.edu>)
Responses Re: More thoughts about planner's cost estimates  (Josh Berkus <josh@agliodbs.com>)
Re: More thoughts about planner's cost estimates  (Greg Stark <gsstark@mit.edu>)
Re: More thoughts about planner's cost estimates  (Kenneth Marshall <ktm@it.is.rice.edu>)
List pgsql-hackers
Greg, Tom,

> a) We already use block based sampling to reduce overhead. If you're
> talking about using the entire block and not just randomly sampled
> tuples from within those blocks then your sample will be biased.

There are actually some really good equations to work with this, estimating 
both row correlation and n-distinct based on sampling complete blocks.  
See prior discussions on this list on N-distinct.

> 1) You have n^2 possible two-column combinations. That's a lot of
> processing and storage.

Yes, that's the hard problem to solve.  Actually, btw, it's n!, not n^2.

> 2) It isn't even clear what data you're exactly looking for. Certainly
>    "correlation" is just shorthand here and isn't what you're actually
> looking for. 

Actually, I'd think that a correlation number estimate (0 = complete 
uncorrelated, 1 = completely correlated) would be sufficient to improve 
row count estimation significantly, without incurring the vast overhead of 
histogramXhistorgram manipulation.

> Those are all valid points, but pretty much orthogonal to what I'm on
> about today.  The problems I'm thinking about are seen even when the
> planner's rowcount estimates are dead on (as indeed they mostly were
> in Philippe's example).

OK, I was afraid they were interdependant.  You would know better than me.

> Well, it'll still exist as a GUC for the same reasons the other ones are
> GUCs.  But I think the main reasons people have needed to twiddle it are
>     (1) their database is completely RAM-resident (where RPC
>         *should* logically be 1.0), or
>     (2) they're trying to compensate for the overestimation of
>         nestloop indexscans.
> The changes I'm proposing should help with (2).

Right.  What I'm saying is that (1) should be derived from 
estimated_cache_size and DBSIZE, not by setting an additional GUC.

> >     (1) the frequency with which that index is accessed, modified by
> >     (2) whether the index is a fraction of available ram, or larger than
> > ram e) the probability that the relevant table pages are cached in
> > memory, determined by the same two factors.
>
> These would all be nice things to know, but I'm afraid it's pie in the
> sky.  We have no reasonable way to get those numbers.  (And if we could
> get them, there would be another set of problems, namely plan
> instability: the planner's choices would become very difficult to
> reproduce.)

Hmmm ... I think those numbers are possible to infer, at least on a gross 
level.  Whether the cost of calculating them outweighs the benefit of 
having them is another question, resolvable only through some 
experimentation.

> a good place to start.  The immediate problem is to get an
> infrastructure in place that gives us some numbers to apply equations
> to.

No arguments, of course.

-- 
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco


pgsql-hackers by date:

Previous
From: Martijn van Oosterhout
Date:
Subject: Re: Generalized concept of modules
Next
From: Tom Lane
Date:
Subject: Re: Generalized concept of modules