Re: More thoughts about planner's cost estimates - Mailing list pgsql-hackers
| From | Josh Berkus |
|---|---|
| Subject | Re: More thoughts about planner's cost estimates |
| Date | |
| Msg-id | 200606011350.26364.josh@agliodbs.com |
| In response to | Re: More thoughts about planner's cost estimates (Greg Stark <gsstark@mit.edu>) |
| Responses | Re: More thoughts about planner's cost estimates (3 replies) |
| List | pgsql-hackers |
Greg, Tom,

> a) We already use block based sampling to reduce overhead. If you're
> talking about using the entire block and not just randomly sampled
> tuples from within those blocks then your sample will be biased.

There are actually some really good equations to work with this, estimating
both row correlation and n-distinct based on sampling complete blocks.  See
prior discussions on this list on N-distinct.

> 1) You have n^2 possible two-column combinations. That's a lot of
> processing and storage.

Yes, that's the hard problem to solve.  Actually, btw, it's n!, not n^2.

> 2) It isn't even clear what data you're exactly looking for. Certainly
> "correlation" is just shorthand here and isn't what you're actually
> looking for.

Actually, I'd think that a correlation number estimate (0 = completely
uncorrelated, 1 = completely correlated) would be sufficient to improve row
count estimation significantly, without incurring the vast overhead of
histogram × histogram manipulation.

> Those are all valid points, but pretty much orthogonal to what I'm on
> about today.  The problems I'm thinking about are seen even when the
> planner's rowcount estimates are dead on (as indeed they mostly were
> in Philippe's example).

OK, I was afraid they were interdependent.  You would know better than me.

> Well, it'll still exist as a GUC for the same reasons the other ones are
> GUCs.  But I think the main reasons people have needed to twiddle it are
> (1) their database is completely RAM-resident (where RPC
> *should* logically be 1.0), or
> (2) they're trying to compensate for the overestimation of
> nestloop indexscans.
> The changes I'm proposing should help with (2).

Right.  What I'm saying is that (1) should be derived from
estimated_cache_size and DBSIZE, not by setting an additional GUC.

> > (1) the frequency with which that index is accessed, modified by
> > (2) whether the index is a fraction of available ram, or larger than
> > ram e) the probability that the relevant table pages are cached in
> > memory, determined by the same two factors.
>
> These would all be nice things to know, but I'm afraid it's pie in the
> sky.  We have no reasonable way to get those numbers.  (And if we could
> get them, there would be another set of problems, namely plan
> instability: the planner's choices would become very difficult to
> reproduce.)

Hmmm ... I think those numbers are possible to infer, at least on a gross
level.  Whether the cost of calculating them outweighs the benefit of having
them is another question, resolvable only through some experimentation.

> a good place to start.  The immediate problem is to get an
> infrastructure in place that gives us some numbers to apply equations
> to.

No arguments, of course.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
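To give one concrete shape to the n-distinct equations mentioned above, here is a minimal, non-authoritative sketch of a Haas–Stokes-style ("Duj1") estimate computed from a sample of column values. It assumes a plain row sample rather than the complete-block sampling discussed in the thread, and the function name and structure are illustrative only, not PostgreSQL's ANALYZE code.

```python
from collections import Counter

def estimate_ndistinct(sample_values, total_rows):
    """Haas-Stokes ("Duj1")-style n-distinct estimate from a sample.

    n  = number of sampled values
    d  = number of distinct values seen in the sample
    f1 = number of values that occur exactly once in the sample
    Estimate: n*d / (n - f1 + f1*(n/N)), clamped to a sane range.
    """
    n = len(sample_values)
    counts = Counter(sample_values)
    d = len(counts)
    f1 = sum(1 for c in counts.values() if c == 1)

    if n == 0 or total_rows == 0:
        return 0
    if f1 == n:
        # Every sampled value was unique: treat the column as (nearly) unique.
        return total_rows

    estimate = (n * d) / (n - f1 + f1 * (n / total_rows))
    return int(max(d, min(total_rows, estimate)))


if __name__ == "__main__":
    import random
    random.seed(0)
    # 1,000,000-row table with ~50,000 distinct values; sample 30,000 rows.
    sample = [random.randint(1, 50_000) for _ in range(30_000)]
    print(estimate_ndistinct(sample, 1_000_000))  # lands near 50,000
```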
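On the correlation-number idea (a single 0-to-1 estimate instead of histogram × histogram work), a hedged sketch of how such a number could be folded into a two-column selectivity estimate. The linear blend between the independence assumption and the fully correlated case, and all names here, are assumptions for illustration; this is not existing planner logic.

```python
def combined_selectivity(sel_a: float, sel_b: float, correlation: float) -> float:
    """Blend two single-column selectivities using a 0-1 correlation estimate.

    correlation = 0.0 -> treat the columns as independent (sel_a * sel_b,
                         the planner's current assumption);
    correlation = 1.0 -> treat them as completely correlated, so the combined
                         predicate is no more selective than the more selective
                         of the two (min(sel_a, sel_b)).
    Values in between interpolate linearly.
    """
    independent = sel_a * sel_b
    correlated = min(sel_a, sel_b)
    return independent + correlation * (correlated - independent)


if __name__ == "__main__":
    # Each predicate matches 10% of a 1,000,000-row table.
    rows = 1_000_000
    for rho in (0.0, 0.5, 1.0):
        print(rho, int(rows * combined_selectivity(0.10, 0.10, rho)))
    # 0.0 -> 10,000 rows; 0.5 -> 55,000 rows; 1.0 -> 100,000 rows
```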
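And on deriving reason (1) from estimated_cache_size and DBSIZE rather than another GUC: a rough sketch of one possible scaling of the random-page cost by the estimated cached fraction of the database. The default of 4.0, the linear interpolation, and the function name are all assumptions for illustration, not a proposal for the actual formula.

```python
def derived_random_page_cost(cache_size_bytes: int,
                             db_size_bytes: int,
                             on_disk_cost: float = 4.0) -> float:
    """Scale random_page_cost toward 1.0 as the database fits into cache.

    cached_fraction = 1.0 -> fully RAM-resident: random ~ sequential, cost 1.0
    cached_fraction = 0.0 -> nothing cached: fall back to the on-disk cost
    Anything in between interpolates linearly (purely illustrative).
    """
    cached_fraction = min(1.0, cache_size_bytes / max(1, db_size_bytes))
    return 1.0 + (1.0 - cached_fraction) * (on_disk_cost - 1.0)


if __name__ == "__main__":
    GiB = 1024 ** 3
    cache = 8 * GiB  # estimated cache size
    for db in (4 * GiB, 16 * GiB, 64 * GiB):
        print(f"{db // GiB:>3} GiB database -> "
              f"random_page_cost {derived_random_page_cost(cache, db):.2f}")
    # 4 GiB -> 1.00, 16 GiB -> 2.50, 64 GiB -> 3.62 (approx.)
```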