On Fri, 2005-12-02 at 11:53 +0100, Martijn van Oosterhout wrote:
> On Fri, Dec 02, 2005 at 11:07:06AM +0100, Csaba Nagy wrote:
> > So for me the "hint" mechanism is good for telling the server that I'm
> > not interested at all in the BEST plan but which risks getting very bad
> > on occasions, but in a good enough plan which is safe.
>
> I'm wondering if long term another approach might be to have another
> parameter in the planner, cost_error or selectivity_error which is an
> indication of how accurate we think it is.
>
> So for example an index scan might cost x but with a possible
> error of 15% and the seqscan might cost y but with an error of 1%.
>
> The "error" for nested loop would be the product of the two inputs,
> whereas a merge join would be much less sensitive to error. A sort or
> hash join would react badly to large variations of input.
>
> So in cases where there is a choice between two indexscans, the one that
> is slightly more expensive but more accurate, and that can feed a
> mergejoin, would be a better choice than a possibly highly selective
> index without accurate info that needs to be fed into a nested loop. Even
> though the latter might look better, the former is the "safer" option.
>
> I think this would solve the problem where people see sudden flip-flops
> between good and bad plans. The downside is that it's yet another
> parameter for the planner to get wrong.
Measuring parameters more accurately is a lengthy experimental job, not
a theoretical one. I think we are just waiting for someone to do this.
> Unfortunately, this is the kind of thing people write theses on and I
> don't think many people have the grounding in statistics to make it all
> work.
I'd considered that before; it's just a lot of work.
The theory of error propagation is straightforward: for independent
errors you just combine the errors on the parameters in quadrature,
i.e. take the root of the sum of their squares.
Trouble is, many of the planning parameters are just guesses, so you
have no idea of the error estimates either. Hence you can't really
calculate the error propagation accurately enough to make a sensible
stab at risk control. But it would be useful sometimes, which is about
the best it gets with the planner.
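
As a rough sketch of what that would look like, here is the arithmetic
in Python. It assumes the errors are independent and small, so the
relative errors of a product combine as the root of the sum of squares.
The function names, the risk_weight knob and all the numbers are made up
for illustration (they only echo the 15% and 1% figures quoted above);
nothing like this exists in the planner.

```python
import math


def product_relative_error(*relative_errors):
    """First-order relative error of a product of independent estimates.

    Assumption: errors are independent and small, so they combine in
    quadrature (root of the sum of squares).
    """
    return math.sqrt(sum(e * e for e in relative_errors))


def risk_adjusted_cost(cost, relative_error, risk_weight=1.0):
    """Hypothetical pessimistic cost: the estimate plus risk_weight
    times its propagated error. Not a real planner parameter."""
    return cost * (1.0 + risk_weight * relative_error)


# Nested loop fed by two inputs, each with a 15% error on its estimate.
nestloop_error = product_relative_error(0.15, 0.15)

# Made-up costs: the index plan looks cheaper, the seqscan is better known.
index_plan = risk_adjusted_cost(100.0, nestloop_error, risk_weight=2.0)
seq_plan = risk_adjusted_cost(120.0, 0.01, risk_weight=2.0)

safer = "seqscan" if seq_plan < index_plan else "indexscan"
print(f"index plan {index_plan:.1f}, seqscan {seq_plan:.1f}, pick {safer}")
```

The arithmetic is the easy part. The 0.15 and 0.01 inputs are exactly
the numbers we don't have.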
Right now the worst parts of the planner are:
- the estimation of the number of distinct values, which is an inherent
  statistical limitation
- the need for multi-column interaction statistics
The two are somewhat related.
Best Regards, Simon Riggs