Re: Bad estimate on LIKE matching - Mailing list pgsql-hackers
From: Simon Riggs
Subject: Re: Bad estimate on LIKE matching
Date:
Msg-id: 1137617615.3069.40.camel@localhost.localdomain
In response to: Re: Bad estimate on LIKE matching (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On Wed, 2006-01-18 at 10:37 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Tue, 2006-01-17 at 13:53 +0100, Magnus Hagander wrote:
> >> Any way to teach the planner about this?
>
> > In a recent thread on -perform, I opined that this case could best be
> > solved by using dynamic random block sampling at plan time followed by
> > a direct evaluation of the LIKE against the sample. This would yield a
> > more precise selectivity and lead to the better plan. So it can be
> > improved for the next release.
>
> I find it exceedingly improbable that we'll ever install any such thing.
> On-the-fly sampling of enough rows to get a useful estimate would
> increase planning time by orders of magnitude --- and most of the time
> the extra effort would be unhelpful. In the particular case exhibited
> by Magnus, it is *really* unlikely that any such method would do better
> than we are doing now. He was concerned because the planner failed to
> tell the difference between selectivities of about 1e-4 and 1e-6.
> On-the-fly sampling will do better only if it manages to find some of
> those rows, which it is unlikely to do with a sample size less than
> 1e5 or so rows. With larger tables the problem gets rapidly worse.

Your reply seems too strong; I wish to discuss further improvements, not
to fight. My willingness to do this is inspired by the years of excellent
work that you and others have already contributed.

I am attempting to provide a solution to the general problem. My way of
doing this is to draw on my experience, just as I would draw upon any
other body of knowledge such as academic papers or experimental results.
My thinking is that perhaps Teradata, Oracle and DB2 were right to
implement dynamic sampling for queries. Many things done elsewhere are
wasted filigree, yet some are appropriate ideas that we are free to use.

Accuracy need not be our goal, but a not-higher-than (upper-bound)
selectivity estimate might allow us to avoid the worst-case behaviour
displayed here.

On Wed, 2006-01-11 at 09:07 +0000, Simon Riggs wrote:
> On Tue, 2006-01-10 at 22:40 -0500, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > I meant use the same sampling approach as I was proposing for
> > > ANALYZE, but do this at plan time for the query. That way we can
> > > apply the function directly to the sampled rows and estimate
> > > selectivity.
> >
> > I think this is so unlikely to be a win as to not even be worth
> > spending any time discussing. The extra planning time across all
> > queries will vastly outweigh the occasional improvement in plan
> > choice for some queries.
>
> Extra planning time would be bad, so clearly we wouldn't do this when
> we already have relevant ANALYZE statistics.
>
> I would suggest we do this only when all of these are true:
> - when accessing more than one table, so the selectivity could affect
>   a join result
> - when we have either no ANALYZE statistics, or ANALYZE statistics are
>   not relevant to estimating selectivity, e.g. LIKE
> - when access against the single table in question cannot find an
>   index to use from other RestrictInfo predicates
>
> I imagined that this would also be controlled by a GUC,
> dynamic_sampling, which would be set to zero by default and give a
> measure of the sample size to use. (Or just a bool enable_sampling =
> off by default.)
>
> This is mentioned now because the plan under consideration in this
> thread would be improved by this action. It also isn't a huge amount
> of code to get it to work.

Best Regards,
Simon Riggs
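The sampling mechanism under discussion is simple enough to sketch outside the planner. The standalone C program below is illustrative only, not PostgreSQL code; the synthetic table, the pattern, and names such as `sampled_selectivity` and `SAMPLE_ROWS` are invented for the example. It draws a random sample of rows, applies a substring test standing in for `LIKE '%pattern%'` directly to the sampled values, and reports the observed match fraction as the selectivity estimate:

```c
/*
 * Standalone illustration of the plan-time sampling idea: draw a random
 * sample of rows, apply the LIKE-style predicate directly to the sampled
 * values, and use the observed match fraction as the selectivity estimate.
 * This is not PostgreSQL planner code; the table and names are invented.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define TABLE_ROWS   1000000   /* size of the synthetic "table" */
#define SAMPLE_ROWS  1000      /* plan-time sample size, the knob a GUC would set */

/* Substring test standing in for: col LIKE '%pattern%' */
static int like_contains(const char *value, const char *pattern)
{
    return strstr(value, pattern) != NULL;
}

/* Estimate selectivity by evaluating the predicate against a random sample. */
static double sampled_selectivity(char **rows, int nrows,
                                  const char *pattern, int sample_size)
{
    int matches = 0;

    for (int i = 0; i < sample_size; i++)
    {
        /* pick a random row index in [0, nrows) */
        int idx = (int) (((double) rand() / ((double) RAND_MAX + 1.0)) * nrows);

        if (like_contains(rows[idx], pattern))
            matches++;
    }
    return (double) matches / (double) sample_size;
}

int main(void)
{
    static char *rows[TABLE_ROWS];
    char buf[32];

    srand((unsigned) time(NULL));

    /* Synthetic table: roughly 1 row in 10,000 contains the string "rare". */
    for (int i = 0; i < TABLE_ROWS; i++)
    {
        if (i % 10000 == 0)
            snprintf(buf, sizeof(buf), "value-rare-%d", i);
        else
            snprintf(buf, sizeof(buf), "value-common-%d", i);
        rows[i] = malloc(strlen(buf) + 1);
        strcpy(rows[i], buf);
    }

    double est = sampled_selectivity(rows, TABLE_ROWS, "rare", SAMPLE_ROWS);
    printf("estimated selectivity: %.6f (true value is about 0.0001)\n", est);

    for (int i = 0; i < TABLE_ROWS; i++)
        free(rows[i]);
    return 0;
}
```

With SAMPLE_ROWS set to 1000 and a true selectivity of about 1e-4, the estimate will often come out as exactly zero, which is the limitation Tom describes; the sample size is exactly what the proposed dynamic_sampling GUC would control.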
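To put rough numbers on the sample-size objection: if a predicate has true selectivity p, a uniform random sample of n rows contains at least one matching row with probability

P(\text{at least one match}) \;=\; 1 - (1 - p)^n \;\approx\; 1 - e^{-np}

So for p = 1e-4 a sample of 10,000 rows finds a match with probability about 1 - e^{-1}, roughly 0.63, while for p = 1e-6 the same sample succeeds only about 1% of the time. Separating those two selectivities reliably needs on the order of 1e5 to 1e6 sampled rows, consistent with the figure quoted above.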