Thread: Bad estimate on LIKE matching
I have a table, "path", which is:

 pathid | integer | not null default nextval('path_pathid_seq'::regclass)
 path   | text    | not null
Indexes:
    "path_pkey" PRIMARY KEY, btree (pathid)
    "path_name_idx" btree (path)

The table contains approx 1.2 million rows, all of which are unique. (Both the path field and the name field are unique, though the path_name_idx index is not a unique index.)

On this table, I do a query like:

    SELECT * FROM path WHERE path LIKE 'f:/userdirs/s/super_73/%'

The estimate for this query is completely off, which I believe is the cause of a very bad choice of query plan when it's used in a big join (creating nestloops that end up taking 15+ minutes to complete..).

Explain analyze gives:

                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Index Scan using path_name_idx on path  (cost=0.00..3.24 rows=1 width=74) (actual time=0.035..0.442 rows=214 loops=1)
   Index Cond: ((path >= 'f:/userdirs/s/super'::text) AND (path < 'f:/userdirs/s/supes'::text))
   Filter: (path ~~ 'f:/userdirs/s/super_73%'::text)

No matter what I search on (when it's very selective), the estimate is always 1 row, whereas the actual value is at least a couple of hundred. If I try with say "f:/us", the difference is 377,759 estimated vs 562,459 returned, which is percentage-wise a lot less, but...

I have tried upping the statistics target to 1000, with no changes. Any way to teach the planner about this?

FYI, if I push the result of the select on path into a temp table, and then join with that one in my main query, I get a hashjoin instead, and query time is < 30 seconds instead of 15 minutes...

//Magnus
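(For reference, the statistics-target change mentioned above is usually done per column and followed by a fresh ANALYZE; what the planner ends up with for the column can then be inspected in pg_stats. A minimal sketch, using the table and column names from the post:)

    -- raise the per-column statistics target and regather statistics
    ALTER TABLE path ALTER COLUMN path SET STATISTICS 1000;
    ANALYZE path;

    -- check what the planner now has for this column
    SELECT attname, n_distinct, correlation
    FROM pg_stats
    WHERE tablename = 'path' AND attname = 'path';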
On Tue, 2006-01-17 at 13:53 +0100, Magnus Hagander wrote:
> On this table, I do a query like:
>
>     SELECT * FROM path WHERE path LIKE 'f:/userdirs/s/super_73/%'
>
> The estimate for this query is completely off, which I believe is the
> cause of a very bad choice of query plan when it's used in a big join
> (creating nestloops that end up taking 15+ minutes to complete..).
>
> Explain analyze gives:
>
>                                                       QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------
>  Index Scan using path_name_idx on path  (cost=0.00..3.24 rows=1 width=74) (actual time=0.035..0.442 rows=214 loops=1)
>    Index Cond: ((path >= 'f:/userdirs/s/super'::text) AND (path < 'f:/userdirs/s/supes'::text))
>    Filter: (path ~~ 'f:/userdirs/s/super_73%'::text)
>
> No matter what I search on (when it's very selective), the estimate is
> always 1 row, whereas the actual value is at least a couple of hundred.
> If I try with say "f:/us", the difference is 377,759 estimated vs
> 562,459 returned, which is percentage-wise a lot less, but...
>
> I have tried upping the statistics target to 1000, with no changes.
> Any way to teach the planner about this?

In a recent thread on -perform, I opined that this case could best be solved by using dynamic random block sampling at plan time, followed by a direct evaluation of the LIKE against the sample. This would yield a more precise selectivity and lead to a better plan. So it can be improved for the next release.

Best Regards, Simon Riggs
> > I have tried upping the statistics target to 1000, with no changes.
> > Any way to teach the planner about this?
>
> In a recent thread on -perform, I opined that this case could best be
> solved by using dynamic random block sampling at plan time, followed by
> a direct evaluation of the LIKE against the sample. This would yield a
> more precise selectivity and lead to a better plan. So it can be
> improved for the next release.

I was kinda hoping for something I could use in 8.1 :-) Even if it's an ugly solution for now. (My current workaround of writing the result to a temp table and then joining against the temp table produces a reasonable plan, but I'd like something slightly less ugly than that if possible..)

//Magnus
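(The temp-table workaround Magnus describes looks roughly like the sketch below. The main table is not named in the thread, so "files" and its pathid join column are assumed purely for illustration; the ANALYZE on the temp table is what gives the planner accurate row counts for the join.)

    -- materialize the selective LIKE result
    CREATE TEMP TABLE matching_paths AS
        SELECT pathid, path
        FROM path
        WHERE path LIKE 'f:/userdirs/s/super_73/%';
    ANALYZE matching_paths;

    -- join against the temp table instead of repeating the LIKE in the big query
    SELECT f.*
    FROM files f
    JOIN matching_paths mp ON mp.pathid = f.pathid;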
Simon Riggs <simon@2ndquadrant.com> writes:
> On Tue, 2006-01-17 at 13:53 +0100, Magnus Hagander wrote:
>> Any way to teach the planner about this?

> In a recent thread on -perform, I opined that this case could best be
> solved by using dynamic random block sampling at plan time, followed by
> a direct evaluation of the LIKE against the sample. This would yield a
> more precise selectivity and lead to a better plan. So it can be
> improved for the next release.

I find it exceedingly improbable that we'll ever install any such thing. On-the-fly sampling of enough rows to get a useful estimate would increase planning time by orders of magnitude --- and most of the time the extra effort would be unhelpful.

In the particular case exhibited by Magnus, it is *really* unlikely that any such method would do better than we are doing now. He was concerned because the planner failed to tell the difference between selectivities of about 1e-4 and 1e-6. On-the-fly sampling will do better only if it manages to find some of those rows, which it is unlikely to do with a sample size of less than 1e5 or so rows. With larger tables the problem gets rapidly worse.

			regards, tom lane
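(To put rough numbers on this point, using the figures from Magnus's plan (about 214 matching rows out of roughly 1.2 million): the expected number of matching rows in a plan-time sample is the sample size times the selectivity, which stays well below one until the sample gets very large. A small illustrative calculation:)

    SELECT sample_size,
           round(sample_size * 214.0 / 1200000, 3) AS expected_matches
    FROM (SELECT 1000 AS sample_size
          UNION ALL SELECT 10000
          UNION ALL SELECT 100000) AS s;

    --  sample_size | expected_matches
    -- -------------+------------------
    --         1000 |            0.178
    --        10000 |            1.783
    --       100000 |           17.833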
On Wed, 2006-01-18 at 10:37 -0500, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Tue, 2006-01-17 at 13:53 +0100, Magnus Hagander wrote:
> >> Any way to teach the planner about this?
>
> > In a recent thread on -perform, I opined that this case could best be
> > solved by using dynamic random block sampling at plan time, followed by
> > a direct evaluation of the LIKE against the sample. This would yield a
> > more precise selectivity and lead to a better plan. So it can be
> > improved for the next release.
>
> I find it exceedingly improbable that we'll ever install any such thing.
> On-the-fly sampling of enough rows to get a useful estimate would
> increase planning time by orders of magnitude --- and most of the time
> the extra effort would be unhelpful. In the particular case exhibited
> by Magnus, it is *really* unlikely that any such method would do better
> than we are doing now. He was concerned because the planner failed to
> tell the difference between selectivities of about 1e-4 and 1e-6.
> On-the-fly sampling will do better only if it manages to find some of
> those rows, which it is unlikely to do with a sample size of less than
> 1e5 or so rows. With larger tables the problem gets rapidly worse.

Your reply seems too strong; I wish to discuss further improvements, not fight. My willingness to do this is inspired by the years of excellent work that you and others have already contributed.

I am attempting to provide a solution to the general problem. My way of doing this is to draw on my experience, just as I would draw upon any other body of knowledge, such as academic papers or experimental results. My thinking is that perhaps Teradata, Oracle and DB2 were right to implement dynamic sampling for queries. Many things done elsewhere are wasted filigree, yet some are appropriate ideas that we are free to use.

Accuracy need not be our goal, but a not-higher-than selectivity estimate might allow us to avoid the worst-case behaviour displayed here.

On Wed, 2006-01-11 at 09:07 +0000, Simon Riggs wrote:
> On Tue, 2006-01-10 at 22:40 -0500, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > I meant use the same sampling approach as I was proposing for ANALYZE,
> > > but do this at plan time for the query. That way we can apply the
> > > function directly to the sampled rows and estimate selectivity.
> >
> > I think this is so unlikely to be a win as to not even be worth spending
> > any time discussing. The extra planning time across all queries will
> > vastly outweigh the occasional improvement in plan choice for some
> > queries.
>
> Extra planning time would be bad, so clearly we wouldn't do this when we
> already have relevant ANALYZE statistics.
>
> I would suggest we do this only when all of these are true:
> - when accessing more than one table, so the selectivity could affect a
>   join result
> - when we have either no ANALYZE statistics, or the ANALYZE statistics
>   are not relevant to estimating selectivity, e.g. LIKE
> - when access against the single table in question cannot find an index
>   to use from other RestrictInfo predicates
>
> I imagined that this would also be controlled by a GUC, dynamic_sampling,
> which would be set to zero by default and give a measure of the sample
> size to use. (Or just a bool enable_sampling = off (default).)
>
> This is mentioned now because the plan under consideration in this
> thread would be improved by this action. It also isn't a huge amount of
> code to get it to work.

Best Regards, Simon Riggs
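(Purely to illustrate the proposal quoted above: if the dynamic_sampling GUC Simon describes existed, which it does not in any released PostgreSQL version, usage might look something like this, with the default of 0 keeping today's behaviour:)

    -- hypothetical knob from the proposal; not an existing GUC
    SET dynamic_sampling = 100;   -- sample size to use at plan time; 0 (default) disables it
    SELECT * FROM path WHERE path LIKE 'f:/userdirs/s/super_73/%';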