Thread: Bug in query planner?
Hi,

Philipp Reisner wrote a mail about this problem yesterday. I've now traced it
down to something that looks like a bug in the query planner to me. Please
have a look at this and let me know whether this is a bug or I am completely
wrong.

(This is done right after a complete ANALYZE over the database, so the
statistics are up to date.)

sd-beta=> select n_tup_ins, n_tup_del from pg_stat_user_tables where relname = 'contractelements';
 n_tup_ins | n_tup_del
-----------+-----------
     91821 |         0
(1 row)

sd-beta=> select n_distinct, most_common_vals, most_common_freqs from pg_stats where tablename = 'contractelements' and attname = 'isactiv';
 n_distinct | most_common_vals |  most_common_freqs
------------+------------------+----------------------
          2 | {Y,N}            | {0.966467,0.0335333}
(1 row)

sd-beta=> explain analyze select 1 from contractelements where isActiv = 'Y';
                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Seq Scan on contractelements  (cost=0.00..4963.76 rows=88742 width=0) (actual time=0.014..137.930 rows=88838 loops=1)
   Filter: ((isactiv)::text = 'Y'::text)
 Total runtime: 153.543 ms
(3 rows)

The query planner estimates that isActiv = 'Y' will match 88742 rows. This is
reasonable (91821 * 0.966467 = 88741.966407) and correct.
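[Editorial note: as a sanity check, the planner's estimate can be reproduced
from the statistics shown above. A minimal sketch, using only the numbers
from this mail:]

```python
# Reproduce the planner's MCV-based row estimate for isActiv = 'Y'.
n_tuples = 91821      # n_tup_ins - n_tup_del from pg_stat_user_tables
freq_y = 0.966467     # most_common_freqs entry for the MCV 'Y'

estimated_rows = int(round(n_tuples * freq_y))
print(estimated_rows)   # 88742, matching rows=88742 in the EXPLAIN output
```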
However, the following case causes some trouble:

sd-beta=> explain analyze select 1 from contractelements where upper(isActiv) = 'Y';
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Seq Scan on contractelements  (cost=0.00..5193.32 rows=459 width=0) (actual time=0.030..198.493 rows=88838 loops=1)
   Filter: (upper((isactiv)::text) = 'Y'::text)
 Total runtime: 214.035 ms
(3 rows)

Here we match on upper(isActiv) = 'Y' (which is totally braindead, but the
query is auto-generated by a customer-supplied application, so I cannot
change it). Shouldn't the query planner apply upper(isActiv) to both values
in pg_stats and so come to the same conclusion as in the first case? It
doesn't. Led by this misapprehension, the query planner generates pretty
creative, but unfortunately very suboptimal, query plans.

A 'CREATE INDEX clifford_temp ON contractelements ( upper(isActiv) )'
followed by an 'ANALYZE contractelements' solves the problem in this
particular case. But this is not a solution to the problem in general.

Shouldn't the query planner be able to do the right thing without the index?
Where does the magic 'rows=459' come from?

yours,
 - clifford

-- 
: Clifford Wolf                               Tel +43-1-8178292-00 :
: LINBIT Information Technologies GmbH        Fax +43-1-8178292-82 :
: Schönbrunnerstr 244, 1120 Vienna, Austria   http://www.linbit.com :
Hi,

On Tuesday 31 January 2006 18:59, you wrote:
> Shouldn't the query planner be able to do the right thing without the
> index? Where does the magic 'rows=459' come from?

OK - I've spent some time reading the PostgreSQL sources now. eqsel() uses a
selectivity of DEFAULT_EQ_SEL (0.005) for all expressions containing
functions. Since our query has three such equalities AND'ed together, this
gives a selectivity of 0.000000125 instead of 0.9. That causes PostgreSQL to
create a query plan which runs approximately 8 hours instead of less than a
second.

I've now created a combined expression index for my case so the query
planner can check the selectivity. This is huge overkill and there is a lot
of room for improvement.

As a last resort for such cases it would be good to be able to hardcode
selectivities in the SQL statements. Something like:

SELECT ... FROM ... WHERE
  con.ccu_objid IN (...) AND
  cel.isActiv = 'Y' AND (
    upper(coalesce(dev.isActiv,'Y')) = 'Y' AND
    upper(coalesce(dev.IsCommittedSP,'Y')) = 'Y' AND
    upper(coalesce(dev.IsCommittedCust,'Y')) = 'Y'
  ) WITH SELECTIVITY 0.9 AND
  loc.shortName = '5195'

However, it would be great to have get_restriction_variable() and
get_attstatsslot() extended so they can pass the most common values from the
statistics cache thru expressions, as described in my earlier mail.

yours,
 - clifford
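[Editorial note: the arithmetic behind these numbers is easy to check. A
sketch, assuming only the DEFAULT_EQ_SEL value and row count quoted in the
thread:]

```python
# How the flat DEFAULT_EQ_SEL default produces the estimates discussed above.
DEFAULT_EQ_SEL = 0.005   # planner default selectivity for f(x) = const

# The 'magic' rows=459 from the earlier EXPLAIN output:
n_tuples = 91821
print(int(n_tuples * DEFAULT_EQ_SEL))   # 459

# Three such clauses AND'ed together multiply out to:
combined = DEFAULT_EQ_SEL ** 3
print(combined)   # ~1.25e-07, instead of the real ~0.9
```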
Clifford Wolf <clifford.wolf@linbit.com> writes:
> However, it would be great to have get_restriction_variable() and
> get_attstatsslot() extended so they can pass the most common values
> from the statistics cache thru expressions, as described in my earlier
> mail.

This would only be helpful if the most-common-values list describes
practically all of the column population, which isn't typically the case.

In any case I'm not sure why you're resistant to maintaining an index on an
expression that you are frequently querying by --- that index could have
more direct use than just cueing ANALYZE what to figure statistics on.

			regards, tom lane
Hi,

On Wednesday 01 February 2006 18:19, you wrote:
> Clifford Wolf <clifford.wolf@linbit.com> writes:
> > However, it would be great to have get_restriction_variable() and
> > get_attstatsslot() extended so they can pass the most common values
> > from the statistics cache thru expressions, as described in my earlier
> > mail.
>
> This would only be helpful if the most-common-values list describes
> practically all of the column population, which isn't typically the case.

Not more than is already the case for the simple 'variable = const'. (Or am
I looking at the wrong eqsel() function?)

Even when there is no match in the most-common-values list, the list can be
used to determine a more realistic selectivity than DEFAULT_EQ_SEL (as is
already done in the 'variable = const' case now).

For linear functions it would even be possible to use the histogram_bounds
to get a good idea of the selectivity. But that is an optimization that is
not even implemented for the simple 'variable = const' case yet.

> In any case I'm not sure why you're resistant to maintaining an index
> on an expression that you are frequently querying by --- that index
> could have more direct use than just cueing ANALYZE what to figure
> statistics on.

In this case, definitely not. The only effect that index has is to show the
query planner that it is a bad idea to use the indexed expression as an
inner clause. In the good query plans that expression is evaluated on the
heap in a sequential scan (reducing a set of ~10 tuples to ~8 tuples). The
index is never used.

And this index only solves this one extreme case; there are hundreds of
queries with very suboptimal query plans because of this. I don't want to
create hundreds of indexes just to make sure that the index is not used.

.. I would do that (auto-generate hundreds of indexes from our slow-query
log) if there were some kind of semi-index type which just collects
statistics on ANALYZE.
But AFAIK this is not possible with PostgreSQL right now. Creating that many
real indexes would quite certainly slow down inserts and updates so much
that it is better to live with bizarre query plans for the selects.

yours,
 - clifford
Clifford Wolf <clifford.wolf@linbit.com> writes:
>> This would only be helpful if the most-common-values list describes
>> practically all of the column population, which isn't typically the case.

> Not more than it is the case already for the simple 'variable = const'.

No, because while you may not know what the values are that aren't in the
MCV list, you do know that none of them are equal to any of the values in
the MCV list (according to the equality operator used to develop the list,
anyway). This simple bit of logic breaks down as soon as you are considering
f(x) rather than x, because it's entirely possible that f(x) = f(y) when
x != y. Therefore, I don't think it's valid to infer that the MCV list of
the function values would equal the function computed over the variable's
MCV values.

Also, what happens when there's more than one variable used in the
expression? It'll be expensive to compute the expression over the cartesian
product of the MCV lists, and logically dubious anyway because any such
calculation would have to assume that the variable values are statistically
independent, which they likely aren't.

> .. I would do that (auto-generate hundrets of indexes from our slow-query
> log) when there would be some kind of semi-index type which just collects
> statistics on ANALYZE.

Yeah, I've toyed with some such idea too, though I don't think of it as an
index --- just some way to tell ANALYZE that you'd like it to track
statistics for thus-and-such expressions.

			regards, tom lane
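[Editorial note: the f(x) = f(y) objection is easy to see with a toy
example. The column values and frequencies below are hypothetical, not from
the thread's data; a sketch only:]

```python
from collections import defaultdict

# Hypothetical MCV list for a column x whose values vary in case.
# The listed frequencies cover only 95% of the rows; the rest is unknown.
mcv = {'Y': 0.50, 'y': 0.30, 'N': 0.15}

# Naively pushing the MCVs through upper() merges distinct entries:
f_mcv = defaultdict(float)
for value, freq in mcv.items():
    f_mcv[value.upper()] += freq   # 'Y' and 'y' collapse into one bucket

for value in sorted(f_mcv):
    print(value, round(f_mcv[value], 6))   # N 0.15, then Y 0.8
```

In general the planner cannot know the full preimage of each result value,
so values outside the MCV list may also map onto an MCV result; frequencies
derived this way are lower bounds at best, which is the objection above.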