Thread: huge disparities in =/IN/BETWEEN performance
all my SQL-writin' life i have been assuming that expressions like =,
IN, BETWEEN in the WHERE clause are, in the most general sense,
alternative ways of doing the same things. i am hitting some very very
bizarre results in PGSQL: i have a (very involved) view, say v_foo,
largely optimized to be queried by a user_id column, so:

select * from v_foo where user_id = 70728;
Time: 580.620 ms

however an essentially synonymous IN construct takes minutes to
complete (the subquery inside the IN clause will return the 70728 ID
and by itself takes 40ms to complete):

select * from v_foo
where user_id in (
  select user_id from bar group by 1 having count(*) = 10 limit 1);
Time: 244616.464 ms

a synonymous-looking BETWEEN also takes forever:

select * from v_foo where user_id between 70728 and 70728;
Time: 329332.722 ms

there is, admittedly, substantial complexity inside v_foo, with GROUP
BYs on user_id and various subqueries, but my basic thought is that
should not really matter... i am on 8.1.3. i'd like to hope that the
8.2 optimizer improvements might help with this but i haven't tested.

george
"George Pavlov" <gpavlov@mynewplace.com> writes: > there is, admittedly, substantial complexity inside v_foo, with GROUP > BYs on user_id and various subqueries, but my basic thought is that > should not really matter... You're unlikely to get any useful comment on this when you have not shown any of those details, nor even an EXPLAIN. regards, tom lane
> > BYs on user_id and various subqueries, but my basic thought is that
> > should not really matter...
>
> You're unlikely to get any useful comment on this when you have not
> shown any of those details, nor even an EXPLAIN.

yes, i know. i guess i was partially just venting. sorry. the problem
is that the view is very complex and cleansing it for general
consumption and paring it down to some degree of readability is a lot
of work. the basic question i have is fairly clear though: why should
saying "where x = 10" be different (in ANY circumstance, not just
mine) from saying "where x between 10 and 10" or from "where x in
(select ... /* some query that returns 10 */)" ??? i am not really
looking for help optimizing my view and query, more of a general idea
of: should this happen, when might this happen, why is it a good idea
that this happens? is this a better statement of the issue?

thanks for listening!

george
George Pavlov wrote:
> the basic question i have is fairly clear though: why should saying
> "where x = 10" be different (in ANY circumstance, not just mine) from
> saying "where x between 10 and 10" or from "where x in (select ... /*
> some query that returns 10 */)" ???

I think the principle here is that the system is not gonna waste cycles
on dumb queries. Supposedly, morphing "foo BETWEEN 10 and 10" into
"foo=10" is not a trivial transformation, and it'd impose a planning
cost on all non-dumb BETWEEN queries. That cost is best avoided: if you
optimize for the dumb users, the smart users will want you buried,
because you've lost performance doing useless work.

-- 
Alvaro Herrera                http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Alvaro Herrera <alvherre@commandprompt.com> writes:
> I think the principle here is that the system is not gonna waste cycles
> on dumb queries. Supposedly, morphing "foo BETWEEN 10 and 10" into
> "foo=10" is not a trivial transformation, and it'd impose a planning
> cost on all non-dumb BETWEEN queries.

There's a datatype abstraction issue involved: what does it take to
prove that "x >= 10 AND x <= 10" is equivalent to "x = 10"? This
requires a nontrivial amount of knowledge about the operators involved.
We could probably do it for operators appearing in a btree operator
class, but as Alvaro says, it'd be cycles wasted for non-dumb queries.

As for the IN case, I think we do simplify "x IN (one-expression)" to
"x = one-expression", but "x IN (sub-select)" is a whole 'nother
matter, especially when you're comparing it to a case where
one-expression is a constant and so the planner can get good statistics
about how many rows are likely to match.

			regards, tom lane
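Both behaviors Tom describes can be checked with EXPLAIN against a
throwaway table. A minimal sketch (the table and index are made up for
illustration, and the exact plan text varies by version and by whether
the planner picks the index at all):

create table t (x integer);
create index t_x_idx on t (x);

-- BETWEEN is expanded into a pair of comparisons rather than collapsed
-- to equality; the condition typically shows up as something like
--   ((x >= 10) AND (x <= 10))
explain select * from t where x between 10 and 10;

-- a single-item IN list is simplified to plain equality, so the
-- condition typically shows up as (x = 10)
explain select * from t where x in (10);

-- an IN over a sub-select stays a subplan or join, with no constant
-- for the planner to look up statistics against
explain select * from t where x in (select x from t limit 1);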
Hi Tom,

On Thu, 2007-02-08 at 22:50 -0500, Tom Lane wrote:
> There's a datatype abstraction issue involved: what does it take to
> prove that "x >= 10 AND x <= 10" is equivalent to "x = 10"? This
> requires a nontrivial amount of knowledge about the operators involved.
> We could probably do it for operators appearing in a btree operator
> class, but as Alvaro says, it'd be cycles wasted for non-dumb queries.

Are you saying the planner is datatype-agnostic and can't tell that x
is, say, as in the example above, an INTEGER, and therefore cannot
transform one expression into another? What about "x = 10 AND x < 5"?
Can't it reduce that to FALSE?

Joe
Joe <dev@freedomcircle.net> writes:
> Are you saying the planner is datatype-agnostic

Certainly, but your other concerns don't follow from that. The issue
at hand here is whether it's worth expending cycles on every query to
try to detect a situation that only holds for a few.

			regards, tom lane
Hi Tom,

On Thu, 2007-02-08 at 23:24 -0500, Tom Lane wrote:
> Certainly, but your other concerns don't follow from that. The issue
> at hand here is whether it's worth expending cycles on every query to
> try to detect a situation that only holds for a few.

Those were George's concerns, but I was interested in knowing whether
the planner transforms a query in any way to avoid, let's say, useless
execution. George didn't provide the inside of his view, but it's
possible that my earlier example could be rephrased as follows:

create view v_foo as select * from tab where x < 5;

select * from v_foo where x = 10;

Presumably the planner has to transform the SELECT into

select * from tab where x < 5 and x = 10;

or something analogous. This is a simple example, but with complicated
views or even joins and aggregates, the "useless execution" may not be
that obvious to the "naked eye", but it would be to a boolean logic
analyzer. As to whether these query instances are few or typical is
arguable, and will depend on the type of application, the level of
knowledge among users, and what kind of interfaces are used to enter
or generate the queries.

Joe
Joe <dev@freedomcircle.net> writes:
> George didn't provide the inside of his view, but it's
> possible that my earlier example could be rephrased as follows:
> create view v_foo as select * from tab where x < 5;
> select * from v_foo where x = 10;

So try it:

regression=# create table tab (x int);
CREATE TABLE
regression=# create view v_foo as select * from tab where x < 5;
CREATE VIEW
regression=# explain select * from v_foo where x = 10;
                      QUERY PLAN
-----------------------------------------------------
 Seq Scan on tab  (cost=0.00..46.00 rows=4 width=4)
   Filter: ((x < 5) AND (x = 10))
(2 rows)

regression=# set constraint_exclusion to 1;
SET
regression=# explain select * from v_foo where x = 10;
                 QUERY PLAN
--------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=0)
   One-Time Filter: false
(2 rows)

(This is with HEAD, but I think 8.2 can do it too.)

			regards, tom lane
Thanks all for the various useful thoughts. Let me backtrack a bit and
state my real underlying issue with actual examples. Hope not to bore
you with the length of this. Looks to me like an optimizer issue
unless I am missing something. So, suppose I have a query:

select *
from stuff
inner join (
  -- just getting the distinct user-stuff associations
  -- since there may be multiple; ultimately I need to query by user
  select stuff_id, user_id
  from stuff_user
  group by 1,2  -- GROUP BY outperforms DISTINCT
) su using (stuff_id)
left join (
  -- this obtains summary statistics about each stuff item
  select stuff_id, count(*)
  from stuff_events
  group by 1
) se using (stuff_id)
where user_id = 41

This is a very pared down version of what I have. And yes, this
specific query can be rewritten as a single GROUP BY, but in the real
world I am gathering the aggregate statistics from several tables, so I
actually have several sub-recordsets similar to the one called "se"
above. Rewriting ALL those as a single GROUP BY is not feasible. I
know, all this cries for a single summarized rollup table, but let's
not go there (for now).

So running the above is inefficient. This particular user_id has only
one associated stuff_id and does not even have much data for that in
stuff_events. The query runs in ~4600ms. Were I to query by stuff_id
instead, things look great (if I change the where clause to the
stuff_id it runs in 25ms). When I query based on stuff_id the optimizer
uses an index on stuff_events.stuff_id. However, when I query by
user_id it does a Seq Scan on stuff_events. I somehow wish I could tell
the optimizer to first figure out which stuff_ids are related to the
user_id that is being asked for and then look ONLY those up in the
stuff_events table using the index on stuff_id.

It would seem (and this is where we get back to my original question)
that one should be able to just say:

select *
from stuff
left join (
  select stuff_id, count(*)
  from stuff_events
  group by 1
) se using (stuff_id)
where stuff_id in (
  select distinct stuff_id
  from stuff_user
  where user_id = 41
)

You'd think that the subquery in the IN would be (very quickly)
resolved to a list of stuff_ids and that stuff_events would then be
accessed via its stuff_id index. Instead, the Seq Scan on stuff_events
still happens and the query is actually even slower than the original,
running in ~5500ms.

So one (very ugly) way to optimize the first query is to add an extra
join to stuff_user INSIDE the "se" subquery:

select *
from stuff
inner join (
  select stuff_id, user_id
  from stuff_user
  group by 1,2
) su using (stuff_id)
left join (
  select stuff_id, user_id, count(*)
  from stuff_events
  inner join (
    -- same subquery as above
    select stuff_id, user_id
    from stuff_user
    group by 1,2
  ) su2 using (stuff_id)
  group by 1,2
) se using (stuff_id)
where user_id = 41;

This does improve things a lot, bringing the execution time for this
particular user to 3ms (!), but it is quite ugly, and not fast enough
for a user_id with lots of associated stuff_ids.

George
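One 8.1-era workaround worth trying here (an untested sketch, using
George's hypothetical tables): materialize the small set of stuff_ids
into a temporary table first, so the planner sees a plain, analyzable
relation rather than a GROUP BY subquery:

create temporary table tmp_user_stuff as
  select distinct stuff_id from stuff_user where user_id = 41;
analyze tmp_user_stuff;

select *
from stuff
inner join tmp_user_stuff using (stuff_id)
left join (
  -- restrict stuff_events to the relevant ids before aggregating
  select stuff_id, count(*)
  from stuff_events
  where stuff_id in (select stuff_id from tmp_user_stuff)
  group by 1
) se using (stuff_id);

With fresh statistics showing tmp_user_stuff to be tiny, the planner
has a better chance of reaching stuff_events through its stuff_id
index, though whether it actually does so would need checking with
EXPLAIN.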
"George Pavlov" <gpavlov@mynewplace.com> writes: > I somehow wish I could tell the optimizer to > first figure out which stuff_ids are related to the user_id that is > being asked for and then look ONLY those up in the stuff_events table > using the index on stuff_id. This is not really an optimizer problem, or at least not just an optimizer problem. The type of plan I think you are wishing for is what the source code calls a "nestloop with inner index scan", and that terminology should tip you off that it's only considered when the inner relation is just a simple indexscannable table. GROUP BY subqueries need not apply :-(. I've been speculating recently about how this situation might be improved, but I fear it will require nontrivial executor changes along with planner changes. The executor's present mechanism for passing variable values from the outer plan to the inner is a hack that only really works for indexscans. I got it to work for inheritance cases too, recently, but that's about as far as it can be pushed. I think it might be possible to get rid of it and use the more-recently-invented subplan parameter mechanism, but I haven't worked out the details. (And I know that the Greenplum crowd would like to get rid of subplan parameters, so I'm not sure this idea will go over well anyway.) The planner changes needed will be pretty wide-ranging too, likely. This might happen for 8.4 but I wouldn't promise it for 8.3. regards, tom lane