Thread: IN vs EXISTS equivalence
I've requested this before without response, but I'm asking again because it just caused me pain again: could we get a TODO added to have the planner recognize equivalent IN and EXISTS constructs and have them compete on cost estimates? I know it's not a trivial improvement, but if it's on the list maybe someone will pick it up, and I see it as the single biggest weakness in PostgreSQL performance. I don't need help resolving this particular case, because the fix is always blinding obvious when we hit this, and it doesn't even break portability because no other database we've tested fails to recognize these equivalent cases. step=# explain DELETE FROM "Body" WHERE "bodySeqNo" NOT IN (SELECT "bodySeqNo" FROM "Message"); QUERY PLAN ----------------------------------------------------------------------------------Seq Scan on "Body" (cost=90277.43..285235351699.39rows=3313379 width=6) Filter: (NOT (subplan)) SubPlan -> Materialize (cost=90277.43..159793.40rows=6627957 width=11) -> Seq Scan on "Message" (cost=0.00..80413.07 rows=6627957 width=11) (5 rows) step=# explain DELETE FROM "Body" WHERE NOT EXISTS (SELECT * FROM "Message" m WHERE m."bodySeqNo" = "Body"."bodySeqNo"); QUERY PLAN --------------------------------------------------------------------------------------------Seq Scan on "Body" (cost=0.00..3401760.88rows=3313416 width=6) Filter: (NOT (subplan)) SubPlan -> Index Scan using "Message_Body" on "Message"m (cost=0.00..0.49 rows=1 width=136) Index Cond: (("bodySeqNo")::numeric = ($0)::numeric) (5 rows) The bodySeqNo column is NOT NULL in both tables, and is the primary key in the Body table. The Message table has a non-unique index on it. (\d lists will follow at the bottom.) I cancelled the first query after it had been running for 54 hours over our slowest hours (the weekend). The second form ran in four minutes in competition with peak time queries. -Kevin step=# \d "Body" Table "public.Body" Column | Type | Modifiers -------------+------------------------+-----------bodySeqNo | "SequenceT" | not nullcontentType | charactervarying(255) | not nullencoding | character varying(255) |body | "BodyT" | Indexes: "Body_pkey" PRIMARY KEY, btree ("bodySeqNo") step=# \d "Message" Table "public.Message" Column | Type | Modifiers -----------------+--------------------------+-----------messageId | "SequenceT" | not nullclientMessageId| "ClientMessageIdT" | not nullcorrelationId | "SequenceT" |destQueue | "QueueNameT" | not nullreplyToQueue | "QueueNameT" | not nulltypeCode | character(2) |expiration | timestamp with time zone |priority | smallint | not nullstatus | character(2) | not nullcreated | timestamp with time zone | not nulllastModified | timestamp withtime zone | not nullbodySeqNo | "SequenceT" | not nullmessageIdSearch | "PrioritySequenceT" |not null Indexes: "Message_pkey" PRIMARY KEY, btree ("messageId") "MessageIndex2" UNIQUE, btree ("destQueue", "clientMessageId") "Message_MessageIdSearch" UNIQUE, btree ("destQueue", status, "messageIdSearch") CLUSTER "Message_Body"btree ("bodySeqNo") "Message_Created" btree ("destQueue", status, created) "Message_Created2" btree ("destQueue",created) "Message_Expiration" btree (expiration) "Message_LastModified" btree ("destQueue", "lastModified") "Message_ReplyToQueue" btree ("replyToQueue") Foreign-key constraints: "Message_fk1" FOREIGN KEY ("destQueue") REFERENCES "Queue"(name) "Message_fk2" FOREIGN KEY ("replyToQueue")REFERENCES "Queue"(name)
On Mon, 2007-10-22 at 09:31 -0500, Kevin Grittner wrote: > I've requested this before without response, but I'm asking again > because it just caused me pain again: could we get a TODO added to > have the planner recognize equivalent IN and EXISTS constructs and > have them compete on cost estimates? I know it's not a trivial > improvement, but if it's on the list maybe someone will pick it up, > and I see it as the single biggest weakness in PostgreSQL > performance. I'll pick it up as a default unless someone requests they have it from me. -- Simon Riggs 2ndQuadrant http://www.2ndQuadrant.com
>>> On Mon, Oct 22, 2007 at 1:30 PM, in message <1193077831.4319.61.camel@ebony.site>, Simon Riggs <simon@2ndquadrant.com> wrote: > On Mon, 2007-10-22 at 09:31 -0500, Kevin Grittner wrote: >> I've requested this before without response, but I'm asking again >> because it just caused me pain again: could we get a TODO added to >> have the planner recognize equivalent IN and EXISTS constructs and >> have them compete on cost estimates? I know it's not a trivial >> improvement, but if it's on the list maybe someone will pick it up, >> and I see it as the single biggest weakness in PostgreSQL >> performance. > > I'll pick it up as a default unless someone requests they have it from > me. Thanks, Simon. One more logically equivalent, PostgreSQL-specific form which costs out even better was suggested off-list: step=# explain DELETE FROM "Body" USING "Message" WHERE "Message"."bodySeqNo" = "Body"."bodySeqNo"; QUERY PLAN --------------------------------------------------------------------------------------------------Merge Join (cost=0.00..696766.20rows=4048543 width=6) Merge Cond: (("Body"."bodySeqNo")::numeric = ("Message"."bodySeqNo")::numeric) -> Index Scan using "Body_pkey" on "Body" (cost=0.00..326108.11 rows=4048543 width=18) -> Index Scan using "Message_Body" on "Message" (cost=0.00..310085.16 rows=4048847 width=12) (4 rows) If both of the other syntaxes could compete against that, it would be fantastic. (If that's feasible.) -Kevin
>>> On Mon, Oct 22, 2007 at 4:37 PM, in message <471CD1BE.EE98.0025.0@wicourts.gov>, "Kevin Grittner" <Kevin.Grittner@wicourts.gov> wrote: > One more logically equivalent, PostgreSQL-specific form which > costs out even better was suggested off-list: Oops. That is not logically equivalent. We want to delete WHERE NOT EXISTS; the logic of that suggestion is backwards. Disregard that last post, please. -Kevin
2007/10/23, Kevin Grittner <Kevin.Grittner@wicourts.gov>: > >>> On Mon, Oct 22, 2007 at 4:37 PM, in message > <471CD1BE.EE98.0025.0@wicourts.gov>, "Kevin Grittner" > <Kevin.Grittner@wicourts.gov> wrote: > > > One more logically equivalent, PostgreSQL-specific form which > > costs out even better was suggested off-list: > > Oops. That is not logically equivalent. We want to delete WHERE NOT > EXISTS; the logic of that suggestion is backwards. > > Disregard that last post, please. > > -Kevin > > my mistake, sorry Pavel
>>> On Mon, Oct 22, 2007 at 5:04 PM, in message <471CD819.EE98.0025.0@wicourts.gov>, "Kevin Grittner" > Oops. That is not logically equivalent. We want to delete WHERE NOT > EXISTS; the logic of that suggestion is backwards. > > Disregard that last post, please. Maybe that last post shouldn't be totally disregarded -- it wouldn't be a bad idea to support a Merge NOT IN Join if it the effort isn't out of line with the benefit. Pavel suggested a clever kludge to accomplish this, which costs out better than anything else I've tried: step=# explain DELETE FROM "Body" step-# WHERE "bodySeqNo" IN (SELECT "Body"."bodySeqNo" step(# FROM "Body" step(# LEFT JOIN "Message" step(# ON "Body"."bodySeqNo" = "Message"."bodySeqNo" step(# WHERE "Message"."bodySeqNo" IS NULL); QUERYPLAN --------------------------------------------------------------------------------------------------------------Merge IN Join (cost=825315.30..1265285.81 rows=2010418 width=6) Merge Cond: ((public."Body"."bodySeqNo")::numeric = (public."Body"."bodySeqNo")::numeric) -> Index Scan using "Body_pkey" on "Body" (cost=0.00..383702.32 rows=4020835 width=18) -> Materialize (cost=825315.30..846401.18 rows=2010418 width=12) -> Merge Left Join (cost=0.00..822323.18rows=2010418 width=12) Merge Cond: ((public."Body"."bodySeqNo")::numeric = ("Message"."bodySeqNo")::numeric) Filter: ("Message"."bodySeqNo" IS NULL) -> Index Scan using "Body_pkey"on "Body" (cost=0.00..383702.32 rows=4020835 width=12) -> Index Scan using "Message_Body" on "Message" (cost=0.00..378901.17 rows=4021733 width=12) (9 rows) Just some ideas to look at while you're "in the neighborhood." -Kevin
>>> On Mon, Oct 22, 2007 at 1:30 PM, Simon Riggs wrote: > On Mon, 2007-10-22 at 09:31 -0500, Kevin Grittner wrote: >> I've requested this before without response, but I'm asking again >> because it just caused me pain again: could we get a TODO added to >> have the planner recognize equivalent IN and EXISTS constructs and >> have them compete on cost estimates? I know it's not a trivial >> improvement, but if it's on the list maybe someone will pick it up, >> and I see it as the single biggest weakness in PostgreSQL >> performance. > > I'll pick it up as a default unless someone requests they have it from > me. Since Simon's focus has shifted to other issues, I'm hoping this can go onto the TODO list. I'm adding some NOT EXISTS examples to the thread for completeness of what someone might want to address while working on it. For two queries which can easily be shown (to a human viewer, anyway) to return identical results, I see performance differences of over five orders of magnitude. (Six if you compare to the LEFT JOIN ... WHERE not_null_right_column IS NULL trick.) Below are the cost estimates of the three techniques for a medium-sized table joining to a large table, and for a large table joining to a small table. The IN behavior has the worst worst-case behavior, at least in queries that I've run, although many people report that it is usually faster. The technique of doing an existence test with a LEFT JOIN and then checking whether a NOT NULL column from the right-hand table is null is often faster than either technique, and seldom much worse than the best technique for any given test. Queries and plans attached. Summary of costs below, in millions of cost units. (Fractions of a million discarded.) NOT IN (independent_subquery) 19745843, 5 WHERE NOT EXISTS 74, 318 LEFT JOIN WHERE not_null_right_column IS NULL 10, 17 These cost estimates tend to come out in pretty consistent ratio to the actual run times. -Kevin
Attachment
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > I'm adding some NOT EXISTS examples to the thread for completeness of > what someone might want to address while working on it. For two > queries which can easily be shown (to a human viewer, anyway) to > return identical results, I see performance differences of over five > orders of magnitude. Could we see EXPLAIN ANALYZE not just EXPLAIN for these? When people are complaining of bad planner behavior, I don't find bare EXPLAIN output to be very convincing. regards, tom lane
>>> On Mon, Aug 4, 2008 at 6:48 PM, in message <15026.1217893692@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> I'm adding some NOT EXISTS examples to the thread for completeness of >> what someone might want to address while working on it. For two >> queries which can easily be shown (to a human viewer, anyway) to >> return identical results, I see performance differences of over five >> orders of magnitude. > > Could we see EXPLAIN ANALYZE not just EXPLAIN for these? When people > are complaining of bad planner behavior, I don't find bare EXPLAIN > output to be very convincing. I'll give it a shot. I've never had the patience to let the one with the cost five or six orders of magnitude higher than the others run to completion, but I've started the lot of 'em. -Kevin
>>> On Mon, Aug 4, 2008 at 6:48 PM, in message <15026.1217893692@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote: > "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >> I'm adding some NOT EXISTS examples to the thread for completeness of >> what someone might want to address while working on it. For two >> queries which can easily be shown (to a human viewer, anyway) to >> return identical results, I see performance differences of over five >> orders of magnitude. > > Could we see EXPLAIN ANALYZE not just EXPLAIN for these? When people > are complaining of bad planner behavior, I don't find bare EXPLAIN > output to be very convincing. The other five queries have a cost to millisecond ratio of between 9.8 and 267. If the expensive one falls in the same range, it will run for 2.3 to 64 years. I know I have left it running for days before without completion. I don't think I can devote the resources to it. Attached are the EXPLAIN ANALYZE output for the other five. -Kevin
Attachment
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: >>> I'm adding some NOT EXISTS examples to the thread for completeness >>> of what someone might want to address while working on it. For two >>> queries which can easily be shown (to a human viewer, anyway) to >>> return identical results, I see performance differences of over >>> five orders of magnitude. I've been looking at this a bit and have an action plan worked out. I believe that the optimizable cases for EXISTS are those where the EXISTS() is either at the top level of WHERE, or just underneath a NOT, and the contained subselect: * has no set operations (UNION etc), grouping, set-returning functions in the SELECT list, LIMIT, or a few other funny cases * references outer-level variables only in its WHERE clause. Given these conditions, an EXISTS is equivalent to a standard semijoin between the outer relations used in its WHERE clause and the sub-select with the WHERE removed, where the join condition is just the WHERE clause. (A semijoin returns one copy of each LHS row for which there exists at least one RHS row satisfying the join condition.) Similarly, a NOT EXISTS is equivalent to a standard anti-semijoin. (An anti-semijoin returns one copy of each LHS row for which there exists no RHS row satisfying the join condition.) So what I think we should do is implement JOIN_SEMI and JOIN_ANTI as variant outer-join types in the planner and executor, and convert EXISTS into those. JOIN_SEMI would replace the existing JOIN_IN support. (It's effectively the same thing so far as the executor is concerned, but there are some places in the planner that assume an IN join condition consists of one or more btree equality operators. I'm envisioning folding the current planner support for IN into the more general outer-join planning infrastructure, and fixing it so that it doesn't assume the join clause must be of that form.) Explicit support for JOIN_ANTI is probably not strictly necessary: we could represent it using the "LEFT JOIN WHERE rhs-variable IS NULL" hack. However this only works if the join's ON-condition can be proved strict for at least one RHS variable, which is an assumption I'd rather not require for optimizing EXISTS. Also, the planner should be able to make much better estimates of the size of an antijoin result if it has an explicit concept that that's what's happening. (The existing kluge in nulltestsel() is not only ugly but has little hope of ever giving an accurate estimate.) So I'd prefer to go the other way: make the planner recognize the IS NULL trick and remove the IS NULL clause in favor of marking the LEFT JOIN as being really an antijoin. As far as IN goes: IN at the top level of WHERE can still be optimized to a semijoin under the same conditions as now. NOT IN is a lot trickier, for the same reason that typically trips up novices who try to use it: if any row of the subselect produces a NULL comparison result, then it is impossible for the NOT IN to result in TRUE, which means that it does not function as a standard antijoin. I thought about optimizing it only in the case where we can prove that the subselect outputs and the compared-to values are known NOT NULL (which in typical cases we could prove by looking for NOT NULL constraints on those table columns). The trouble with this is that that's not a sufficient condition: you must also assume that the comparison operator involved never yields NULL for non-null inputs. That might be okay for btree comparison functions but it's not a very comfy assumption in general; we certainly haven't got any explicit knowledge that any functions are guaranteed to act that way. So this case might be worth doing later but I'm not feeling excited about it. We generally tell people to avoid NOT IN and I'm happy to keep on saying that. Comments? regards, tom lane
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: > I believe that the optimizable cases for EXISTS are those where the > EXISTS() is either at the top level of WHERE, or just underneath a NOT, The rest of the plan makes sense to me, but this part seems narrow. There's probably a good reason for that which is beyond my depth, but attached is a view that is used for calculating statistics for a database which is primarily used for "case management" purposes. If EXISTS could also be optimized in the contexts used there, it would be great. For context, this view references three other non-trivial views (MatterHistSearch, MatterHistStage, & MatterHistStatus), and does perform acceptably, so it's not a matter of complaining if it can't work here, just providing a real-world example of other useful contexts for EXISTS which might be worth covering if possible. This view is used in a large number of queries, mostly either selecting a single date with other selection criteria or counting rows which match a set of matterNo values selected on complex criteria. By the way, this view was totally unusable under 8.2. Showing how it worked under 8.3 was all it took to get them to expedite the upgrade. These views, possible because of improvements in 8.3, saved "countless programmer days" (to quote one of the programmers). -Kevin
Attachment
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I believe that the optimizable cases for EXISTS are those where the >> EXISTS() is either at the top level of WHERE, or just underneath a >> NOT, > The rest of the plan makes sense to me, but this part seems narrow. > There's probably a good reason for that which is beyond my depth, but > attached is a view that is used for calculating statistics for a > database which is primarily used for "case management" purposes. If > EXISTS could also be optimized in the contexts used there, it would be > great. I chewed on that for awhile. We can certainly optimize EXISTS that's appearing in the ON-condition of a regular JOIN, because that's not really semantically different from a WHERE-condition. But I don't think it's going to be reasonable to improve EXISTS in outer-JOIN ON conditions. There are a couple of problems. Consider t1 LEFT JOIN t2 ON (t1.f1 = t2.f2 AND EXISTS(SELECT 1 FROM t3 WHERE t3.f3 = t1.fx AND t3.f4 = t2.fy)) To implement this with the correct semantics, we'd have to have the t1/t2 join and the t1/t2/t3 join going on in the same execution node, with two different join behaviors (LEFT and SEMI). There isn't any way AFAICS to factor it into two separate steps. That's unreasonably complicated, and it's not clear that you'd get any better performance anyway than the current implementation (which'd treat the EXISTS as a subplan). Even worse, let the EXISTS be a degenerate case: t1 LEFT JOIN t2 ON (t1.f1 = t2.f2 AND EXISTS(SELECT 1 FROM t3 WHERE t3.f3 = t1.fx)); We can't actually treat this EXISTS as a semijoin between t1 and t3, either before or after the LEFT JOIN; because then the behavior would be to drop t1 rows that have no t3 match, which is not what this query specifies. (Note: the other degenerate case, where the EXISTS depends only on t2, *could* be optimized since we could just let the semijoin be performed within the RHS of the LEFT JOIN.) So this is not something I'm going to tackle; at least not this devel cycle. One small step we can take in this direction, though, is to improve the planner's internal handling of the qual conditions for IN and EXISTS. Right now the process is just to throw the sub-select into the main range table and put the IN join conditions into the same place in WHERE that the IN-clause was to start with. The trouble with this is that the distribute_quals_to_rels processing has no idea that there's anything special about the IN join conditions. We got away with that for the limited case of IN clauses at the top level of WHERE, but it's become clear to me over the weekend that this has no hope of working for NOT EXISTS --- since that's effectively an outer join, it absolutely has to have the same kinds of qual-scheduling constraints as ordinary outer joins do. So we need a data structure that distribute_quals_to_rels can work with. What I think needs to happen is that the initial pass that pulls up optimizable IN/EXISTS sub-selects should not merge the SubLink's replacement qual clauses seamlessly, but put them underneath a new node type, say "FlattenedSubLink", that retains knowledge of the join it's representing. The FlattenedSubLink would survive only as far as distribute_quals_to_rels, which would distribute out the contained qual conditions instead of the FlattenedSubLink itself --- but only after marking them properly for the outer-join restrictions. This representation would make it feasible to work with IN/EXISTS that are inside JOIN ON conditions, which the present representation using a single in_info_list really can't do. The semantic issues are still there but at least the representation isn't getting in the way ... regards, tom lane
On Aug 8, 2008, at 3:23 PM, Tom Lane wrote: > * has no set operations (UNION etc), grouping, set-returning functions > in the SELECT list, LIMIT, or a few other funny cases Couldn't union/union all be treated as EXISTS(a) OR EXISTS(b) ... Or am I missing some detail with NULLS? Personally, I'd rather write it as separate EXISTS clauses rather than using UNION, but perhaps others have a different preference... -- Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828
Our Internet connectivity failed as this was being sent. It looks like at least the list didn't get it, so here goes another try. Apologies for any duplication. -Kevin >>> Tom Lane <tgl@sss.pgh.pa.us> wrote: > I chewed on that for awhile. We can certainly optimize EXISTS that's > appearing in the ON-condition of a regular JOIN, because that's not > really semantically different from a WHERE-condition. Good to hear. I thought that might be doable. :-) > But I don't think > it's going to be reasonable to improve EXISTS in outer-JOIN ON > conditions. There are a couple of problems. Consider The discussion did make the difficulties clear. > So this is not something I'm going to tackle; at least not this > devel cycle. Fair enough. > One small step we can take in this direction, though, is to improve the > planner's internal handling of the qual conditions for IN and EXISTS. > Right now the process is just to throw the sub-select into the main > range table and put the IN join conditions into the same place in WHERE > that the IN-clause was to start with. The trouble with this is that the > distribute_quals_to_rels processing has no idea that there's anything > special about the IN join conditions. We got away with that for the > limited case of IN clauses at the top level of WHERE, but it's become > clear to me over the weekend that this has no hope of working for NOT > EXISTS --- since that's effectively an outer join, it absolutely has to > have the same kinds of qual-scheduling constraints as ordinary outer > joins do. So we need a data structure that distribute_quals_to_rels can > work with. What I think needs to happen is that the initial pass that > pulls up optimizable IN/EXISTS sub-selects should not merge the > SubLink's replacement qual clauses seamlessly, but put them underneath a > new node type, say "FlattenedSubLink", that retains knowledge of the > join it's representing. The FlattenedSubLink would survive only as far > as distribute_quals_to_rels, which would distribute out the contained > qual conditions instead of the FlattenedSubLink itself --- but only > after marking them properly for the outer-join restrictions. This > representation would make it feasible to work with IN/EXISTS that are > inside JOIN ON conditions, which the present representation using a > single in_info_list really can't do. The semantic issues are still > there but at least the representation isn't getting in the way ... Just curious, is that something for this cycle, or a TODO item? Thanks for looking at this. The one part I'm not sure about is where the CASE/EXISTS in the SELECT value list fits into this discussion. It seems conceptually similar to the OUTER JOIN, but sort of a special case, so I'm not sure what you had in mind there. -Kevin
"Decibel!" <decibel@decibel.org> writes: > On Aug 8, 2008, at 3:23 PM, Tom Lane wrote: >> * has no set operations (UNION etc), grouping, set-returning functions >> in the SELECT list, LIMIT, or a few other funny cases > > > Couldn't union/union all be treated as > > EXISTS(a) > OR EXISTS(b) Kind of confused by what you mean here. Can you give an example? The usual transformation to consider with UNION is to transform SELECT ... WHERE x OR y into SELECT ...WHERE x UNION ALL SELECT ...WHERE y AND NOT x (modulo handling NULLs properly) -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
Decibel! <decibel@decibel.org> writes: > On Aug 8, 2008, at 3:23 PM, Tom Lane wrote: >> * has no set operations (UNION etc), grouping, set-returning functions >> in the SELECT list, LIMIT, or a few other funny cases > Couldn't union/union all be treated as > EXISTS(a) > OR EXISTS(b) Perhaps, but that would end up defeating the optimization anyway, because as soon as the EXISTS is underneath an OR, it's no longer representing a potential join clause. regards, tom lane
On Aug 11, 2008, at 3:40 PM, Gregory Stark wrote: > "Decibel!" <decibel@decibel.org> writes: >> On Aug 8, 2008, at 3:23 PM, Tom Lane wrote: >>> * has no set operations (UNION etc), grouping, set-returning >>> functions >>> in the SELECT list, LIMIT, or a few other funny cases >> >> >> Couldn't union/union all be treated as >> >> EXISTS(a) >> OR EXISTS(b) > > Kind of confused by what you mean here. Can you give an example? > > The usual transformation to consider with UNION is to transform > > SELECT ... WHERE x OR y > > into > > SELECT ... > WHERE x > UNION ALL > SELECT ... > WHERE y AND NOT x It was a case of the union being inside the EXISTS subquery, ie: WHERE EXISTS (SELECT * FROM a UNION SELECT * FROM b) AFAIK that's identical to WHERE EXISTS (SELECT * FROM a) OR EXISTS (SELECT * FROM b) But as Tom mentioned, as soon as you have it you can't answer it with a simple join, so it doesn't matter... -- Decibel!, aka Jim C. Nasby, Database Architect decibel@decibel.org Give your computer some brain candy! www.distributed.net Team #1828
On Fri, 2008-08-08 at 16:23 -0400, Tom Lane wrote: > NOT IN is a lot trickier, > for the same reason that typically trips up novices who try to use it: > if any row of the subselect produces a NULL comparison result, then it > is impossible for the NOT IN to result in TRUE, which means that it > does not function as a standard antijoin. I thought about optimizing > it only in the case where we can prove that the subselect outputs and > the compared-to values are known NOT NULL (which in typical cases we > could prove by looking for NOT NULL constraints on those table > columns). The trouble with this is that that's not a sufficient > condition: you must also assume that the comparison operator involved > never yields NULL for non-null inputs. That might be okay for btree > comparison functions but it's not a very comfy assumption in general; > we certainly haven't got any explicit knowledge that any functions are > guaranteed to act that way. So this case might be worth doing later > but I'm not feeling excited about it. We generally tell people to > avoid NOT IN and I'm happy to keep on saying that. Just found this comment, after reading what you said on other thread about NOT IN. NOT IN is a serious performance issue for most people. We simply can't say to people "you were told not to". If we can fix it easily for the majority of cases, we should. We can't let the "it won't work in certain cases" reason prevent various optimizations from going in. There are tons of places where we say "XXX needs later improvement" in code comments. So lets do that here also. It certainly wouldn't be the first optimization/feature that went into code in a restricted way that didn't work for all cases: hash joins, ANALYZE, partial indexes etc.. Anybody that is writing complex SQL with user defined operators knows enough to re-write their queries correctly, so there will be almost no negative effect from making the NOT IN optimisation a special case. And if there is an effect, the people effected can fix the problem. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
If you're still interested in testing CVS HEAD's handling of EXISTS, I've about finished what I wanted to do with it. regards, tom lane
Hello I did some fast test on pagila database. 8.4 postgres=# explain analyze select * from film f where exists (select film_id from film_actor where f.film_id = film_id); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------Hash Join (cost=117.01..195.42 rows=966 width=390) (actual time=36.011..43.314 rows=997 loops=1) Hash Cond: (f.film_id = film_actor.film_id) -> Seq Scan on film f (cost=0.00..65.00rows=1000 width=390) (actual time=0.027..1.971 rows=1000 loops=1) -> Hash (cost=104.94..104.94 rows=966 width=2) (actual time=35.886..35.886 rows=997 loops=1) -> HashAggregate (cost=95.28..104.94 rows=966 width=2) (actual time=32.650..34.139 rows=997 loops=1) -> Seq Scan on film_actor (cost=0.00..81.62 rows=5462 width=2) (actual time=0.081..14.232 rows=5462 loops=1)Total runtime: 45.373 ms (7 rows) 8.3 postgres=# explain select * from film f where exists (select film_id from film_actor where f.film_id = film_id); QUERY PLAN ------------------------------------------------------------------------------------------Seq Scan on film f (cost=0.00..4789.34rows=500 width=390) Filter: (subplan) SubPlan -> Index Scan using idx_fk_film_id on film_actor (cost=0.00..28.35 rows=6 width=2) Index Cond: ($0 = film_id) (5 rows) postgres=# explain analyze select * from film f where not exists (select film_id from film_actor where f.film_id = film_id); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------Hash AntiJoin (cost=149.90..240.24 rows=34 width=390) (actual time=25.473..28.169 rows=3 loops=1) Hash Cond: (f.film_id = film_actor.film_id) -> Seq Scan on film f (cost=0.00..65.00rows=1000 width=390) (actual time=0.027..1.898 rows=1000 loops=1) -> Hash (cost=81.62..81.62 rows=5462 width=2) (actual time=24.398..24.398 rows=5462 loops=1) -> Seq Scan on film_actor (cost=0.00..81.62 rows=5462 width=2) (actual time=0.035..12.400 rows=5462 loops=1)Total runtime: 28.866 ms postgres=# explain analyze select * from film f where not exists (select film_id from film_actor where f.film_id = film_id); QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------------------Seq Scanon film f (cost=0.00..4789.34 rows=500 width=390) (actual time=5.874..22.654 rows=3 loops=1) Filter: (NOT (subplan)) SubPlan -> Index Scan using idx_fk_film_id on film_actor (cost=0.00..28.35 rows=6 width=2) (actual time=0.016..0.016 rows=1 loops=1000) Index Cond: ($0 = film_id)Total runtime: 22.835 ms (6 rows) postgres=# explain analyze select * from film f where film_id in (select film_id from film_actor); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------Hash Join (cost=117.01..195.42 rows=966 width=390) (actual time=43.151..53.688 rows=997 loops=1) Hash Cond: (f.film_id = film_actor.film_id) -> Seq Scan on film f (cost=0.00..65.00rows=1000 width=390) (actual time=0.021..6.765 rows=1000 loops=1) -> Hash (cost=104.94..104.94 rows=966 width=2) (actual time=43.091..43.091 rows=997 loops=1) -> HashAggregate (cost=95.28..104.94 rows=966 width=2) (actual time=34.754..36.275 rows=997 loops=1) -> Seq Scan on film_actor (cost=0.00..81.62 rows=5462 width=2) (actual time=0.024..15.746 rows=5462 loops=1)Total runtime: 55.291 ms postgres=# explain analyze select * from film f where film_id in (select film_id from film_actor); QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------Nested LoopIN Join (cost=0.00..175.25 rows=986 width=390) (actual time=0.090..14.272 rows=997 loops=1) -> Seq Scan on film f (cost=0.00..65.00 rows=1000 width=390) (actual time=0.014..1.877 rows=1000 loops=1) -> Index Scan using idx_fk_film_id on film_actor (cost=0.00..0.54 rows=6 width=2) (actual time=0.007..0.007 rows=1 loops=1000) Index Cond: (film_actor.film_id = f.film_id)Total runtime:15.902 ms (5 rows) 8.4 postgres=# explain analyze select * from film f where film_id not in (select film_id from film_actor); QUERY PLAN --------------------------------------------------------------------------------------------------------------------Seq Scanon film f (cost=95.28..162.78 rows=500 width=390) (actual time=24.324..26.015 rows=3 loops=1) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on film_actor (cost=0.00..81.62rows=5462 width=2) (actual time=0.026..12.074 rows=5462 loops=1)Total runtime: 26.201 ms (5 rows) 8.3 postgres=# explain analyze select * from film f where film_id not in (select film_id from film_actor); QUERY PLAN --------------------------------------------------------------------------------------------------------------------Seq Scanon film f (cost=95.28..162.78 rows=500 width=390) (actual time=29.713..30.456 rows=3 loops=1) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on film_actor (cost=0.00..81.62rows=5462 width=2) (actual time=0.051..13.649 rows=5462 loops=1)Total runtime: 30.644 ms (5 rows) 8.4 is 10% faster than 8.3 on small table < 1000 rows. 8.4 has much better prediction. Regards Pavel Stehule 2008/8/17 Tom Lane <tgl@sss.pgh.pa.us>: > If you're still interested in testing CVS HEAD's handling of EXISTS, > I've about finished what I wanted to do with it. > > regards, tom lane > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
>>> On Sun, Aug 17, 2008 at 4:29 PM, in message <27116.1219008559@sss.pgh.pa.us>, Tom Lane <tgl@sss.pgh.pa.us> wrote: > If you're still interested in testing CVS HEAD's handling of EXISTS, > I've about finished what I wanted to do with it. Thanks. I'm very interested; unfortunately I can't get to it until at least Friday. -Kevin
On Thu, Aug 14, 2008 at 06:50:09PM +0100, Simon Riggs wrote: > > On Fri, 2008-08-08 at 16:23 -0400, Tom Lane wrote: > > > NOT IN is a lot trickier, > > condition: you must also assume that the comparison operator involved > > never yields NULL for non-null inputs. That might be okay for btree > > comparison functions but it's not a very comfy assumption in general; > > we certainly haven't got any explicit knowledge that any functions are > > guaranteed to act that way. So this case might be worth doing later ... > Just found this comment, after reading what you said on other thread > about NOT IN. > > NOT IN is a serious performance issue for most people. We simply can't > say to people "you were told not to". > > If we can fix it easily for the majority of cases, we should. We can't > let the "it won't work in certain cases" reason prevent various A suggestion: what about adding an attribute to functions to declare that they never return null? declare foo(int, int) returns int immutable not null as ... -dg -- David Gould daveg@sonic.net 510 536 1443 510 282 0869 If simplicity worked, the world would be overrun with insects.
On Wed, Sep 3, 2008 at 9:17 AM, daveg <daveg@sonic.net> wrote:
On Thu, Aug 14, 2008 at 06:50:09PM +0100, Simon Riggs wrote:
>
> On Fri, 2008-08-08 at 16:23 -0400, Tom Lane wrote:
>
> > NOT IN is a lot trickier,> > condition: you must also assume that the comparison operator involved...
> > never yields NULL for non-null inputs. That might be okay for btree
> > comparison functions but it's not a very comfy assumption in general;
> > we certainly haven't got any explicit knowledge that any functions are
> > guaranteed to act that way. So this case might be worth doing later> Just found this comment, after reading what you said on other threadA suggestion: what about adding an attribute to functions to declare that
> about NOT IN.
>
> NOT IN is a serious performance issue for most people. We simply can't
> say to people "you were told not to".
>
> If we can fix it easily for the majority of cases, we should. We can't
> let the "it won't work in certain cases" reason prevent various
they never return null?
And if function still returns null then error will be raised?
Then you will end up adding NOT NULL also to IN and OUT parameters.
IIRC it was possible in Oracle to declare local variables NOT NULL.
Then you will end up adding NOT NULL also to IN and OUT parameters.
IIRC it was possible in Oracle to declare local variables NOT NULL.
declare foo(int, int) returns int immutable not null as ...
-dg
--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: > If you're still interested in testing CVS HEAD's handling of EXISTS, > I've about finished what I wanted to do with it. It's been hectic here, but I've managed to let some stuff run in the background using an old test case from here: http://archives.postgresql.org/pgsql-hackers/2007-03/msg01408.php explain analyze SELECT "A"."adjustmentNo", "A"."tranNo", "A"."countyNo", "H"."date", "H"."userId", "H"."time" FROM "Adjustment" "A" JOIN "TranHeader" "H" ON ("H"."tranId" = "A"."adjustmentNo" AND "H"."countyNo" = "A"."countyNo" AND "H"."tranNo" = "A"."tranNo") WHERE "H"."tranType" = 'A' AND "A"."date" > DATE '2006-01-01' AND "H"."countyNo" = 66 AND "A"."countyNo" = 66 AND EXISTS ( SELECT 1 FROM "TranDetail" "D" WHERE "D"."tranNo" = "H"."tranNo" AND "D"."countyNo" = "H"."countyNo" AND "D"."caseNo"LIKE '2006TR%' ) ; On development machine using 8.3.3:Nested Loop (cost=0.00..399190.49 rows=1 width=37) (actual time=7184.068..3249391.592 rows=12372 loops=1) Join Filter: (("A"."adjustmentNo")::text = ("H"."tranId")::text) -> SeqScan on "Adjustment" "A" (cost=0.00..5218.87 rows=247869 width=17) (actual time=9.804..1695.691 rows=248674 loops=1) Filter: (((date)::date > '2006-01-01'::date) AND (("countyNo")::smallint = 66)) -> Index Scan using "TranHeader_pkey" on "TranHeader" "H" (cost=0.00..1.57 rows=1 width=37) (actual time=13.056..13.056 rows=0 loops=248674) Index Cond: ((("H"."tranNo")::integer = ("A"."tranNo")::integer) AND (("H"."countyNo")::smallint = 66)) Filter: ((("H"."tranType")::text = 'A'::text) AND(subplan)) SubPlan -> Index Scan using "TranDetail_TranDetCaseNo" on "TranDetail" "D" (cost=0.00..1.29 rows=1 width=0) (actual time=13.017..13.017 rows=0 loops=248674) Index Cond: ((("caseNo")::text >= '2006TR'::text) AND (("caseNo")::text < '2006TS'::text) AND (("tranNo")::integer = ($0)::integer) AND (("countyNo")::smallint = ($1)::smallint)) Filter: (("caseNo")::text ~~ '2006TR%'::text)Totalruntime: 3249404.662 ms On the same machine, using the snapshot from this morning:Nested Loop (cost=1963.24..38483.54 rows=1 width=37) (actual time=372.964..986.994 rows=12372 loops=1) Join Filter: (("H"."tranNo")::integer = ("A"."tranNo")::integer) -> Merge SemiJoin (cost=1963.24..31012.28 rows=21317 width=37) (actual time=372.926..839.298 rows=12372 loops=1) Merge Cond: (("H"."tranNo")::integer = ("D"."tranNo")::integer) Join Filter: (("D"."countyNo")::smallint = ("H"."countyNo")::smallint) -> Index Scan using "TranHeader_pkey" on "TranHeader" "H" (cost=0.00..27848.57 rows=322517 width=37) (actual time=3.722..526.124 rows=311963 loops=1 ) Index Cond: (("countyNo")::smallint = 66) Filter: (("tranType")::text = 'A'::text) -> Sort (cost=1963.17..2027.08 rows=25565 width=6) (actual time=171.512..191.688 rows=76597 loops=1) Sort Key: "D"."tranNo" Sort Method: quicksort Memory:6663kB -> Index Scan using "TranDetail_TranDetCaseNo" on "TranDetail" "D" (cost=0.00..91.57 rows=25565 width=6) (actual time=0.031..100.688 rows=7659 7 loops=1) Index Cond: ((("caseNo")::text >= '2006TR'::text) AND (("caseNo")::text < '2006TS'::text) AND (("countyNo")::smallint = 66)) Filter: (("caseNo")::text ~~ '2006TR%'::text) -> Index Scan using "Adjustment_pkey" on "Adjustment""A" (cost=0.00..0.34 rows=1 width=17) (actual time=0.009..0.010 rows=1 loops=12372) Index Cond: ((("A"."adjustmentNo")::text = ("H"."tranId")::text) AND (("A"."countyNo")::smallint = 66)) Filter: (("A".date)::date > '2006-01-01'::date)Totalruntime: 991.097 ms The chosen plan looks very reasonable, and performs very well. Nice! After converting the database I originally forgot to run VACUUM ANALYZE. Even planning "blind" and doing hint-bit rewrites it picked a plan which ran in under 10 seconds. I'll be running other tests as I get the chance. -Kevin