Thread: IN vs EXISTS equivalence

IN vs EXISTS equivalence

From

"Kevin Grittner"

Date:

22 October 2007, 11:31:46

I've requested this before without response, but I'm asking again
because it just caused me pain again: could we get a TODO added to
have the planner recognize equivalent IN and EXISTS constructs and
have them compete on cost estimates?  I know it's not a trivial
improvement, but if it's on the list maybe someone will pick it up,
and I see it as the single biggest weakness in PostgreSQL
performance.
I don't need help resolving this particular case, because the fix is
always blinding obvious when we hit this, and it doesn't even break
portability because no other database we've tested fails to recognize
these equivalent cases.
step=# explain DELETE FROM "Body" WHERE "bodySeqNo" NOT IN (SELECT "bodySeqNo" FROM "Message");
         QUERY PLAN 
----------------------------------------------------------------------------------Seq Scan on "Body"
(cost=90277.43..285235351699.39rows=3313379 width=6)  Filter: (NOT (subplan))  SubPlan    ->  Materialize
(cost=90277.43..159793.40rows=6627957 width=11)          ->  Seq Scan on "Message"  (cost=0.00..80413.07 rows=6627957
width=11)
(5 rows)

step=# explain DELETE FROM "Body" WHERE NOT EXISTS (SELECT * FROM "Message" m WHERE m."bodySeqNo" =
"Body"."bodySeqNo");                                       QUERY PLAN 
--------------------------------------------------------------------------------------------Seq Scan on "Body"
(cost=0.00..3401760.88rows=3313416 width=6)  Filter: (NOT (subplan))  SubPlan    ->  Index Scan using "Message_Body" on
"Message"m  (cost=0.00..0.49 rows=1 width=136)          Index Cond: (("bodySeqNo")::numeric = ($0)::numeric) 
(5 rows)
The bodySeqNo column is NOT NULL in both tables, and is the primary
key in the Body table.  The Message table has a non-unique index on
it.  (\d lists will follow at the bottom.)
I cancelled the first query after it had been running for 54 hours
over our slowest hours (the weekend).  The second form ran in four
minutes in competition with peak time queries.
-Kevin
step=# \d "Body"              Table "public.Body"  Column    |          Type          | Modifiers
-------------+------------------------+-----------bodySeqNo   | "SequenceT"            | not nullcontentType |
charactervarying(255) | not nullencoding    | character varying(255) |body        | "BodyT"                | 
Indexes:   "Body_pkey" PRIMARY KEY, btree ("bodySeqNo")

step=# \d "Message"                Table "public.Message"    Column      |           Type           | Modifiers
-----------------+--------------------------+-----------messageId       | "SequenceT"              | not
nullclientMessageId| "ClientMessageIdT"       | not nullcorrelationId   | "SequenceT"              |destQueue       |
"QueueNameT"            | not nullreplyToQueue    | "QueueNameT"             | not nulltypeCode        | character(2)
         |expiration      | timestamp with time zone |priority        | smallint                 | not nullstatus
  | character(2)             | not nullcreated         | timestamp with time zone | not nulllastModified    | timestamp
withtime zone | not nullbodySeqNo       | "SequenceT"              | not nullmessageIdSearch | "PrioritySequenceT"
|not null 
Indexes:   "Message_pkey" PRIMARY KEY, btree ("messageId")   "MessageIndex2" UNIQUE, btree ("destQueue",
"clientMessageId")  "Message_MessageIdSearch" UNIQUE, btree ("destQueue", status, "messageIdSearch") CLUSTER
"Message_Body"btree ("bodySeqNo")   "Message_Created" btree ("destQueue", status, created)   "Message_Created2" btree
("destQueue",created)   "Message_Expiration" btree (expiration)   "Message_LastModified" btree ("destQueue",
"lastModified")  "Message_ReplyToQueue" btree ("replyToQueue") 
Foreign-key constraints:   "Message_fk1" FOREIGN KEY ("destQueue") REFERENCES "Queue"(name)   "Message_fk2" FOREIGN KEY
("replyToQueue")REFERENCES "Queue"(name)

Re: IN vs EXISTS equivalence

From

Simon Riggs

Date:

22 October 2007, 15:34:32

On Mon, 2007-10-22 at 09:31 -0500, Kevin Grittner wrote:
> I've requested this before without response, but I'm asking again
> because it just caused me pain again: could we get a TODO added to
> have the planner recognize equivalent IN and EXISTS constructs and
> have them compete on cost estimates?  I know it's not a trivial
> improvement, but if it's on the list maybe someone will pick it up,
> and I see it as the single biggest weakness in PostgreSQL
> performance.

I'll pick it up as a default unless someone requests they have it from
me.

--  Simon Riggs 2ndQuadrant  http://www.2ndQuadrant.com

Re: IN vs EXISTS equivalence

From

"Kevin Grittner"

Date:

22 October 2007, 18:37:38

>>> On Mon, Oct 22, 2007 at  1:30 PM, in message
<1193077831.4319.61.camel@ebony.site>, Simon Riggs <simon@2ndquadrant.com>
wrote:
> On Mon, 2007-10-22 at 09:31 -0500, Kevin Grittner wrote:
>> I've requested this before without response, but I'm asking again
>> because it just caused me pain again: could we get a TODO added to
>> have the planner recognize equivalent IN and EXISTS constructs and
>> have them compete on cost estimates?  I know it's not a trivial
>> improvement, but if it's on the list maybe someone will pick it up,
>> and I see it as the single biggest weakness in PostgreSQL
>> performance.
>
> I'll pick it up as a default unless someone requests they have it from
> me.
Thanks, Simon.
One more logically equivalent, PostgreSQL-specific form which
costs out even better was suggested off-list:
step=# explain DELETE FROM "Body" USING "Message" WHERE "Message"."bodySeqNo" = "Body"."bodySeqNo";
                     QUERY PLAN 
--------------------------------------------------------------------------------------------------Merge Join
(cost=0.00..696766.20rows=4048543 width=6)  Merge Cond: (("Body"."bodySeqNo")::numeric =
("Message"."bodySeqNo")::numeric) ->  Index Scan using "Body_pkey" on "Body"  (cost=0.00..326108.11 rows=4048543
width=18) ->  Index Scan using "Message_Body" on "Message"  (cost=0.00..310085.16 rows=4048847 width=12) 
(4 rows)
If both of the other syntaxes could compete against that, it would
be fantastic.  (If that's feasible.)
-Kevin

Re: IN vs EXISTS equivalence

From

"Kevin Grittner"

Date:

22 October 2007, 19:04:51

>>> On Mon, Oct 22, 2007 at  4:37 PM, in message
<471CD1BE.EE98.0025.0@wicourts.gov>, "Kevin Grittner"
<Kevin.Grittner@wicourts.gov> wrote:

> One more logically equivalent, PostgreSQL-specific form which
> costs out even better was suggested off-list:

Oops.  That is not logically equivalent.  We want to delete WHERE NOT
EXISTS; the logic of that suggestion is backwards.

Disregard that last post, please.
-Kevin

Re: IN vs EXISTS equivalence

From

"Pavel Stehule"

Date:

23 October 2007, 04:42:38

2007/10/23, Kevin Grittner <Kevin.Grittner@wicourts.gov>:
> >>> On Mon, Oct 22, 2007 at  4:37 PM, in message
> <471CD1BE.EE98.0025.0@wicourts.gov>, "Kevin Grittner"
> <Kevin.Grittner@wicourts.gov> wrote:
>
> > One more logically equivalent, PostgreSQL-specific form which
> > costs out even better was suggested off-list:
>
> Oops.  That is not logically equivalent.  We want to delete WHERE NOT
> EXISTS; the logic of that suggestion is backwards.
>
> Disregard that last post, please.
>
> -Kevin
>
>
my mistake, sorry

Pavel

Re: IN vs EXISTS equivalence

From

"Kevin Grittner"

Date:

23 October 2007, 13:18:08

>>> On Mon, Oct 22, 2007 at  5:04 PM, in message
<471CD819.EE98.0025.0@wicourts.gov>, "Kevin Grittner"

> Oops.  That is not logically equivalent.  We want to delete WHERE NOT
> EXISTS; the logic of that suggestion is backwards.
>
> Disregard that last post, please.

Maybe that last post shouldn't be totally disregarded -- it wouldn't
be a bad idea to support a Merge NOT IN Join if it the effort isn't
out of line with the benefit.
Pavel suggested a clever kludge to accomplish this, which costs out
better than anything else I've tried:
step=# explain DELETE FROM "Body"
step-#   WHERE "bodySeqNo" IN (SELECT "Body"."bodySeqNo"
step(#                          FROM "Body"
step(#                          LEFT JOIN "Message"
step(#                            ON "Body"."bodySeqNo" = "Message"."bodySeqNo"
step(#                          WHERE "Message"."bodySeqNo" IS NULL);
QUERYPLAN 
--------------------------------------------------------------------------------------------------------------Merge IN
Join (cost=825315.30..1265285.81 rows=2010418 width=6)  Merge Cond: ((public."Body"."bodySeqNo")::numeric =
(public."Body"."bodySeqNo")::numeric) ->  Index Scan using "Body_pkey" on "Body"  (cost=0.00..383702.32 rows=4020835
width=18) ->  Materialize  (cost=825315.30..846401.18 rows=2010418 width=12)        ->  Merge Left Join
(cost=0.00..822323.18rows=2010418 width=12)              Merge Cond: ((public."Body"."bodySeqNo")::numeric =
("Message"."bodySeqNo")::numeric)             Filter: ("Message"."bodySeqNo" IS NULL)              ->  Index Scan using
"Body_pkey"on "Body"  (cost=0.00..383702.32 rows=4020835 width=12)              ->  Index Scan using "Message_Body" on
"Message" (cost=0.00..378901.17 rows=4021733 width=12) 
(9 rows)
Just some ideas to look at while you're "in the neighborhood."
-Kevin

Re: IN vs EXISTS equivalence

From

"Kevin Grittner"

Date:

04 August 2008, 19:36:07

>>> On Mon, Oct 22, 2007 at  1:30 PM, Simon Riggs wrote:
> On Mon, 2007-10-22 at 09:31 -0500, Kevin Grittner wrote:
>> I've requested this before without response, but I'm asking again
>> because it just caused me pain again: could we get a TODO added to
>> have the planner recognize equivalent IN and EXISTS constructs and
>> have them compete on cost estimates?  I know it's not a trivial
>> improvement, but if it's on the list maybe someone will pick it up,
>> and I see it as the single biggest weakness in PostgreSQL
>> performance.
>
> I'll pick it up as a default unless someone requests they have it
from
> me.

Since Simon's focus has shifted to other issues, I'm hoping this can
go onto the TODO list.

I'm adding some NOT EXISTS examples to the thread for completeness of
what someone might want to address while working on it.  For two
queries which can easily be shown (to a human viewer, anyway) to
return identical results, I see performance differences of over five
orders of magnitude.  (Six if you compare to the LEFT JOIN ... WHERE
not_null_right_column IS NULL trick.)

Below are the cost estimates of the three techniques for a
medium-sized table joining to a large table, and for a large table
joining to a small table.  The IN behavior has the worst worst-case
behavior, at least in queries that I've run, although many people
report that it is usually faster.  The technique of doing an existence
test with a LEFT JOIN and then checking whether a NOT NULL column from
the right-hand table is null is often faster than either technique,
and seldom much worse than the best technique for any given test.

Queries and plans attached.  Summary of costs below, in millions of
cost units. (Fractions of a million discarded.)

NOT IN (independent_subquery)
19745843, 5

WHERE NOT EXISTS
74, 318

LEFT JOIN WHERE not_null_right_column IS NULL
10, 17

These cost estimates tend to come out in pretty consistent ratio to
the actual run times.

-Kevin

Attachment

not-exists-timings.txt

Re: IN vs EXISTS equivalence

From

Tom Lane

Date:

04 August 2008, 20:48:31

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> I'm adding some NOT EXISTS examples to the thread for completeness of
> what someone might want to address while working on it.  For two
> queries which can easily be shown (to a human viewer, anyway) to
> return identical results, I see performance differences of over five
> orders of magnitude.

Could we see EXPLAIN ANALYZE not just EXPLAIN for these?  When people
are complaining of bad planner behavior, I don't find bare EXPLAIN
output to be very convincing.
        regards, tom lane

Re: IN vs EXISTS equivalence

From

"Kevin Grittner"

Date:

05 August 2008, 10:45:11

>>> On Mon, Aug 4, 2008 at  6:48 PM, in message
<15026.1217893692@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> I'm adding some NOT EXISTS examples to the thread for completeness
of
>> what someone might want to address while working on it.  For two
>> queries which can easily be shown (to a human viewer, anyway) to
>> return identical results, I see performance differences of over
five
>> orders of magnitude.
> 
> Could we see EXPLAIN ANALYZE not just EXPLAIN for these?  When
people
> are complaining of bad planner behavior, I don't find bare EXPLAIN
> output to be very convincing.
I'll give it a shot.  I've never had the patience to let the one with
the cost five or six orders of magnitude higher than the others run to
completion, but I've started the lot of 'em.
-Kevin

Re: IN vs EXISTS equivalence

From

"Kevin Grittner"

Date:

05 August 2008, 13:34:18

>>> On Mon, Aug 4, 2008 at  6:48 PM, in message
<15026.1217893692@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>> I'm adding some NOT EXISTS examples to the thread for completeness
of
>> what someone might want to address while working on it.  For two
>> queries which can easily be shown (to a human viewer, anyway) to
>> return identical results, I see performance differences of over
five
>> orders of magnitude.
>
> Could we see EXPLAIN ANALYZE not just EXPLAIN for these?  When
people
> are complaining of bad planner behavior, I don't find bare EXPLAIN
> output to be very convincing.

The other five queries have a cost to millisecond ratio of between 9.8
and 267.  If the expensive one falls in the same range, it will run
for 2.3 to 64 years.  I know I have left it running for days before
without completion.  I don't think I can devote the resources to it.
Attached are the EXPLAIN ANALYZE output for the other five.

-Kevin

Attachment

not-exists-timings2.out

Re: IN vs EXISTS equivalence

From

Tom Lane

Date:

08 August 2008, 17:24:20

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
>>> I'm adding some NOT EXISTS examples to the thread for completeness
>>> of what someone might want to address while working on it.  For two
>>> queries which can easily be shown (to a human viewer, anyway) to
>>> return identical results, I see performance differences of over
>>> five orders of magnitude.

I've been looking at this a bit and have an action plan worked out.

I believe that the optimizable cases for EXISTS are those where the
EXISTS() is either at the top level of WHERE, or just underneath a NOT,
and the contained subselect:

* has no set operations (UNION etc), grouping, set-returning functions
in the SELECT list, LIMIT, or a few other funny cases

* references outer-level variables only in its WHERE clause.

Given these conditions, an EXISTS is equivalent to a standard semijoin
between the outer relations used in its WHERE clause and the sub-select
with the WHERE removed, where the join condition is just the WHERE
clause.  (A semijoin returns one copy of each LHS row for which there
exists at least one RHS row satisfying the join condition.)

Similarly, a NOT EXISTS is equivalent to a standard anti-semijoin.
(An anti-semijoin returns one copy of each LHS row for which there
exists no RHS row satisfying the join condition.)

So what I think we should do is implement JOIN_SEMI and JOIN_ANTI
as variant outer-join types in the planner and executor, and convert
EXISTS into those.  JOIN_SEMI would replace the existing JOIN_IN support.
(It's effectively the same thing so far as the executor is concerned,
but there are some places in the planner that assume an IN join condition
consists of one or more btree equality operators.  I'm envisioning folding
the current planner support for IN into the more general outer-join
planning infrastructure, and fixing it so that it doesn't assume the
join clause must be of that form.)

Explicit support for JOIN_ANTI is probably not strictly necessary:
we could represent it using the "LEFT JOIN WHERE rhs-variable IS NULL"
hack.  However this only works if the join's ON-condition can be proved
strict for at least one RHS variable, which is an assumption I'd rather
not require for optimizing EXISTS.  Also, the planner should be able to
make much better estimates of the size of an antijoin result if it has an
explicit concept that that's what's happening.  (The existing kluge in
nulltestsel() is not only ugly but has little hope of ever giving an
accurate estimate.)  So I'd prefer to go the other way: make the planner
recognize the IS NULL trick and remove the IS NULL clause in favor of
marking the LEFT JOIN as being really an antijoin.

As far as IN goes: IN at the top level of WHERE can still be optimized
to a semijoin under the same conditions as now.  NOT IN is a lot trickier,
for the same reason that typically trips up novices who try to use it:
if any row of the subselect produces a NULL comparison result, then it is
impossible for the NOT IN to result in TRUE, which means that it does not
function as a standard antijoin.  I thought about optimizing it only in
the case where we can prove that the subselect outputs and the compared-to
values are known NOT NULL (which in typical cases we could prove by
looking for NOT NULL constraints on those table columns).  The trouble
with this is that that's not a sufficient condition: you must also assume
that the comparison operator involved never yields NULL for non-null
inputs.  That might be okay for btree comparison functions but it's not a
very comfy assumption in general; we certainly haven't got any explicit
knowledge that any functions are guaranteed to act that way.  So this
case might be worth doing later but I'm not feeling excited about it.
We generally tell people to avoid NOT IN and I'm happy to keep on
saying that.

Comments?
        regards, tom lane

Re: IN vs EXISTS equivalence

From

"Kevin Grittner"

Date:

08 August 2008, 19:25:34

>>> Tom Lane <tgl@sss.pgh.pa.us> wrote:

> I believe that the optimizable cases for EXISTS are those where the
> EXISTS() is either at the top level of WHERE, or just underneath a
NOT,

The rest of the plan makes sense to me, but this part seems narrow.
There's probably a good reason for that which is beyond my depth, but
attached is a view that is used for calculating statistics for a
database which is primarily used for "case management" purposes.  If
EXISTS could also be optimized in the contexts used there, it would be
great.

For context, this view references three other non-trivial views
(MatterHistSearch, MatterHistStage, & MatterHistStatus), and does
perform acceptably, so it's not a matter of complaining if it can't
work here, just providing a real-world example of other useful
contexts for EXISTS which might be worth covering if possible.

This view is used in a large number of queries, mostly either
selecting a single date with other selection criteria or counting rows
which match a set of matterNo values selected on complex criteria.

By the way, this view was totally unusable under 8.2.  Showing how it
worked under 8.3 was all it took to get them to expedite the upgrade.
These views, possible because of improvements in 8.3, saved "countless
programmer days" (to quote one of the programmers).

-Kevin

Attachment

MatterDateStat.txt

Re: IN vs EXISTS equivalence

From

Tom Lane

Date:

11 August 2008, 13:59:51

"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
>> I believe that the optimizable cases for EXISTS are those where the
>> EXISTS() is either at the top level of WHERE, or just underneath a
>> NOT,
> The rest of the plan makes sense to me, but this part seems narrow. 
> There's probably a good reason for that which is beyond my depth, but
> attached is a view that is used for calculating statistics for a
> database which is primarily used for "case management" purposes.  If
> EXISTS could also be optimized in the contexts used there, it would be
> great.
I chewed on that for awhile.  We can certainly optimize EXISTS that's
appearing in the ON-condition of a regular JOIN, because that's not
really semantically different from a WHERE-condition.  But I don't think
it's going to be reasonable to improve EXISTS in outer-JOIN ON
conditions.  There are a couple of problems.  Consider

t1 LEFT JOIN t2 ON (t1.f1 = t2.f2 AND     EXISTS(SELECT 1 FROM t3 WHERE t3.f3 = t1.fx AND t3.f4 = t2.fy))

To implement this with the correct semantics, we'd have to have the
t1/t2 join and the t1/t2/t3 join going on in the same execution node,
with two different join behaviors (LEFT and SEMI).  There isn't any
way AFAICS to factor it into two separate steps.  That's unreasonably
complicated, and it's not clear that you'd get any better performance
anyway than the current implementation (which'd treat the EXISTS as
a subplan).

Even worse, let the EXISTS be a degenerate case:

t1 LEFT JOIN t2 ON (t1.f1 = t2.f2 AND     EXISTS(SELECT 1 FROM t3 WHERE t3.f3 = t1.fx));

We can't actually treat this EXISTS as a semijoin between t1 and t3,
either before or after the LEFT JOIN; because then the behavior would be
to drop t1 rows that have no t3 match, which is not what this query
specifies.

(Note: the other degenerate case, where the EXISTS depends only on
t2, *could* be optimized since we could just let the semijoin be
performed within the RHS of the LEFT JOIN.)

So this is not something I'm going to tackle; at least not this
devel cycle.

One small step we can take in this direction, though, is to improve the
planner's internal handling of the qual conditions for IN and EXISTS.
Right now the process is just to throw the sub-select into the main
range table and put the IN join conditions into the same place in WHERE
that the IN-clause was to start with.  The trouble with this is that the
distribute_quals_to_rels processing has no idea that there's anything
special about the IN join conditions.  We got away with that for the
limited case of IN clauses at the top level of WHERE, but it's become
clear to me over the weekend that this has no hope of working for NOT
EXISTS --- since that's effectively an outer join, it absolutely has to
have the same kinds of qual-scheduling constraints as ordinary outer
joins do.  So we need a data structure that distribute_quals_to_rels can
work with.  What I think needs to happen is that the initial pass that
pulls up optimizable IN/EXISTS sub-selects should not merge the
SubLink's replacement qual clauses seamlessly, but put them underneath a
new node type, say "FlattenedSubLink", that retains knowledge of the
join it's representing.  The FlattenedSubLink would survive only as far
as distribute_quals_to_rels, which would distribute out the contained
qual conditions instead of the FlattenedSubLink itself --- but only
after marking them properly for the outer-join restrictions.  This
representation would make it feasible to work with IN/EXISTS that are
inside JOIN ON conditions, which the present representation using a
single in_info_list really can't do.  The semantic issues are still
there but at least the representation isn't getting in the way ...
        regards, tom lane

Re: IN vs EXISTS equivalence

From

Decibel!

Date:

11 August 2008, 15:57:10

On Aug 8, 2008, at 3:23 PM, Tom Lane wrote:
> * has no set operations (UNION etc), grouping, set-returning functions
> in the SELECT list, LIMIT, or a few other funny cases


Couldn't union/union all be treated as

EXISTS(a)
OR EXISTS(b)
...

Or am I missing some detail with NULLS?

Personally, I'd rather write it as separate EXISTS clauses rather  
than using UNION, but perhaps others have a different preference...
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Re: IN vs EXISTS equivalence

From

"Kevin Grittner"

Date:

11 August 2008, 16:15:22

Our Internet connectivity failed as this was being sent.  It looks
like at least the list didn't get it, so here goes another try. 
Apologies for any duplication.
-Kevin
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> I chewed on that for awhile.  We can certainly optimize EXISTS
that's
> appearing in the ON-condition of a regular JOIN, because that's not
> really semantically different from a WHERE-condition.
Good to hear.  I thought that might be doable.  :-)
> But I don't think
> it's going to be reasonable to improve EXISTS in outer-JOIN ON
> conditions.  There are a couple of problems.  Consider
The discussion did make the difficulties clear.
> So this is not something I'm going to tackle; at least not this
> devel cycle.
Fair enough.
> One small step we can take in this direction, though, is to improve
the
> planner's internal handling of the qual conditions for IN and
EXISTS.
> Right now the process is just to throw the sub-select into the main
> range table and put the IN join conditions into the same place in
WHERE
> that the IN-clause was to start with.  The trouble with this is that
the
> distribute_quals_to_rels processing has no idea that there's
anything
> special about the IN join conditions.  We got away with that for the
> limited case of IN clauses at the top level of WHERE, but it's
become
> clear to me over the weekend that this has no hope of working for
NOT
> EXISTS --- since that's effectively an outer join, it absolutely has
to
> have the same kinds of qual-scheduling constraints as ordinary outer
> joins do.  So we need a data structure that distribute_quals_to_rels
can
> work with.  What I think needs to happen is that the initial pass
that
> pulls up optimizable IN/EXISTS sub-selects should not merge the
> SubLink's replacement qual clauses seamlessly, but put them
underneath a
> new node type, say "FlattenedSubLink", that retains knowledge of the
> join it's representing.  The FlattenedSubLink would survive only as
far
> as distribute_quals_to_rels, which would distribute out the
contained
> qual conditions instead of the FlattenedSubLink itself --- but only
> after marking them properly for the outer-join restrictions.  This
> representation would make it feasible to work with IN/EXISTS that
are
> inside JOIN ON conditions, which the present representation using a
> single in_info_list really can't do.  The semantic issues are still
> there but at least the representation isn't getting in the way ...
Just curious, is that something for this cycle, or a TODO item?
Thanks for looking at this.  The one part I'm not sure about is where
the CASE/EXISTS in the SELECT value list fits into this discussion. 
It seems conceptually similar to the OUTER JOIN, but sort of a special
case, so I'm not sure what you had in mind there.
-Kevin

Re: IN vs EXISTS equivalence

From

Gregory Stark

Date:

11 August 2008, 17:40:30

"Decibel!" <decibel@decibel.org> writes:

> On Aug 8, 2008, at 3:23 PM, Tom Lane wrote:
>> * has no set operations (UNION etc), grouping, set-returning functions
>> in the SELECT list, LIMIT, or a few other funny cases
>
>
> Couldn't union/union all be treated as
>
> EXISTS(a)
> OR EXISTS(b)

Kind of confused by what you mean here. Can you give an example?

The usual transformation to consider with UNION is to transform

SELECT ... WHERE x OR y

into

SELECT ...WHERE x
UNION ALL
SELECT ...WHERE y AND NOT x

(modulo handling NULLs properly)

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!

Re: IN vs EXISTS equivalence

From

Tom Lane

Date:

11 August 2008, 22:46:10

Decibel! <decibel@decibel.org> writes:
> On Aug 8, 2008, at 3:23 PM, Tom Lane wrote:
>> * has no set operations (UNION etc), grouping, set-returning functions
>> in the SELECT list, LIMIT, or a few other funny cases

> Couldn't union/union all be treated as
> EXISTS(a)
> OR EXISTS(b)

Perhaps, but that would end up defeating the optimization anyway,
because as soon as the EXISTS is underneath an OR, it's no longer
representing a potential join clause.
        regards, tom lane

Re: IN vs EXISTS equivalence

From

Decibel!

Date:

12 August 2008, 20:02:32

On Aug 11, 2008, at 3:40 PM, Gregory Stark wrote:
> "Decibel!" <decibel@decibel.org> writes:
>> On Aug 8, 2008, at 3:23 PM, Tom Lane wrote:
>>> * has no set operations (UNION etc), grouping, set-returning  
>>> functions
>>> in the SELECT list, LIMIT, or a few other funny cases
>>
>>
>> Couldn't union/union all be treated as
>>
>> EXISTS(a)
>> OR EXISTS(b)
>
> Kind of confused by what you mean here. Can you give an example?
>
> The usual transformation to consider with UNION is to transform
>
> SELECT ... WHERE x OR y
>
> into
>
> SELECT ...
>  WHERE x
> UNION ALL
> SELECT ...
>  WHERE y AND NOT x

It was a case of the union being inside the EXISTS subquery, ie:

WHERE EXISTS (SELECT * FROM a UNION SELECT * FROM b)

AFAIK that's identical to

WHERE EXISTS (SELECT * FROM a) OR EXISTS (SELECT * FROM b)

But as Tom mentioned, as soon as you have it you can't answer it with  
a simple join, so it doesn't matter...
-- 
Decibel!, aka Jim C. Nasby, Database Architect  decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Re: IN vs EXISTS equivalence

From

Simon Riggs

Date:

14 August 2008, 14:43:23

On Fri, 2008-08-08 at 16:23 -0400, Tom Lane wrote:

> NOT IN is a lot trickier,
> for the same reason that typically trips up novices who try to use it:
> if any row of the subselect produces a NULL comparison result, then it
> is impossible for the NOT IN to result in TRUE, which means that it
> does not function as a standard antijoin.  I thought about optimizing
> it only in the case where we can prove that the subselect outputs and
> the compared-to values are known NOT NULL (which in typical cases we
> could prove by looking for NOT NULL constraints on those table
> columns).  The trouble with this is that that's not a sufficient
> condition: you must also assume that the comparison operator involved
> never yields NULL for non-null inputs.  That might be okay for btree
> comparison functions but it's not a very comfy assumption in general;
> we certainly haven't got any explicit knowledge that any functions are
> guaranteed to act that way.  So this case might be worth doing later
> but I'm not feeling excited about it. We generally tell people to
> avoid NOT IN and I'm happy to keep on saying that.

Just found this comment, after reading what you said on other thread
about NOT IN.

NOT IN is a serious performance issue for most people. We simply can't
say to people "you were told not to".

If we can fix it easily for the majority of cases, we should. We can't
let the "it won't work in certain cases" reason prevent various
optimizations from going in. There are tons of places where we say "XXX
needs later improvement" in code comments. So lets do that here also. It
certainly wouldn't be the first optimization/feature that went into code
in a restricted way that didn't work for all cases: hash joins, ANALYZE,
partial indexes etc..

Anybody that is writing complex SQL with user defined operators knows
enough to re-write their queries correctly, so there will be almost no
negative effect from making the NOT IN optimisation a special case. And
if there is an effect, the people effected can fix the problem.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support

Re: IN vs EXISTS equivalence

From

Tom Lane

Date:

17 August 2008, 18:29:39

If you're still interested in testing CVS HEAD's handling of EXISTS,
I've about finished what I wanted to do with it.
        regards, tom lane

Re: IN vs EXISTS equivalence

From

"Pavel Stehule"

Date:

18 August 2008, 06:02:43

Hello

I did some fast test on pagila database.

8.4
postgres=# explain analyze select * from film f where exists  (select
film_id from film_actor where f.film_id = film_id);                                                         QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------Hash
Join (cost=117.01..195.42 rows=966 width=390) (actual
 
time=36.011..43.314 rows=997 loops=1)  Hash Cond: (f.film_id = film_actor.film_id)  ->  Seq Scan on film f
(cost=0.00..65.00rows=1000 width=390)
 
(actual time=0.027..1.971 rows=1000 loops=1)  ->  Hash  (cost=104.94..104.94 rows=966 width=2) (actual
time=35.886..35.886 rows=997 loops=1)        ->  HashAggregate  (cost=95.28..104.94 rows=966 width=2)
(actual time=32.650..34.139 rows=997 loops=1)              ->  Seq Scan on film_actor  (cost=0.00..81.62 rows=5462
width=2) (actual time=0.081..14.232 rows=5462 loops=1)Total runtime: 45.373 ms
(7 rows)

8.3
postgres=# explain select * from film f where exists  (select film_id
from film_actor where f.film_id = film_id);                                       QUERY PLAN
------------------------------------------------------------------------------------------Seq Scan on film f
(cost=0.00..4789.34rows=500 width=390)  Filter: (subplan)  SubPlan    ->  Index Scan using idx_fk_film_id on
film_actor
(cost=0.00..28.35 rows=6 width=2)          Index Cond: ($0 = film_id)
(5 rows)

postgres=# explain analyze select * from film f where not exists
(select film_id from film_actor where f.film_id = film_id);                                                      QUERY
PLAN

------------------------------------------------------------------------------------------------------------------------Hash
AntiJoin  (cost=149.90..240.24 rows=34 width=390) (actual
 
time=25.473..28.169 rows=3 loops=1)  Hash Cond: (f.film_id = film_actor.film_id)  ->  Seq Scan on film f
(cost=0.00..65.00rows=1000 width=390)
 
(actual time=0.027..1.898 rows=1000 loops=1)  ->  Hash  (cost=81.62..81.62 rows=5462 width=2) (actual
time=24.398..24.398 rows=5462 loops=1)        ->  Seq Scan on film_actor  (cost=0.00..81.62 rows=5462
width=2) (actual time=0.035..12.400 rows=5462 loops=1)Total runtime: 28.866 ms

postgres=# explain analyze select * from film f where not exists
(select film_id from film_actor where f.film_id = film_id);
QUERY
PLAN

---------------------------------------------------------------------------------------------------------------------------------------Seq
Scanon film f  (cost=0.00..4789.34 rows=500 width=390) (actual
 
time=5.874..22.654 rows=3 loops=1)  Filter: (NOT (subplan))  SubPlan    ->  Index Scan using idx_fk_film_id on
film_actor
(cost=0.00..28.35 rows=6 width=2) (actual time=0.016..0.016 rows=1
loops=1000)          Index Cond: ($0 = film_id)Total runtime: 22.835 ms
(6 rows)

postgres=# explain analyze select * from film f where film_id in
(select film_id from film_actor);                                                         QUERY PLAN

------------------------------------------------------------------------------------------------------------------------------Hash
Join (cost=117.01..195.42 rows=966 width=390) (actual
 
time=43.151..53.688 rows=997 loops=1)  Hash Cond: (f.film_id = film_actor.film_id)  ->  Seq Scan on film f
(cost=0.00..65.00rows=1000 width=390)
 
(actual time=0.021..6.765 rows=1000 loops=1)  ->  Hash  (cost=104.94..104.94 rows=966 width=2) (actual
time=43.091..43.091 rows=997 loops=1)        ->  HashAggregate  (cost=95.28..104.94 rows=966 width=2)
(actual time=34.754..36.275 rows=997 loops=1)              ->  Seq Scan on film_actor  (cost=0.00..81.62 rows=5462
width=2) (actual time=0.024..15.746 rows=5462 loops=1)Total runtime: 55.291 ms

postgres=# explain analyze select * from film f where film_id in
(select film_id from film_actor);                                                            QUERY
PLAN

------------------------------------------------------------------------------------------------------------------------------------Nested
LoopIN Join  (cost=0.00..175.25 rows=986 width=390) (actual
 
time=0.090..14.272 rows=997 loops=1)  ->  Seq Scan on film f  (cost=0.00..65.00 rows=1000 width=390)
(actual time=0.014..1.877 rows=1000 loops=1)  ->  Index Scan using idx_fk_film_id on film_actor  (cost=0.00..0.54
rows=6 width=2) (actual time=0.007..0.007 rows=1 loops=1000)        Index Cond: (film_actor.film_id = f.film_id)Total
runtime:15.902 ms
 
(5 rows)

8.4
postgres=# explain analyze select * from film f where film_id not in
(select film_id from film_actor);                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------Seq
Scanon film f  (cost=95.28..162.78 rows=500 width=390) (actual
 
time=24.324..26.015 rows=3 loops=1)  Filter: (NOT (hashed subplan))  SubPlan    ->  Seq Scan on film_actor
(cost=0.00..81.62rows=5462 width=2)
 
(actual time=0.026..12.074 rows=5462 loops=1)Total runtime: 26.201 ms
(5 rows)

8.3
postgres=# explain analyze select * from film f where film_id not in
(select film_id from film_actor);                                                    QUERY PLAN
--------------------------------------------------------------------------------------------------------------------Seq
Scanon film f  (cost=95.28..162.78 rows=500 width=390) (actual
 
time=29.713..30.456 rows=3 loops=1)  Filter: (NOT (hashed subplan))  SubPlan    ->  Seq Scan on film_actor
(cost=0.00..81.62rows=5462 width=2)
 
(actual time=0.051..13.649 rows=5462 loops=1)Total runtime: 30.644 ms
(5 rows)


8.4 is 10% faster than 8.3 on small table < 1000 rows. 8.4 has much
better prediction.

Regards
Pavel Stehule

2008/8/17 Tom Lane <tgl@sss.pgh.pa.us>:
> If you're still interested in testing CVS HEAD's handling of EXISTS,
> I've about finished what I wanted to do with it.
>
>                        regards, tom lane
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>

Re: IN vs EXISTS equivalence

From

"Kevin Grittner"

Date:

18 August 2008, 16:48:26

>>> On Sun, Aug 17, 2008 at  4:29 PM, in message
<27116.1219008559@sss.pgh.pa.us>,
Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> If you're still interested in testing CVS HEAD's handling of EXISTS,
> I've about finished what I wanted to do with it.
Thanks.  I'm very interested; unfortunately I can't get to it until at
least Friday.
-Kevin

Re: IN vs EXISTS equivalence

From

daveg

Date:

03 September 2008, 03:17:43

On Thu, Aug 14, 2008 at 06:50:09PM +0100, Simon Riggs wrote:
> 
> On Fri, 2008-08-08 at 16:23 -0400, Tom Lane wrote:
> 
> > NOT IN is a lot trickier,
> > condition: you must also assume that the comparison operator involved
> > never yields NULL for non-null inputs.  That might be okay for btree
> > comparison functions but it's not a very comfy assumption in general;
> > we certainly haven't got any explicit knowledge that any functions are
> > guaranteed to act that way.  So this case might be worth doing later
... 
> Just found this comment, after reading what you said on other thread
> about NOT IN.
> 
> NOT IN is a serious performance issue for most people. We simply can't
> say to people "you were told not to".
> 
> If we can fix it easily for the majority of cases, we should. We can't
> let the "it won't work in certain cases" reason prevent various

A suggestion: what about adding an attribute to functions to declare that
they never return null? 
  declare foo(int, int) returns int immutable not null as ...


-dg


-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.

Re: IN vs EXISTS equivalence

From

"Asko Oja"

Date:

03 September 2008, 05:28:07

On Wed, Sep 3, 2008 at 9:17 AM, daveg <daveg@sonic.net> wrote:

On Thu, Aug 14, 2008 at 06:50:09PM +0100, Simon Riggs wrote:
>
> On Fri, 2008-08-08 at 16:23 -0400, Tom Lane wrote:
>
> > NOT IN is a lot trickier,
> > condition: you must also assume that the comparison operator involved
> > never yields NULL for non-null inputs. That might be okay for btree
> > comparison functions but it's not a very comfy assumption in general;
> > we certainly haven't got any explicit knowledge that any functions are
> > guaranteed to act that way. So this case might be worth doing later
...
> Just found this comment, after reading what you said on other thread
> about NOT IN.
>
> NOT IN is a serious performance issue for most people. We simply can't
> say to people "you were told not to".
>
> If we can fix it easily for the majority of cases, we should. We can't
> let the "it won't work in certain cases" reason prevent various

A suggestion: what about adding an attribute to functions to declare that
they never return null?

And if function still returns null then error will be raised?
Then you will end up adding NOT NULL also to IN and OUT parameters.
IIRC it was possible in Oracle to declare local variables NOT NULL.

declare foo(int, int) returns int immutable not null as ...

-dg

--
David Gould daveg@sonic.net 510 536 1443 510 282 0869
If simplicity worked, the world would be overrun with insects.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: IN vs EXISTS equivalence

From

"Kevin Grittner"

Date:

03 September 2008, 18:27:46

>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> If you're still interested in testing CVS HEAD's handling of EXISTS,
> I've about finished what I wanted to do with it.
It's been hectic here, but I've managed to let some stuff run in the
background using an old test case from here:
http://archives.postgresql.org/pgsql-hackers/2007-03/msg01408.php
explain analyze
SELECT "A"."adjustmentNo", "A"."tranNo", "A"."countyNo", "H"."date",
"H"."userId", "H"."time" FROM "Adjustment" "A" JOIN "TranHeader" "H" ON ("H"."tranId" = "A"."adjustmentNo" AND
"H"."countyNo" = "A"."countyNo" AND "H"."tranNo" = "A"."tranNo") WHERE "H"."tranType" = 'A'   AND "A"."date" > DATE
'2006-01-01'  AND "H"."countyNo" = 66   AND "A"."countyNo" = 66   AND EXISTS       (         SELECT 1 FROM "TranDetail"
"D"          WHERE "D"."tranNo" = "H"."tranNo"             AND "D"."countyNo" = "H"."countyNo"             AND
"D"."caseNo"LIKE '2006TR%'       )
 
;
On development machine using 8.3.3:Nested Loop  (cost=0.00..399190.49 rows=1 width=37) (actual
time=7184.068..3249391.592 rows=12372 loops=1)  Join Filter: (("A"."adjustmentNo")::text = ("H"."tranId")::text)  ->
SeqScan on "Adjustment" "A"  (cost=0.00..5218.87 rows=247869
 
width=17) (actual time=9.804..1695.691 rows=248674 loops=1)        Filter: (((date)::date > '2006-01-01'::date) AND
(("countyNo")::smallint = 66))  ->  Index Scan using "TranHeader_pkey" on "TranHeader" "H" 
(cost=0.00..1.57 rows=1 width=37) (actual time=13.056..13.056 rows=0
loops=248674)        Index Cond: ((("H"."tranNo")::integer =
("A"."tranNo")::integer) AND (("H"."countyNo")::smallint = 66))        Filter: ((("H"."tranType")::text = 'A'::text)
AND(subplan))        SubPlan          ->  Index Scan using "TranDetail_TranDetCaseNo" on
 
"TranDetail" "D"  (cost=0.00..1.29 rows=1 width=0) (actual
time=13.017..13.017 rows=0 loops=248674)                Index Cond: ((("caseNo")::text >= '2006TR'::text) AND
(("caseNo")::text < '2006TS'::text) AND (("tranNo")::integer =
($0)::integer) AND (("countyNo")::smallint = ($1)::smallint))                Filter: (("caseNo")::text ~~
'2006TR%'::text)Totalruntime: 3249404.662 ms
 
On the same machine, using the snapshot from this morning:Nested Loop  (cost=1963.24..38483.54 rows=1 width=37)
(actual
time=372.964..986.994 rows=12372 loops=1)  Join Filter: (("H"."tranNo")::integer = ("A"."tranNo")::integer)  ->  Merge
SemiJoin  (cost=1963.24..31012.28 rows=21317 width=37)
 
(actual time=372.926..839.298 rows=12372 loops=1)        Merge Cond: (("H"."tranNo")::integer =
("D"."tranNo")::integer)        Join Filter: (("D"."countyNo")::smallint =
("H"."countyNo")::smallint)        ->  Index Scan using "TranHeader_pkey" on "TranHeader" "H" 
(cost=0.00..27848.57 rows=322517 width=37) (actual time=3.722..526.124
rows=311963 loops=1
)              Index Cond: (("countyNo")::smallint = 66)              Filter: (("tranType")::text = 'A'::text)
-> Sort  (cost=1963.17..2027.08 rows=25565 width=6) (actual
 
time=171.512..191.688 rows=76597 loops=1)              Sort Key: "D"."tranNo"              Sort Method:  quicksort
Memory:6663kB              ->  Index Scan using "TranDetail_TranDetCaseNo" on
 
"TranDetail" "D"  (cost=0.00..91.57 rows=25565 width=6) (actual
time=0.031..100.688 rows=7659
7 loops=1)                    Index Cond: ((("caseNo")::text >= '2006TR'::text)
AND (("caseNo")::text < '2006TS'::text) AND (("countyNo")::smallint =
66))                    Filter: (("caseNo")::text ~~ '2006TR%'::text)  ->  Index Scan using "Adjustment_pkey" on
"Adjustment""A" 
 
(cost=0.00..0.34 rows=1 width=17) (actual time=0.009..0.010 rows=1
loops=12372)        Index Cond: ((("A"."adjustmentNo")::text =
("H"."tranId")::text) AND (("A"."countyNo")::smallint = 66))        Filter: (("A".date)::date >
'2006-01-01'::date)Totalruntime: 991.097 ms
 
The chosen plan looks very reasonable, and performs very well.  Nice!
After converting the database I originally forgot to run VACUUM
ANALYZE.  Even planning "blind" and doing hint-bit rewrites it picked
a plan which ran in under 10 seconds.
I'll be running other tests as I get the chance.
-Kevin