Re: IN vs EXISTS equivalence - Mailing list pgsql-hackers

From Kevin Grittner
Subject Re: IN vs EXISTS equivalence
Date
Msg-id 48973DE5.EE98.0025.0@wicourts.gov
Whole thread Raw
In response to Re: IN vs EXISTS equivalence  (Simon Riggs <simon@2ndquadrant.com>)
Responses Re: IN vs EXISTS equivalence  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
>>> On Mon, Oct 22, 2007 at  1:30 PM, Simon Riggs wrote:
> On Mon, 2007-10-22 at 09:31 -0500, Kevin Grittner wrote:
>> I've requested this before without response, but I'm asking again
>> because it just caused me pain again: could we get a TODO added to
>> have the planner recognize equivalent IN and EXISTS constructs and
>> have them compete on cost estimates?  I know it's not a trivial
>> improvement, but if it's on the list maybe someone will pick it up,
>> and I see it as the single biggest weakness in PostgreSQL
>> performance.
>
> I'll pick it up as a default unless someone requests they have it
from
> me.

Since Simon's focus has shifted to other issues, I'm hoping this can
go onto the TODO list.

I'm adding some NOT EXISTS examples to the thread for completeness of
what someone might want to address while working on it.  For two
queries which can easily be shown (to a human viewer, anyway) to
return identical results, I see performance differences of over five
orders of magnitude.  (Six if you compare to the LEFT JOIN ... WHERE
not_null_right_column IS NULL trick.)

Below are the cost estimates of the three techniques for a
medium-sized table joining to a large table, and for a large table
joining to a small table.  The IN behavior has the worst worst-case
behavior, at least in queries that I've run, although many people
report that it is usually faster.  The technique of doing an existence
test with a LEFT JOIN and then checking whether a NOT NULL column from
the right-hand table is null is often faster than either technique,
and seldom much worse than the best technique for any given test.

Queries and plans attached.  Summary of costs below, in millions of
cost units. (Fractions of a million discarded.)

NOT IN (independent_subquery)
19745843, 5

WHERE NOT EXISTS
74, 318

LEFT JOIN WHERE not_null_right_column IS NULL
10, 17

These cost estimates tend to come out in pretty consistent ratio to
the actual run times.

-Kevin

Attachment

pgsql-hackers by date:

Previous
From: "Jonah H. Harris"
Date:
Subject: Re: Automatic Client Failover
Next
From: Tom Lane
Date:
Subject: Re: Automatic Client Failover