Re: Execute query with EXCEPT, INTERSECT as anti-join, join? - Mailing list pgsql-hackers

From Colin 't Hart
Subject Re: Execute query with EXCEPT, INTERSECT as anti-join, join?
Date
Msg-id CAMon-aRv2URrkzdmuUEsRfjW4KNwf7f5DOBrfH14ioXtruGw8g@mail.gmail.com
Whole thread Raw
In response to Re: Execute query with EXCEPT, INTERSECT as anti-join, join?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Execute query with EXCEPT, INTERSECT as anti-join, join?  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On 11 November 2013 15:16, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Colin 't Hart" <colin@sharpheart.org> writes:
>> On 11 November 2013 14:34, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> No, and it probably won't ever be, since the semantics aren't the same.
>>> EXCEPT/INTERSECT imply duplicate elimination.
>
>> Can't we just use DISTINCT for that?
>
> If you have to do a DISTINCT it's not clear to me that you're going to get
> much win.
>
> (The bigger picture here is that pretty much zero optimization effort has
> been spent on EXCEPT/INTERSECT, because they're just not used that much
> compared to other places where we could put that effort.)

I'm asking because I just encountered several cases where the
anti-join was *much* faster. In each case <query_1>'s result was
relatively small compared to <query_2>'s result or the "related rows"
from <query_2> were a much smaller set than the whole result of
<query_2>. In these cases, when the executor new how the two halves of
the query were related -- and that's the crux here: by writing EXCEPT
the executor couldn't determine how the two halves of the query were
related -- the anti-join was about 1000 times faster.

I think it's similar to the NOT IN which most DBMSes solved about 20
years ago but before that everyone used to rewrite by hand as NOT
EXISTS: sometimes we want to write query as EXCEPT because it's
clearer but execute it as an anti-join with DISTINCT.

Would these be difficult to build in? While I know a lot about how
DBMS engines work I've not hacked at PG internals. I'd be more than
willing to look at it, but could use some pointers as to where to
start. In particular, does PG rewrite queries in any way? Is it
possible to simply "rewrite the query" and then pass to the optimizer
to see if it would result in a better plan?

if I can improve EXCEPT, the same could also be applied to INTERSECT
as a "DISTINCT join".

Thanks & regards,

Colin



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: pg_dump and pg_dumpall in real life
Next
From: "Colin 't Hart"
Date:
Subject: Re: TABLE not synonymous with SELECT * FROM?