Thread: WIP: patch to create explicit support for semi and anti joins

WIP: patch to create explicit support for semi and anti joins

From
Tom Lane
Date:
Here's a snapshot of my current work on improving IN/EXISTS support,
pursuant to the discussion here:
http://archives.postgresql.org/pgsql-hackers/2008-08/msg00347.php
I'm posting it because it's a pretty large patch already, and because
I wanted to get comments on the next step I'm planning.

What's done:

Introduce JOIN_SEMI and JOIN_ANTI join types, the former replacing
JOIN_IN.  Unify the InClauseInfo and OuterJoinInfo infrastructure into
"SpecialJoinInfo".  Convert IN, EXISTS, and NOT EXISTS clauses at top
level of WHERE into semi and anti joins respectively.  Recognize
LEFT JOIN with a suitable IS NULL filter condition as an anti join.
This all compiles and passes the regression tests.

What's not done:

nodeMergejoin.c doesn't yet handle JOIN_ANTI.  (This is just a SMOP,
but it's a lot more complicated than the nestloop or hash logic, and
I figured nestloop and hash were enough for testing the planner.)
In the meantime, joinpath.c contains some crude hacks to prevent a merge
antijoin path from being created.

Selectivity estimation for semijoins and antijoins is still mostly bogus.


What I have in mind to do to fix selectivity estimation is to pass around
the SpecialJoinInfo struct associated with any join operation other than
a plain-vanilla INNER join.  (Unifying InClauseInfo and OuterJoinInfo has
ensured that there always is one.)  This is happening in the core planner
already, as of the attached patch.  But in order to get anything useful
done it has to be made available to operator join selectivity estimation
functions too --- in particular, eqjoinsel and neqjoinsel, which have
been limping along for years with the knowledge that they didn't know
enough about their context.  Having the SpecialJoinInfo will let them
know not only that they are estimating for a nonstandard join, but
which side of the join is doing what.

For backwards compatibility, I think we should allow oprjoin functions
to have either the signature (internal, oid, internal, smallint)
(really PlannerInfo *, Oid, List *, JoinType) or the signature
(internal, oid, internal, smallint, internal)
(really PlannerInfo *, Oid, List *, JoinType, SpecialJoinInfo *).
The former is the old style and will allow us to not break user-defined
selectivity functions in 8.4.  But we should convert all the built-in
oprjoin functions to the new style.

Comments?

Barring objections, I'm thinking of committing what I have so far
and then starting to work on the selectivity fixes as a separate patch.

            regards, tom lane


Attachment

Re: WIP: patch to create explicit support for semi and anti joins

From
"David E. Wheeler"
Date:
On Aug 13, 2008, at 17:31, Tom Lane wrote:

> What's done:
>
> Introduce JOIN_SEMI and JOIN_ANTI join types, the former replacing
> JOIN_IN.  Unify the InClauseInfo and OuterJoinInfo infrastructure into
> "SpecialJoinInfo".  Convert IN, EXISTS, and NOT EXISTS clauses at top
> level of WHERE into semi and anti joins respectively.  Recognize
> LEFT JOIN with a suitable IS NULL filter condition as an anti join.
> This all compiles and passes the regression tests.

Wow. That sound awesome, Tom. Stupid question: Do these join types  
have some sort of correspondence to the SQL standard? Or would they be  
specific to PostgreSQL? Or is this just something that's under the  
hood an not actually a change to the syntax of SQL joins?

> What's not done:
>
> nodeMergejoin.c doesn't yet handle JOIN_ANTI.  (This is just a SMOP,
> but it's a lot more complicated than the nestloop or hash logic, and
> I figured nestloop and hash were enough for testing the planner.)

I guess that means you plan to do it once there has been significant  
testing with nestloop and hash and when the selectivity stuff is done?

Best,

David
(Who is in over his head, but decides to stick his toe in the water  
anyway.)


Re: WIP: patch to create explicit support for semi and anti joins

From
Tom Lane
Date:
"David E. Wheeler" <david@kineticode.com> writes:
> On Aug 13, 2008, at 17:31, Tom Lane wrote:
>> Introduce JOIN_SEMI and JOIN_ANTI join types,

> Wow. That sound awesome, Tom. Stupid question: Do these join types  
> have some sort of correspondence to the SQL standard?

Semi and anti joins are pretty standard concepts in relational theory,
but they have no direct mapping in the SQL join syntax.  You can write
them with certain well-known locutions, though:IN and EXISTS, with certain restrictions, represent semi joinNOT EXISTS,
withcertain restrictions, represents anti joinLEFT JOIN with an "incompatible" higher IS NULL test represents anti
join

Basically what this patch is about is teaching the planner that these
constructs are best understood via the relational-theory concepts.
We'd been doing it in a pretty ad-hoc way before, and run into a lot
of problems that we've had to kluge around.  I think that this approach
provides a structure that will actually work well.

> Or is this just something that's under the  
> hood an not actually a change to the syntax of SQL joins?

Right, there's no "user visible" feature or syntax change here.  We're
just trying to provide better performance for certain common SQL idioms.

>> What's not done:
>> 
>> nodeMergejoin.c doesn't yet handle JOIN_ANTI.  (This is just a SMOP,

> I guess that means you plan to do it once there has been significant  
> testing with nestloop and hash and when the selectivity stuff is done?

Actually, I got it done an hour or so ago --- it turned out to be easier
than I thought.  It just didn't seem like part of the critical path for
the patch, so I'd been willing to let it go till later.
        regards, tom lane


Re: WIP: patch to create explicit support for semi and anti joins

From
"David E. Wheeler"
Date:
On Aug 13, 2008, at 20:12, Tom Lane wrote:

>> Wow. That sound awesome, Tom. Stupid question: Do these join types
>> have some sort of correspondence to the SQL standard?
>
> Semi and anti joins are pretty standard concepts in relational theory,
> but they have no direct mapping in the SQL join syntax.  You can write
> them with certain well-known locutions, though:
>     IN and EXISTS, with certain restrictions, represent semi join
>     NOT EXISTS, with certain restrictions, represents anti join
>     LEFT JOIN with an "incompatible" higher IS NULL test represents  
> anti join
>
> Basically what this patch is about is teaching the planner that these
> constructs are best understood via the relational-theory concepts.
> We'd been doing it in a pretty ad-hoc way before, and run into a lot
> of problems that we've had to kluge around.  I think that this  
> approach
> provides a structure that will actually work well.

Great. Thanks for the explanation, Tom, as always.

>> Or is this just something that's under the
>> hood an not actually a change to the syntax of SQL joins?
>
> Right, there's no "user visible" feature or syntax change here.  We're
> just trying to provide better performance for certain common SQL  
> idioms.

Good, it makes a lot of sense.

>
>>> What's not done:
>>>
>>> nodeMergejoin.c doesn't yet handle JOIN_ANTI.  (This is just a SMOP,
>
>> I guess that means you plan to do it once there has been significant
>> testing with nestloop and hash and when the selectivity stuff is  
>> done?
>
> Actually, I got it done an hour or so ago --- it turned out to be  
> easier
> than I thought.  It just didn't seem like part of the critical path  
> for
> the patch, so I'd been willing to let it go till later.

I love it when things work that way. :-)

Best,

David



Re: WIP: patch to create explicit support for semi and anti joins

From
Simon Riggs
Date:
On Wed, 2008-08-13 at 23:12 -0400, Tom Lane wrote:

> We're just trying to provide better performance for certain common SQL
> idioms.

Sounds good, but can you explain how this will help? Not questioning it,
just after more information about it.

I'm half way through join removal patch, so this work might extend the
join elimination to semi/anti joins also (hopefully), or it might
(hopefully not) prevent the join elimination altogether. I'll let you
know how I get on.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: WIP: patch to create explicit support for semi and anti joins

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Wed, 2008-08-13 at 23:12 -0400, Tom Lane wrote:
>> We're just trying to provide better performance for certain common SQL
>> idioms.

> Sounds good, but can you explain how this will help?

1. Allowing optimization of EXISTS/NOT EXISTS as general-purpose joins.
Up to now, the best plan you could hope for was the equivalent of a
nestloop with inner indexscan, with the EXISTS subquery on the inside.
While that's not necessarily bad, it could be pretty bad for a large
outer table.  Now we'll have the option to consider merge and hash
joins too.

2. Allowing the planner to get better estimates of the result size of
these special join types.  In the past we've had to kluge that, and
the results weren't always good.  Part of what I'm doing (the unfinished
part ;-)) is to make more information about join context available to
selectivity estimation functions, which is something we've known we
needed for awhile.  I can't yet *prove* that I can get better estimates
with the added info, but if not, that just means I need to rethink what
to pass down exactly.
        regards, tom lane


Re: WIP: patch to create explicit support for semi and anti joins

From
Simon Riggs
Date:
On Thu, 2008-08-14 at 10:04 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > On Wed, 2008-08-13 at 23:12 -0400, Tom Lane wrote:
> >> We're just trying to provide better performance for certain common SQL
> >> idioms.
> 
> > Sounds good, but can you explain how this will help?
> 
> 1. Allowing optimization of EXISTS/NOT EXISTS as general-purpose joins.
> Up to now, the best plan you could hope for was the equivalent of a
> nestloop with inner indexscan, with the EXISTS subquery on the inside.
> While that's not necessarily bad, it could be pretty bad for a large
> outer table.  Now we'll have the option to consider merge and hash
> joins too.
> 
> 2. Allowing the planner to get better estimates of the result size of
> these special join types.  In the past we've had to kluge that, and
> the results weren't always good.  Part of what I'm doing (the unfinished
> part ;-)) is to make more information about join context available to
> selectivity estimation functions, which is something we've known we
> needed for awhile.  I can't yet *prove* that I can get better estimates
> with the added info, but if not, that just means I need to rethink what
> to pass down exactly.

OK, that sounds good. Are you also working on transforming NOT IN into
different form? Or is that the same thing as (1)?

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: WIP: patch to create explicit support for semi and anti joins

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> OK, that sounds good. Are you also working on transforming NOT IN into
> different form? Or is that the same thing as (1)?

I'm not currently thinking about NOT IN.  It could be transformed to
an antijoin if we could prove that no nulls are involved, but that
seems less than trivial as I noted earlier.
        regards, tom lane


Re: WIP: patch to create explicit support for semi and anti joins

From
"Kevin Grittner"
Date:
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> I can't yet *prove* that I can get better estimates
> with the added info, but if not, that just means I need to rethink
what
> to pass down exactly.
I'll see if I can do some testing here to confirm plan improvements
and check estimate accuracy.  This is only on the trunk, not any
stable branches?  I assume that effort should wait until you have a
candidate for the estimate improvements?
-Kevin


Re: WIP: patch to create explicit support for semi and anti joins

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
>> I can't yet *prove* that I can get better estimates
>> with the added info, but if not, that just means I need to rethink what
>> to pass down exactly.
> I'll see if I can do some testing here to confirm plan improvements
> and check estimate accuracy.  This is only on the trunk, not any
> stable branches?  I assume that effort should wait until you have a
> candidate for the estimate improvements?

Yeah.  If you feel like it you can test what I just committed, but it's
definitely not where I expect to end up in another few days.
        regards, tom lane


Re: WIP: patch to create explicit support for semi and anti joins

From
"Kevin Grittner"
Date:
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
> Introduce JOIN_SEMI and JOIN_ANTI join types, the former replacing
> JOIN_IN.  Unify the InClauseInfo and OuterJoinInfo infrastructure
into
> "SpecialJoinInfo".  Convert IN, EXISTS, and NOT EXISTS clauses at
top
> level of WHERE into semi and anti joins respectively.
It just struck me that this may cause additional joins to count
against the join_collapse_limit.  If so, that could cause some
borderline queries to optimize poorly if the limit isn't raised.  Is
that a reasonable concern?  Possibly something to document in the
release notes?
-Kevin



Re: WIP: patch to create explicit support for semi and anti joins

From
Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> wrote: 
>> Introduce JOIN_SEMI and JOIN_ANTI join types, the former replacing
>> JOIN_IN.
> It just struck me that this may cause additional joins to count
> against the join_collapse_limit.  If so, that could cause some
> borderline queries to optimize poorly if the limit isn't raised.  Is
> that a reasonable concern?  Possibly something to document in the
> release notes?

I don't think we should change anything there.  The collapse limit
settings are mainly driven by not wanting to get into an exponential
growth of planning time, and the way in which join situations are
created doesn't really affect the appropriate value one way or the
other.

In a case where this did happen, you'd just have exchanged one
suboptimal planning situation for another, so I'm unconvinced that it'd
make matters any worse compared to prior releases.  (That does seem like
a point worth testing, though, if you want to.)
        regards, tom lane