Thread: WIP: patch to create explicit support for semi and anti joins
Here's a snapshot of my current work on improving IN/EXISTS support, pursuant to the discussion here: http://archives.postgresql.org/pgsql-hackers/2008-08/msg00347.php I'm posting it because it's a pretty large patch already, and because I wanted to get comments on the next step I'm planning. What's done: Introduce JOIN_SEMI and JOIN_ANTI join types, the former replacing JOIN_IN. Unify the InClauseInfo and OuterJoinInfo infrastructure into "SpecialJoinInfo". Convert IN, EXISTS, and NOT EXISTS clauses at top level of WHERE into semi and anti joins respectively. Recognize LEFT JOIN with a suitable IS NULL filter condition as an anti join. This all compiles and passes the regression tests. What's not done: nodeMergejoin.c doesn't yet handle JOIN_ANTI. (This is just a SMOP, but it's a lot more complicated than the nestloop or hash logic, and I figured nestloop and hash were enough for testing the planner.) In the meantime, joinpath.c contains some crude hacks to prevent a merge antijoin path from being created. Selectivity estimation for semijoins and antijoins is still mostly bogus. What I have in mind to do to fix selectivity estimation is to pass around the SpecialJoinInfo struct associated with any join operation other than a plain-vanilla INNER join. (Unifying InClauseInfo and OuterJoinInfo has ensured that there always is one.) This is happening in the core planner already, as of the attached patch. But in order to get anything useful done it has to be made available to operator join selectivity estimation functions too --- in particular, eqjoinsel and neqjoinsel, which have been limping along for years with the knowledge that they didn't know enough about their context. Having the SpecialJoinInfo will let them know not only that they are estimating for a nonstandard join, but which side of the join is doing what. For backwards compatibility, I think we should allow oprjoin functions to have either the signature (internal, oid, internal, smallint) (really PlannerInfo *, Oid, List *, JoinType) or the signature (internal, oid, internal, smallint, internal) (really PlannerInfo *, Oid, List *, JoinType, SpecialJoinInfo *). The former is the old style and will allow us to not break user-defined selectivity functions in 8.4. But we should convert all the built-in oprjoin functions to the new style. Comments? Barring objections, I'm thinking of committing what I have so far and then starting to work on the selectivity fixes as a separate patch. regards, tom lane
Attachment
On Aug 13, 2008, at 17:31, Tom Lane wrote: > What's done: > > Introduce JOIN_SEMI and JOIN_ANTI join types, the former replacing > JOIN_IN. Unify the InClauseInfo and OuterJoinInfo infrastructure into > "SpecialJoinInfo". Convert IN, EXISTS, and NOT EXISTS clauses at top > level of WHERE into semi and anti joins respectively. Recognize > LEFT JOIN with a suitable IS NULL filter condition as an anti join. > This all compiles and passes the regression tests. Wow. That sound awesome, Tom. Stupid question: Do these join types have some sort of correspondence to the SQL standard? Or would they be specific to PostgreSQL? Or is this just something that's under the hood an not actually a change to the syntax of SQL joins? > What's not done: > > nodeMergejoin.c doesn't yet handle JOIN_ANTI. (This is just a SMOP, > but it's a lot more complicated than the nestloop or hash logic, and > I figured nestloop and hash were enough for testing the planner.) I guess that means you plan to do it once there has been significant testing with nestloop and hash and when the selectivity stuff is done? Best, David (Who is in over his head, but decides to stick his toe in the water anyway.)
"David E. Wheeler" <david@kineticode.com> writes: > On Aug 13, 2008, at 17:31, Tom Lane wrote: >> Introduce JOIN_SEMI and JOIN_ANTI join types, > Wow. That sound awesome, Tom. Stupid question: Do these join types > have some sort of correspondence to the SQL standard? Semi and anti joins are pretty standard concepts in relational theory, but they have no direct mapping in the SQL join syntax. You can write them with certain well-known locutions, though:IN and EXISTS, with certain restrictions, represent semi joinNOT EXISTS, withcertain restrictions, represents anti joinLEFT JOIN with an "incompatible" higher IS NULL test represents anti join Basically what this patch is about is teaching the planner that these constructs are best understood via the relational-theory concepts. We'd been doing it in a pretty ad-hoc way before, and run into a lot of problems that we've had to kluge around. I think that this approach provides a structure that will actually work well. > Or is this just something that's under the > hood an not actually a change to the syntax of SQL joins? Right, there's no "user visible" feature or syntax change here. We're just trying to provide better performance for certain common SQL idioms. >> What's not done: >> >> nodeMergejoin.c doesn't yet handle JOIN_ANTI. (This is just a SMOP, > I guess that means you plan to do it once there has been significant > testing with nestloop and hash and when the selectivity stuff is done? Actually, I got it done an hour or so ago --- it turned out to be easier than I thought. It just didn't seem like part of the critical path for the patch, so I'd been willing to let it go till later. regards, tom lane
On Aug 13, 2008, at 20:12, Tom Lane wrote: >> Wow. That sound awesome, Tom. Stupid question: Do these join types >> have some sort of correspondence to the SQL standard? > > Semi and anti joins are pretty standard concepts in relational theory, > but they have no direct mapping in the SQL join syntax. You can write > them with certain well-known locutions, though: > IN and EXISTS, with certain restrictions, represent semi join > NOT EXISTS, with certain restrictions, represents anti join > LEFT JOIN with an "incompatible" higher IS NULL test represents > anti join > > Basically what this patch is about is teaching the planner that these > constructs are best understood via the relational-theory concepts. > We'd been doing it in a pretty ad-hoc way before, and run into a lot > of problems that we've had to kluge around. I think that this > approach > provides a structure that will actually work well. Great. Thanks for the explanation, Tom, as always. >> Or is this just something that's under the >> hood an not actually a change to the syntax of SQL joins? > > Right, there's no "user visible" feature or syntax change here. We're > just trying to provide better performance for certain common SQL > idioms. Good, it makes a lot of sense. > >>> What's not done: >>> >>> nodeMergejoin.c doesn't yet handle JOIN_ANTI. (This is just a SMOP, > >> I guess that means you plan to do it once there has been significant >> testing with nestloop and hash and when the selectivity stuff is >> done? > > Actually, I got it done an hour or so ago --- it turned out to be > easier > than I thought. It just didn't seem like part of the critical path > for > the patch, so I'd been willing to let it go till later. I love it when things work that way. :-) Best, David
On Wed, 2008-08-13 at 23:12 -0400, Tom Lane wrote: > We're just trying to provide better performance for certain common SQL > idioms. Sounds good, but can you explain how this will help? Not questioning it, just after more information about it. I'm half way through join removal patch, so this work might extend the join elimination to semi/anti joins also (hopefully), or it might (hopefully not) prevent the join elimination altogether. I'll let you know how I get on. -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs <simon@2ndquadrant.com> writes: > On Wed, 2008-08-13 at 23:12 -0400, Tom Lane wrote: >> We're just trying to provide better performance for certain common SQL >> idioms. > Sounds good, but can you explain how this will help? 1. Allowing optimization of EXISTS/NOT EXISTS as general-purpose joins. Up to now, the best plan you could hope for was the equivalent of a nestloop with inner indexscan, with the EXISTS subquery on the inside. While that's not necessarily bad, it could be pretty bad for a large outer table. Now we'll have the option to consider merge and hash joins too. 2. Allowing the planner to get better estimates of the result size of these special join types. In the past we've had to kluge that, and the results weren't always good. Part of what I'm doing (the unfinished part ;-)) is to make more information about join context available to selectivity estimation functions, which is something we've known we needed for awhile. I can't yet *prove* that I can get better estimates with the added info, but if not, that just means I need to rethink what to pass down exactly. regards, tom lane
On Thu, 2008-08-14 at 10:04 -0400, Tom Lane wrote: > Simon Riggs <simon@2ndquadrant.com> writes: > > On Wed, 2008-08-13 at 23:12 -0400, Tom Lane wrote: > >> We're just trying to provide better performance for certain common SQL > >> idioms. > > > Sounds good, but can you explain how this will help? > > 1. Allowing optimization of EXISTS/NOT EXISTS as general-purpose joins. > Up to now, the best plan you could hope for was the equivalent of a > nestloop with inner indexscan, with the EXISTS subquery on the inside. > While that's not necessarily bad, it could be pretty bad for a large > outer table. Now we'll have the option to consider merge and hash > joins too. > > 2. Allowing the planner to get better estimates of the result size of > these special join types. In the past we've had to kluge that, and > the results weren't always good. Part of what I'm doing (the unfinished > part ;-)) is to make more information about join context available to > selectivity estimation functions, which is something we've known we > needed for awhile. I can't yet *prove* that I can get better estimates > with the added info, but if not, that just means I need to rethink what > to pass down exactly. OK, that sounds good. Are you also working on transforming NOT IN into different form? Or is that the same thing as (1)? -- Simon Riggs www.2ndQuadrant.comPostgreSQL Training, Services and Support
Simon Riggs <simon@2ndquadrant.com> writes: > OK, that sounds good. Are you also working on transforming NOT IN into > different form? Or is that the same thing as (1)? I'm not currently thinking about NOT IN. It could be transformed to an antijoin if we could prove that no nulls are involved, but that seems less than trivial as I noted earlier. regards, tom lane
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: > I can't yet *prove* that I can get better estimates > with the added info, but if not, that just means I need to rethink what > to pass down exactly. I'll see if I can do some testing here to confirm plan improvements and check estimate accuracy. This is only on the trunk, not any stable branches? I assume that effort should wait until you have a candidate for the estimate improvements? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I can't yet *prove* that I can get better estimates >> with the added info, but if not, that just means I need to rethink what >> to pass down exactly. > I'll see if I can do some testing here to confirm plan improvements > and check estimate accuracy. This is only on the trunk, not any > stable branches? I assume that effort should wait until you have a > candidate for the estimate improvements? Yeah. If you feel like it you can test what I just committed, but it's definitely not where I expect to end up in another few days. regards, tom lane
>>> Tom Lane <tgl@sss.pgh.pa.us> wrote: > Introduce JOIN_SEMI and JOIN_ANTI join types, the former replacing > JOIN_IN. Unify the InClauseInfo and OuterJoinInfo infrastructure into > "SpecialJoinInfo". Convert IN, EXISTS, and NOT EXISTS clauses at top > level of WHERE into semi and anti joins respectively. It just struck me that this may cause additional joins to count against the join_collapse_limit. If so, that could cause some borderline queries to optimize poorly if the limit isn't raised. Is that a reasonable concern? Possibly something to document in the release notes? -Kevin
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes: > Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Introduce JOIN_SEMI and JOIN_ANTI join types, the former replacing >> JOIN_IN. > It just struck me that this may cause additional joins to count > against the join_collapse_limit. If so, that could cause some > borderline queries to optimize poorly if the limit isn't raised. Is > that a reasonable concern? Possibly something to document in the > release notes? I don't think we should change anything there. The collapse limit settings are mainly driven by not wanting to get into an exponential growth of planning time, and the way in which join situations are created doesn't really affect the appropriate value one way or the other. In a case where this did happen, you'd just have exchanged one suboptimal planning situation for another, so I'm unconvinced that it'd make matters any worse compared to prior releases. (That does seem like a point worth testing, though, if you want to.) regards, tom lane