Thread: Proposal for supporting outer joins in 7.1
I would like to see if we can't get some amount of OUTER JOIN functionality into the 7.1 release. It seems that the executor implementation should be pretty simple, at least for the left-join case (which is what people appear to need the most). What we are mostly missing is a suitable parsetree representation plus appropriate handling in the planner. Right now, the rangetable associated with a query serves two purposes. First, it provides a list of all relations used in the query for Var nodes to refer to (a Var's "varno" field is just an index into the rangetable list). Second, it drives the planner's construction of the join tree for the query (all entries that are marked inJoinSet will be added to the join tree in some order). I think what we want to do is separate these two functions into two data structures. The rangetable list is just fine for looking up Var references, but it's not sufficient information for representing multiple types of joins. A ready-to-plan Query node should have an additional subsidiary data structure that describes the required join tree of relations. The additional data structure could really be pretty close to the raw parse structure that the grammar emits for FROM clauses. I envision it as a tree whose leaf nodes are references to rangetable entries and whose upper-level nodes correspond to JOIN clauses (each JOIN node has pointers to its child nodes plus information about the specified join type and any supplied constraint conditions). If the FROM clause has more than one item, the top level of the join tree is a CROSS JOIN node with all the FROM-clause items as children. (Note that at least for this case, a join node needs to be able to have more than two children.) If anything, this structure should be easier for the parser/analyzer to emit than what it's doing now. Notice in particular that ON/USING qual clauses associated with joins are to be left in the join tree, *not* merged into the main WHERE clause. (The planner would just have to separate them out again and discover which join they should go with, so what's the point of merging them into WHERE?) I envision the planner as operating by walking this join tree and considering plans that perform each specified join. At nodes with more than two children (such as the top-level CROSS JOIN from a many-element FROM clause) the planner will consider all possible orders for joining the children together, the same as it does now. Note that this implementation implies that the user can guide/constrain the planner by writing an appropriate FROM clause. For example: SELECT ... FROM a,b,cPlanner will consider all three possible join orders:(a cross join b) cross join c(a cross join c) crossjoin b(b cross join c) cross join a SELECT ... FROM (a cross join b) cross join cPlanner will only consider joining a with b and then adding c. This might be construed as either a feature or a bug depending on your views about how much responsibility to put on the planner vs. the user. In any case it seems like a reasonable first-cut implementation; we can think about improvements later. Thoughts, objections, better ideas? One thing I'm not very clear on is how much has been done in the parser about the ISO rules for visibility and aliasing of table columns in explicit JOINs. Thomas, can you sketch out where we stand in that area? regards, tom lane
> I would like to see if we can't get some amount of OUTER JOIN functionality > into the 7.1 release. Great! > It seems that the executor implementation should be > pretty simple, at least for the left-join case (which is what people appear > to need the most). What we are mostly missing is a suitable parsetree > representation plus appropriate handling in the planner. I've already got the merge-join code walking left, right, and full joins (look for #ifdef code in the appropriate routines). Just didn't implment the "null column filling", and didn't figure out how to push the left/right/full flags into the executor. I haven't looked at the other join techniques to see how easy those would be. > Right now, the rangetable associated with a query serves two purposes. > First, it provides a list of all relations used in the query for Var nodes > to refer to (a Var's "varno" field is just an index into the rangetable > list). Second, it drives the planner's construction of the join tree for > the query (all entries that are marked inJoinSet will be added to the join > tree in some order). I think what we want to do is separate these two > functions into two data structures. The rangetable list is just fine for > looking up Var references, but it's not sufficient information for > representing multiple types of joins. A ready-to-plan Query node should > have an additional subsidiary data structure that describes the required > join tree of relations. > > The additional data structure could really be pretty close to the raw parse > structure that the grammar emits for FROM clauses. I envision it as a tree > whose leaf nodes are references to rangetable entries and whose upper-level > nodes correspond to JOIN clauses (each JOIN node has pointers to its child > nodes plus information about the specified join type and any supplied > constraint conditions). If the FROM clause has more than one item, the > top level of the join tree is a CROSS JOIN node with all the FROM-clause > items as children. (Note that at least for this case, a join node needs > to be able to have more than two children.) > > If anything, this structure should be easier for the parser/analyzer to emit > than what it's doing now. Notice in particular that ON/USING qual clauses > associated with joins are to be left in the join tree, *not* merged into the > main WHERE clause. (The planner would just have to separate them out again > and discover which join they should go with, so what's the point of merging > them into WHERE?) > > I envision the planner as operating by walking this join tree and > considering plans that perform each specified join. At nodes with more than > two children (such as the top-level CROSS JOIN from a many-element FROM > clause) the planner will consider all possible orders for joining the > children together, the same as it does now. Note that this implementation > implies that the user can guide/constrain the planner by writing an > appropriate FROM clause. For example: > > SELECT ... FROM a,b,c > Planner will consider all three possible join orders: > (a cross join b) cross join c > (a cross join c) cross join b > (b cross join c) cross join a > > SELECT ... FROM (a cross join b) cross join c > Planner will only consider joining a with b and then adding c. > > This might be construed as either a feature or a bug depending on your > views about how much responsibility to put on the planner vs. the user. > In any case it seems like a reasonable first-cut implementation; we can > think about improvements later. > > Thoughts, objections, better ideas? > > One thing I'm not very clear on is how much has been done in the parser > about the ISO rules for visibility and aliasing of table columns in > explicit JOINs. Thomas, can you sketch out where we stand in that area? The current code does not do the scoping rules quite right, but it isn't very far wrong either. I think that the scoping and aliasing can be contained within the parser code, swallowing the intermediate scopes by resolving back to the original names. However, if we continue to do it this way (losing intermediate alias names) then we will be making it harder for the planner/optimizer to jiggle up the plans, since the query tree will have already assumed some references. For example: select * from (a join b on (i)) join c using (i = j); is equivalent to select * from (a join b on (i)) as t1 (i, ...) join c using (t1.i = j); but the parser may already recast it as select a.i, a..., b.x, b..., c.j, c... from a, b, c where (a.i = b.i) and (a.i = c.j); so the optimizer would have to figure out for itself that a.i is interchangable with b.i. (Also, you'll remember that it currently barfs on three-way joins; I haven't had a chance to look at that yet). Getting outer joins for 7.1 would be great. Count me in to help... - Thomas
> I would like to see if we can't get some amount of OUTER JOIN functionality > into the 7.1 release. It seems that the executor implementation should be Just a word of support. Outer Joins are one of the few things that PostgreSQL misses to really make it big time. If you can help 90% of all cases until the "real" solution comes in 7.2, there's no reason to hesitate. -- Kaare Rasmussen --Linux, spil,-- Tlf: 3816 2582 Kaki Data tshirts, merchandize Fax: 3816 2582 Howitzvej 75 �ben 14.00-18.00 Email: kar@webline.dk 2000 Frederiksberg L�rdag 11.00-17.00 Web: www.suse.dk
Thomas Lockhart <lockhart@alumni.caltech.edu> writes: > I've already got the merge-join code walking left, right, and full joins > (look for #ifdef code in the appropriate routines). OK, will look. > Just didn't implment > the "null column filling", and didn't figure out how to push the > left/right/full flags into the executor. I can deal with the null tuple insertion. As for the flags, we just need to add those to the Plan nodes for joins ... but first the info has to get to the planner, thus we need to fix the parser output. > I haven't looked at the other > join techniques to see how easy those would be. Offhand it seems that left join is trivial in all three join types. Right join can be handled by swapping the two tables (make the inner outer), but full join only seems possible with a merge join. I am also concerned about whether even a merge join can do right/full joins correctly, if the merge qual (ordering expression) does not include *all* the relevant WHERE conditions. > The current code does not do the scoping rules quite right, but it isn't > very far wrong either. I think that the scoping and aliasing can be > contained within the parser code, swallowing the intermediate scopes by > resolving back to the original names. However, if we continue to do it > this way (losing intermediate alias names) then we will be making it > harder for the planner/optimizer to jiggle up the plans, since the query > tree will have already assumed some references. How so? The planner only sees Var nodes, which are unambiguous by definition: rangetable index and attribute number don't have any dependency on aliases. I think the only real problem here is whether analyze.c resolves column names correctly, ie searching the right set of aliases, in each part of the query text. (If I understand the SQL spec correctly, the set of aliases visible in a JOIN's ON/USING clauses is different from what's visible elsewhere in the query --- do we do that right, or not?) > but the parser may already recast it as > select a.i, a..., b.x, b..., c.j, c... from a, b, c > where (a.i = b.i) and (a.i = c.j); That is what the parser currently does, but I think it's actually easier all round if we leave the ON/USING conditions attached to the relevant JoinExpr node, instead of pushing them over to the main WHERE clause. > (Also, you'll remember that it currently barfs on three-way joins; I > haven't had a chance to look at that yet). IIRC the barf is directly related to the lack of suitable parsetree representation for this stuff, so I suspect that we need not worry too much about fixing that problem. It should go away by itself once we nail down a better representation. > Getting outer joins for 7.1 would be great. Count me in to help... OK, good. I think I have enough of a handle on the issues downstream from the parser. Are you interested in dealing with the column lookup/ aliasing questions? regards, tom lane
> OK, good. I think I have enough of a handle on the issues downstream > from the parser. Are you interested in dealing with the column lookup/ > aliasing questions? Sure. That's a fun part for me... - Thomas
At 16:07 25/08/00 -0400, Tom Lane wrote: > For example: > >SELECT ... FROM a,b,c > Planner will consider all three possible join orders: > (a cross join b) cross join c > (a cross join c) cross join b > (b cross join c) cross join a > >SELECT ... FROM (a cross join b) cross join c > Planner will only consider joining a with b and then adding c. > I'm not sure whether the above is the syntax you plan to use, but it looks a little too much like: SELECT ... FROM (select * from a cross join b) as z cross join c which has a quite different meaning to any kind of outer join, and the parenthesis, in this case, should not affect the planner choices. I don;t think that support for this kind of query is implemented, but it could confuse things a little. As an aside, while you are in there, I don't suppose you would be able to implement that above syntax as well? Then, maybe, have a go at the common cold, too. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/
Philip Warner <pjw@rhyme.com.au> writes: >> SELECT ... FROM (a cross join b) cross join c > I'm not sure whether the above is the syntax you plan to use, but it looks > a little too much like: > SELECT ... FROM (select * from a cross join b) as z cross join c > which has a quite different meaning to any kind of outer join, Huh? AFAIK they mean exactly the same thing, modulo a few issues about visibility of columns. In any case you'll have to take your complaint to ISO, because that's what the spec says the syntaxes are. > I don;t think that support for this kind of query is implemented, Not yet, but it's certainly on the TODO list. I'm not seriously thinking about getting subselect-in-FROM done in this go-round, though. regards, tom lane
At 11:38 27/08/00 -0400, Tom Lane wrote: >Philip Warner <pjw@rhyme.com.au> writes: >>> SELECT ... FROM (a cross join b) cross join c >> I'm not sure whether the above is the syntax you plan to use, but it looks >> a little too much like: >> SELECT ... FROM (select * from a cross join b) as z cross join c >> which has a quite different meaning to any kind of outer join, > >Huh? AFAIK they mean exactly the same thing, modulo a few issues about >visibility of columns. In any case you'll have to take your complaint >to ISO, because that's what the spec says the syntaxes are. Sorry, I should have written: SELECT ... FROM (select * from a,b) as z, c which does not do an outer join, and does not imply any ordering. >> I don;t think that support for this kind of query is implemented, > >Not yet, but it's certainly on the TODO list. I'm not seriously >thinking about getting subselect-in-FROM done in this go-round, >though. Great! If by Subselect-in-from you mean something like what I wrote above, then it is a major win. ---------------------------------------------------------------- Philip Warner | __---_____ Albatross Consulting Pty. Ltd. |----/ - \ (A.B.N. 75 008 659 498) | /(@) ______---_ Tel: (+61) 0500 83 82 81 | _________ \ Fax: (+61) 0500 83 82 82 | ___________ | Http://www.rhyme.com.au | / \| | --________-- PGP key available upon request, | / and from pgp5.ai.mit.edu:11371 |/