Thread: Oops, I seem to have changed UNION's behavior
The equal() updates I installed yesterday (to fix the "don't know whether nodes of type 600 are equal" problem) have had an unintended side effect. Am I right in thinking that UNION (without ALL) is defined to do a DISTINCT on its result, so that duplicates are removed even if the duplicates both came from the same source table? That's what 6.4.2 does, but I do not know if it's strictly kosher according to the SQL spec. If so, the code is now busted, because with the equal() extension in place, cnfify() is able to recognize and remove duplicate select clauses. That is, "SELECT xxx UNION SELECT xxx" will be folded to just "SELECT xxx" ... and that doesn't mean the same thing. An actual example: given the data play=> select a from tt; a - 1 1 2 3 (4 rows) Under 6.4.2 I get: play=> select a from tt union select a from tt; a - 1 2 3 (3 rows) Note lack of duplicate "1". Under current sources I get: ttest=> select a from tt union select a from tt; a - 1 1 2 3 (4 rows) since the query is effectively reduced to just "select a from tt". Assuming that 6.4.2 is doing the Right Thing, I see two possible fixes: (1) simplify equal() to say that two T_Query nodes are never equal, or (2) modify the planner so that the "select distinct" operation is inserted explicitly, and will thus happen even if the UNIONed selects are collapsed into just one. (1) is a trivial fix of course, but it worries me --- maybe someday we will need equal() to give an honest answer for Query nodes. But I don't have the expertise to apply (2), and it seems like rather a lot of work for a boundary case that isn't really interesting in practice. Comments? *Is* 6.4.2 behaving according to the SQL spec? regards, tom lane
Hello! On Sun, 7 Feb 1999, Tom Lane wrote: > The equal() updates I installed yesterday (to fix the "don't know > whether nodes of type 600 are equal" problem) have had an unintended > side effect. > > Am I right in thinking that UNION (without ALL) is defined to do a > DISTINCT on its result, so that duplicates are removed even if the > duplicates both came from the same source table? That's what 6.4.2 > does, but I do not know if it's strictly kosher according to the SQL > spec. Yes, this is standard. My books (primary, Gruber) say UNION should work this way - UNION without ALL implies DISTINCT. > If so, the code is now busted, because with the equal() extension in > place, cnfify() is able to recognize and remove duplicate select > clauses. That is, "SELECT xxx UNION SELECT xxx" will be folded to > just "SELECT xxx" ... and that doesn't mean the same thing. > > An actual example: given the data > > play=> select a from tt; > a > - > 1 > 1 > 2 > 3 > (4 rows) > > Under 6.4.2 I get: > > play=> select a from tt union select a from tt; > a > - > 1 > 2 > 3 > (3 rows) > > Note lack of duplicate "1". Under current sources I get: > > ttest=> select a from tt union select a from tt; > a > - > 1 > 1 > 2 > 3 > (4 rows) > > since the query is effectively reduced to just "select a from tt". I am sure my books did not consider such case as UNION that could be otimized this way. Not sure what is Right Thing here... Oleg. ---- Oleg Broytmann National Research Surgery Centre http://sun.med.ru/~phd/ Programmers don't die, they justGOSUB without RETURN.
> The equal() updates I installed yesterday (to fix the "don't know > whether nodes of type 600 are equal" problem) have had an unintended > side effect. > Am I right in thinking that UNION (without ALL) is defined to do a > DISTINCT on its result, so that duplicates are removed even if the > duplicates both came from the same source table? That's what 6.4.2 > does, but I do not know if it's strictly kosher according to the SQL > spec. Yup, that's the way it should be... > If so, the code is now busted, because with the equal() extension in > place, cnfify() is able to recognize and remove duplicate select > clauses. That is, "SELECT xxx UNION SELECT xxx" will be folded to > just "SELECT xxx" ... and that doesn't mean the same thing. :( > Assuming that 6.4.2 is doing the Right Thing, I see two possible > fixes: > (1) simplify equal() to say that two T_Query nodes are never equal, or > (2) modify the planner so that the "select distinct" operation is > inserted explicitly, and will thus happen even if the UNIONed selects > are collapsed into just one. Sounds to me like (2) is the correct way to do it, as long as the explicit "sort/unique" happens only if the query is collapsed. I haven't looked at this code, and have no experience with cnfify(), but at the time it decides to collapse into a single select, can't it ensure that a sort/unique is done instead? Or is that what you are suggesting? Wouldn't want to have two sort/unique operations on top of each other... - Tom
Was there a conclusion on this? > The equal() updates I installed yesterday (to fix the "don't know > whether nodes of type 600 are equal" problem) have had an unintended > side effect. > > Am I right in thinking that UNION (without ALL) is defined to do a > DISTINCT on its result, so that duplicates are removed even if the > duplicates both came from the same source table? That's what 6.4.2 > does, but I do not know if it's strictly kosher according to the SQL > spec. > > If so, the code is now busted, because with the equal() extension in > place, cnfify() is able to recognize and remove duplicate select > clauses. That is, "SELECT xxx UNION SELECT xxx" will be folded to > just "SELECT xxx" ... and that doesn't mean the same thing. > > An actual example: given the data > > play=> select a from tt; > a > - > 1 > 1 > 2 > 3 > (4 rows) > > Under 6.4.2 I get: > > play=> select a from tt union select a from tt; > a > - > 1 > 2 > 3 > (3 rows) > > Note lack of duplicate "1". Under current sources I get: > > ttest=> select a from tt union select a from tt; > a > - > 1 > 1 > 2 > 3 > (4 rows) > > since the query is effectively reduced to just "select a from tt". > > Assuming that 6.4.2 is doing the Right Thing, I see two possible fixes: > (1) simplify equal() to say that two T_Query nodes are never equal, or > (2) modify the planner so that the "select distinct" operation is > inserted explicitly, and will thus happen even if the UNIONed selects > are collapsed into just one. > > (1) is a trivial fix of course, but it worries me --- maybe someday > we will need equal() to give an honest answer for Query nodes. > But I don't have the expertise to apply (2), and it seems like rather > a lot of work for a boundary case that isn't really interesting in > practice. > > Comments? *Is* 6.4.2 behaving according to the SQL spec? > > regards, tom lane > > -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
> > Am I right in thinking that UNION (without ALL) is defined to do a > > DISTINCT on its result, so that duplicates are removed even if the > > duplicates both came from the same source table? That's what 6.4.2 > > does, but I do not know if it's strictly kosher according to the SQL > > spec. (Just in case this is still active) Yes, this is the right behavior according to SQL92... - Thomas -- Thomas Lockhart lockhart@alumni.caltech.edu South Pasadena, California
Thomas Lockhart <lockhart@alumni.caltech.edu> writes: >>>> Am I right in thinking that UNION (without ALL) is defined to do a >>>> DISTINCT on its result, so that duplicates are removed even if the >>>> duplicates both came from the same source table? That's what 6.4.2 >>>> does, but I do not know if it's strictly kosher according to the SQL >>>> spec. > (Just in case this is still active) > Yes, this is the right behavior according to SQL92... OK, then 6.5 is still broken :-(. I know a lot more about the planner than I did then, so I will see if I can fix it "right" --- that is, without taking out equal()'s ability to detect equality of Query nodes. If that seems too hard/risky, I will just lobotomize equal() instead. Thanks for the reminder, Bruce --- I had forgotten about this issue. regards, tom lane
On Sun, 9 May 1999, Thomas Lockhart wrote: > > > Am I right in thinking that UNION (without ALL) is defined to do a > > > DISTINCT on its result, so that duplicates are removed even if the > > > duplicates both came from the same source table? That's what 6.4.2 > > > does, but I do not know if it's strictly kosher according to the SQL > > > spec. > > Yes, this is the right behavior according to SQL92... In which case something should put a DISTINCT on queries using UNION... since making T_Query nodes never equal is a deoptimization. Taral
> Thomas Lockhart <lockhart@alumni.caltech.edu> writes: > >>>> Am I right in thinking that UNION (without ALL) is defined to do a > >>>> DISTINCT on its result, so that duplicates are removed even if the > >>>> duplicates both came from the same source table? That's what 6.4.2 > >>>> does, but I do not know if it's strictly kosher according to the SQL > >>>> spec. > > > (Just in case this is still active) > > > Yes, this is the right behavior according to SQL92... > > OK, then 6.5 is still broken :-(. I know a lot more about the planner > than I did then, so I will see if I can fix it "right" --- that is, > without taking out equal()'s ability to detect equality of Query nodes. > > If that seems too hard/risky, I will just lobotomize equal() instead. > > Thanks for the reminder, Bruce --- I had forgotten about this issue. Hey, that's why I keep 500 messages in my PostgreSQL mailbox. -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Can someone comment on this? Is it still an issue with cnf'ify removing duplicate cases? > The equal() updates I installed yesterday (to fix the "don't know > whether nodes of type 600 are equal" problem) have had an unintended > side effect. > > Am I right in thinking that UNION (without ALL) is defined to do a > DISTINCT on its result, so that duplicates are removed even if the > duplicates both came from the same source table? That's what 6.4.2 > does, but I do not know if it's strictly kosher according to the SQL > spec. > > If so, the code is now busted, because with the equal() extension in > place, cnfify() is able to recognize and remove duplicate select > clauses. That is, "SELECT xxx UNION SELECT xxx" will be folded to > just "SELECT xxx" ... and that doesn't mean the same thing. > > An actual example: given the data > > play=> select a from tt; > a > - > 1 > 1 > 2 > 3 > (4 rows) > > Under 6.4.2 I get: > > play=> select a from tt union select a from tt; > a > - > 1 > 2 > 3 > (3 rows) > > Note lack of duplicate "1". Under current sources I get: > > ttest=> select a from tt union select a from tt; > a > - > 1 > 1 > 2 > 3 > (4 rows) > > since the query is effectively reduced to just "select a from tt". > > Assuming that 6.4.2 is doing the Right Thing, I see two possible fixes: > (1) simplify equal() to say that two T_Query nodes are never equal, or > (2) modify the planner so that the "select distinct" operation is > inserted explicitly, and will thus happen even if the UNIONed selects > are collapsed into just one. > > (1) is a trivial fix of course, but it worries me --- maybe someday > we will need equal() to give an honest answer for Query nodes. > But I don't have the expertise to apply (2), and it seems like rather > a lot of work for a boundary case that isn't really interesting in > practice. > > Comments? *Is* 6.4.2 behaving according to the SQL spec? > > regards, tom lane > > -- Bruce Momjian | http://www.op.net/~candle maillist@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian <maillist@candle.pha.pa.us> writes: > Can someone comment on this? Is it still an issue with cnf'ify removing > duplicate cases? This is still an open bug. I looked at it and concluded that a clean solution probably requires modifying the parsetree structure for UNIONs. (Currently, UNION, INTERSECT, EXCEPT are converted to OR, AND, AND NOT structures with queries as arguments ... which would be OK, except that the semantics aren't really the same, and the difference between UNION and UNION ALL doesn't fit into that operator set at all.) That looked like a lot of work for what is surely a low-priority bug, so I haven't gotten around to it yet. Need more round tuits. In the meantime, it should be on the TODO list: * SELECT foo UNION SELECT foo is incorrectly simplified to SELECT foo regards, tom lane >> The equal() updates I installed yesterday (to fix the "don't know >> whether nodes of type 600 are equal" problem) have had an unintended >> side effect. >> >> Am I right in thinking that UNION (without ALL) is defined to do a >> DISTINCT on its result, so that duplicates are removed even if the >> duplicates both came from the same source table? That's what 6.4.2 >> does, but I do not know if it's strictly kosher according to the SQL >> spec. >> >> If so, the code is now busted, because with the equal() extension in >> place, cnfify() is able to recognize and remove duplicate select >> clauses. That is, "SELECT xxx UNION SELECT xxx" will be folded to >> just "SELECT xxx" ... and that doesn't mean the same thing. >> >> An actual example: given the data >> >> play=> select a from tt; >> a >> - >> 1 >> 1 >> 2 >> 3 >> (4 rows) >> >> Under 6.4.2 I get: >> >> play=> select a from tt union select a from tt; >> a >> - >> 1 >> 2 >> 3 >> (3 rows) >> >> Note lack of duplicate "1". Under current sources I get: >> >> ttest=> select a from tt union select a from tt; >> a >> - >> 1 >> 1 >> 2 >> 3 >> (4 rows) >> >> since the query is effectively reduced to just "select a from tt". >> >> Assuming that 6.4.2 is doing the Right Thing, I see two possible fixes: >> (1) simplify equal() to say that two T_Query nodes are never equal, or >> (2) modify the planner so that the "select distinct" operation is >> inserted explicitly, and will thus happen even if the UNIONed selects >> are collapsed into just one. >> >> (1) is a trivial fix of course, but it worries me --- maybe someday >> we will need equal() to give an honest answer for Query nodes. >> But I don't have the expertise to apply (2), and it seems like rather >> a lot of work for a boundary case that isn't really interesting in >> practice. >> >> Comments? *Is* 6.4.2 behaving according to the SQL spec? >> >> regards, tom lane >> >> > -- > Bruce Momjian | http://www.op.net/~candle > maillist@candle.pha.pa.us | (610) 853-3000 > + If your life is a hard drive, | 830 Blythe Avenue > + Christ can be your backup. | Drexel Hill, Pennsylvania 19026