Thread: Is this proper UNION behavior??
According to my copy of SQL For Smarties by Joe Celko (2nd ed, p. 411): > The UNION removes all duplicate rows from the results and does > not care from which table the duplicate rows came. We could > use this feature to write a query to remove duplicates from a > table: > (TABLE tableA) > UNION > (TABLE tableA); > > But this is the same as > SELECT DISTINCT * FROM tableA; however, per the below example, the union and the select distinct are not the same in postgres 7.0.2. is joe missing somehting here? or am i? thanks --e-- CREATE TABLE "has_some_dupes" ( "a_col" character(3), "b_col" character(3), "c_col" character(3) ); COPY "has_some_dupes" FROM stdin; abc def ghi abc def ghi abc def ghi jkl mno pqr jkl mno pqr jkl mno pqr stu vwx yz stu vwx yz stu vwx yz \. ebridges=> (select * from has_some_dupes) ebridges-> UNION ebridges-> (select * from has_some_dupes); a_col | b_col | c_col -------+-------+------- abc | def | ghi abc | def | ghi abc | def | ghi jkl | mno | pqr jkl | mno | pqr jkl | mno | pqr stu | vwx | yz stu | vwx | yz stu | vwx | yz (9 rows) ebridges=> select distinct * from has_some_dupes; a_col | b_col | c_col -------+-------+------- abc | def | ghi jkl | mno | pqr stu | vwx | yz (3 rows)
"Edward Q. Bridges" <ed.bridges@buzznik.com> writes: > ebridges=> (select * from has_some_dupes) > ebridges-> UNION > ebridges-> (select * from has_some_dupes); [ produces the same result as "select * from has_some_dupes" ] Yup, that's a bug in 7.0.* (and a version or two before). There's a planning step that thinks it can simplify UNION/INTERSECT/EXCEPT expressions as if they were boolean expressions --- in particular, since "X or X = X", it thinks "X union X = X". Unfortunately, whoever wrote that code hadn't actually bothered to consult the SQL spec about the semantics of UNION, INTERSECT, or EXCEPT :-(. The spec's rules about duplicate elimination mean that the behavior is NOT quite like boolean OR/AND/AND NOT. Since people don't ordinarily write queries as silly as X union X, this bug hasn't been very high priority to fix. But it is fixed for 7.1 ... regards, tom lane
On Mon, Nov 13, 2000 at 10:12:39PM -0500, Tom Lane wrote: > "Edward Q. Bridges" <ed.bridges@buzznik.com> writes: > > ebridges=> (select * from has_some_dupes) > > ebridges-> UNION > > ebridges-> (select * from has_some_dupes); > > [ produces the same result as "select * from has_some_dupes" ] > > Yup, that's a bug in 7.0.* (and a version or two before). There's > a planning step that thinks it can simplify UNION/INTERSECT/EXCEPT > expressions as if they were boolean expressions --- in particular, > since "X or X = X", it thinks "X union X = X". Unfortunately, > whoever wrote that code hadn't actually bothered to consult the > SQL spec about the semantics of UNION, INTERSECT, or EXCEPT :-(. > The spec's rules about duplicate elimination mean that the behavior > is NOT quite like boolean OR/AND/AND NOT. > > Since people don't ordinarily write queries as silly as X union X, > this bug hasn't been very high priority to fix. But it is fixed > for 7.1 ... Of course, the real bug here is in SQL, namely that it allows duplicates in tables. Relations don't have duplicates (ask your nearest pet mathematician). Of course, 'UNION' understands taht relations don't have duplicates, but bizarrely, SELECT doesn't. ObPG: I'm no suggesting that Pg change this, by the way, I'm just ranting pointlessly about one of SQL's stupidities. Maybe Pg ought to have an option for 'bette-than-SQL', though, which would, among other thigns, not permit duplicates in relations (and switch back on that optimisation). Jules
> Of course, the real bug here is in SQL, namely that it allows > duplicates in tables. Or maybe not. Tables are not *always* relations, one may have a table of things to do. If one thing has to be done twice, it might have two entries on the table. It all depends on the workings of the programs using the database. Just imagine you SELECT a subset of a tables's unique rows, but that subset doesn't produce unique rows. Should the non-unique ones be removed behind your back? I think not.
On Tue, Nov 14, 2000 at 12:22:32PM -0300, Cristóvão B. B. Dalla Costa wrote: > > Of course, the real bug here is in SQL, namely that it allows > > duplicates in tables. > > Or maybe not. Tables are not *always* relations, one may have a table of > things to do. If one thing has to be done twice, it might have two entries > on the table. Hurrah! A nice, juicy, off-topic thread for me to get my teeth into and annoy the old-timers.. > > It all depends on the workings of the programs using the database. Just > imagine you SELECT a subset of a tables's unique rows, but that subset > doesn't produce unique rows. Should the non-unique ones be removed behind > your back? I think not. PostgreSQL is a /relational/ DBMS (with some object features). SQL is a /relational/ query language (with some flaws). They are based on a mathematical model (incomplete thought it may be). By and large, SQL tries to maintain this model: that is why UNION behaves the way it does. If SQL wasn't trying to be relational UNION would behave like that ghastly non-distinct version (is it called UNION ALL, maybe?). So I stick to my point that SELECT is the command which is the except from the general philsophy. In answer to your two examples: it is of course quite possible to model a todo list with duplicate entries in a relational fashion. It is absolutely correct behaviour that when you project only some columns, duplicates are eliminated. Jules
On Tue, 14 Nov 2000 13:35:14 +0000, Jules Bean wrote: > On Mon, Nov 13, 2000 at 10:12:39PM -0500, Tom Lane wrote: > > "Edward Q. Bridges" <ed.bridges@buzznik.com> writes: > > > ebridges=> (select * from has_some_dupes) > > > ebridges-> UNION > > > ebridges-> (select * from has_some_dupes); > > > > [ produces the same result as "select * from has_some_dupes" ] > > > Of course, the real bug here is in SQL, namely that it allows > duplicates in tables. > given the presence of SELECT DISTINCT and unique constraints, i wouldn't consider it a bug in SQL. probably better anyway to have the flexibility of having duplicates in a table, while providing the ability to constrain to unique tuples if necessary.
Jules Bean <jules@jellybean.co.uk> writes: > ObPG: I'm no suggesting that Pg change this, by the way, I'm just > ranting pointlessly about one of SQL's stupidities. Maybe Pg ought to > have an option for 'bette-than-SQL', though, which would, among other > thigns, not permit duplicates in relations (and switch back on that > optimisation). Just to clarify: it was not really an optimization, at least not to my mind. How many people are ever going to write "X union X" in real queries? As an optimization attempt, it was a complete waste of time to be looking for this case --- the savings on the once-in-a-blue-moon silly query would never pay for the cycles burned to make the test on every other query. The reason that code was there in the first place is that the pre-7.1 implementation method for UNION/INTERSECT/EXCEPT could only cope with query trees in which all the UNIONs were at the top. So what was *really* being done was a transformation to disjunctive normal form, eg, (X union Y) intersect Z becomes (X intersect Z) union (Y intersect Z) The simplication of cases like X union X was a byproduct of the DNF transformation routine, which was an algorithm designed for boolean expressions in which such a transformation is completely legitimate. Now of course what will leap out at you here is that the transformed query is likely to be considerably *more* work than the original. So for real queries, this wasn't an optimization at all, but a pessimization. I have no intention of putting it back regardless of debates about better-than-SQL semantics ;-) regards, tom lane