Thread: Is this proper UNION behavior??

Is this proper UNION behavior??

From
"Edward Q. Bridges"
Date:
According to my copy of SQL For Smarties by Joe Celko (2nd ed, p. 411):

> The UNION removes all duplicate rows from the results and does
> not care from which table the duplicate rows came.  We could
> use this feature to write a query to remove duplicates from a
> table:
>    (TABLE tableA)
>    UNION
>    (TABLE tableA);
>
> But this is the same as
>    SELECT DISTINCT * FROM tableA;

however, per the below example, the union and the select distinct
are not the same in postgres 7.0.2.  is joe missing somehting here?
or am i?

thanks
--e--


CREATE TABLE "has_some_dupes" (
        "a_col" character(3),
        "b_col" character(3),
        "c_col" character(3)
);
COPY "has_some_dupes" FROM stdin;
abc     def     ghi
abc     def     ghi
abc     def     ghi
jkl     mno     pqr
jkl     mno     pqr
jkl     mno     pqr
stu     vwx     yz
stu     vwx     yz
stu     vwx     yz
\.


ebridges=> (select * from has_some_dupes)
ebridges-> UNION
ebridges-> (select * from has_some_dupes);
 a_col | b_col | c_col
-------+-------+-------
 abc   | def   | ghi
 abc   | def   | ghi
 abc   | def   | ghi
 jkl   | mno   | pqr
 jkl   | mno   | pqr
 jkl   | mno   | pqr
 stu   | vwx   | yz
 stu   | vwx   | yz
 stu   | vwx   | yz
(9 rows)


ebridges=> select distinct * from has_some_dupes;
 a_col | b_col | c_col
-------+-------+-------
 abc   | def   | ghi
 jkl   | mno   | pqr
 stu   | vwx   | yz
(3 rows)






Re: Is this proper UNION behavior??

From
Tom Lane
Date:
"Edward Q. Bridges" <ed.bridges@buzznik.com> writes:
> ebridges=> (select * from has_some_dupes)
> ebridges-> UNION
> ebridges-> (select * from has_some_dupes);

[ produces the same result as "select * from has_some_dupes" ]

Yup, that's a bug in 7.0.* (and a version or two before).  There's
a planning step that thinks it can simplify UNION/INTERSECT/EXCEPT
expressions as if they were boolean expressions --- in particular,
since "X or X = X", it thinks "X union X = X".  Unfortunately,
whoever wrote that code hadn't actually bothered to consult the
SQL spec about the semantics of UNION, INTERSECT, or EXCEPT :-(.
The spec's rules about duplicate elimination mean that the behavior
is NOT quite like boolean OR/AND/AND NOT.

Since people don't ordinarily write queries as silly as X union X,
this bug hasn't been very high priority to fix.  But it is fixed
for 7.1 ...

            regards, tom lane

Re: Is this proper UNION behavior??

From
Jules Bean
Date:
On Mon, Nov 13, 2000 at 10:12:39PM -0500, Tom Lane wrote:
> "Edward Q. Bridges" <ed.bridges@buzznik.com> writes:
> > ebridges=> (select * from has_some_dupes)
> > ebridges-> UNION
> > ebridges-> (select * from has_some_dupes);
>
> [ produces the same result as "select * from has_some_dupes" ]
>
> Yup, that's a bug in 7.0.* (and a version or two before).  There's
> a planning step that thinks it can simplify UNION/INTERSECT/EXCEPT
> expressions as if they were boolean expressions --- in particular,
> since "X or X = X", it thinks "X union X = X".  Unfortunately,
> whoever wrote that code hadn't actually bothered to consult the
> SQL spec about the semantics of UNION, INTERSECT, or EXCEPT :-(.
> The spec's rules about duplicate elimination mean that the behavior
> is NOT quite like boolean OR/AND/AND NOT.
>
> Since people don't ordinarily write queries as silly as X union X,
> this bug hasn't been very high priority to fix.  But it is fixed
> for 7.1 ...

Of course, the real bug here is in SQL, namely that it allows
duplicates in tables.

Relations don't have duplicates (ask your nearest pet
mathematician). Of course, 'UNION' understands taht relations don't
have duplicates, but bizarrely, SELECT doesn't.

ObPG: I'm no suggesting that Pg change this, by the way, I'm just
ranting pointlessly about one of SQL's stupidities.  Maybe Pg ought to
have an option for 'bette-than-SQL', though, which would, among other
thigns, not permit duplicates in relations (and switch back on that
optimisation).

Jules

Re: Is this proper UNION behavior??

From
"Cristóvão B. B. Dalla Costa"
Date:
> Of course, the real bug here is in SQL, namely that it allows
> duplicates in tables.

Or maybe not. Tables are not *always* relations, one may have a table of
things to do. If one thing has to be done twice, it might have two entries
on the table.

It all depends on the workings of the programs using the database. Just
imagine you SELECT a subset of a tables's unique rows, but that subset
doesn't produce unique rows. Should the non-unique ones be removed behind
your back? I think not.




Re: Is this proper UNION behavior??

From
Jules Bean
Date:
On Tue, Nov 14, 2000 at 12:22:32PM -0300, Cristóvão B. B. Dalla Costa wrote:
> > Of course, the real bug here is in SQL, namely that it allows
> > duplicates in tables.
>
> Or maybe not. Tables are not *always* relations, one may have a table of
> things to do. If one thing has to be done twice, it might have two entries
> on the table.

Hurrah!  A nice, juicy, off-topic thread for me to get my teeth into
and annoy the old-timers..

>
> It all depends on the workings of the programs using the database. Just
> imagine you SELECT a subset of a tables's unique rows, but that subset
> doesn't produce unique rows. Should the non-unique ones be removed behind
> your back? I think not.

PostgreSQL is a /relational/ DBMS (with some object features). SQL is
a /relational/ query language (with some flaws).  They are based on a
mathematical model (incomplete thought it may be). By and large, SQL
tries to maintain this model: that is why UNION behaves the way it
does.  If SQL wasn't trying to be relational UNION would behave like
that ghastly non-distinct version (is it called UNION ALL, maybe?). So
I stick to my point that SELECT is the command which is the except
from the general philsophy.

In answer to your two examples: it is of course quite possible to
model a todo list with duplicate entries in a relational fashion.  It
is absolutely correct behaviour that when you project only some
columns, duplicates are eliminated.

Jules

Re: Is this proper UNION behavior??

From
"Edward Q. Bridges"
Date:
On Tue, 14 Nov 2000 13:35:14 +0000, Jules Bean wrote:

> On Mon, Nov 13, 2000 at 10:12:39PM -0500, Tom Lane wrote:
> > "Edward Q. Bridges" <ed.bridges@buzznik.com> writes:
> > > ebridges=> (select * from has_some_dupes)
> > > ebridges-> UNION
> > > ebridges-> (select * from has_some_dupes);
> >
> > [ produces the same result as "select * from has_some_dupes" ]
> >
> Of course, the real bug here is in SQL, namely that it allows
> duplicates in tables.
>

given the presence of SELECT DISTINCT and unique constraints, i
wouldn't consider it a bug in SQL.  probably better anyway to have
the flexibility of having duplicates in a table, while providing the
ability to constrain to unique tuples if necessary.










Re: Is this proper UNION behavior??

From
Tom Lane
Date:
Jules Bean <jules@jellybean.co.uk> writes:
> ObPG: I'm no suggesting that Pg change this, by the way, I'm just
> ranting pointlessly about one of SQL's stupidities.  Maybe Pg ought to
> have an option for 'bette-than-SQL', though, which would, among other
> thigns, not permit duplicates in relations (and switch back on that
> optimisation).

Just to clarify: it was not really an optimization, at least not to
my mind.  How many people are ever going to write "X union X" in real
queries?  As an optimization attempt, it was a complete waste of time
to be looking for this case --- the savings on the once-in-a-blue-moon
silly query would never pay for the cycles burned to make the test on
every other query.

The reason that code was there in the first place is that the pre-7.1
implementation method for UNION/INTERSECT/EXCEPT could only cope with
query trees in which all the UNIONs were at the top.  So what was
*really* being done was a transformation to disjunctive normal form,
eg,
    (X union Y) intersect Z
becomes
    (X intersect Z) union (Y intersect Z)
The simplication of cases like X union X was a byproduct of the DNF
transformation routine, which was an algorithm designed for boolean
expressions in which such a transformation is completely legitimate.

Now of course what will leap out at you here is that the transformed
query is likely to be considerably *more* work than the original.
So for real queries, this wasn't an optimization at all, but a
pessimization.  I have no intention of putting it back regardless of
debates about better-than-SQL semantics ;-)

            regards, tom lane