Thread: Documentation of EXCEPT ALL may have a bug

Documentation of EXCEPT ALL may have a bug

From
PG Doc comments form
Date:
The following documentation comment has been logged on the website:

Page: https://www.postgresql.org/docs/10/static/queries-union.html
Description:

I believe that the documented behavior of EXCEPT is not in agreement
with Postgres behavior (I'm using Postgres 9.5). The documents say:

    EXCEPT returns all rows that are in the result of query1 but not
    in the result of query2. (This is sometimes called the difference
    between two queries.) Again, duplicates are eliminated unless
    EXCEPT ALL is used.

Here is a test script:

    drop table if exists t;
    drop table if exists u;
    
    create table t(x int);
    create table u(x int);
    
    insert into t values (1), (2), (2), (3), (3);
    insert into u values (1), (2);
    
    select * from t except select * from u;
    select * from t except all select * from u;

And here is the output:

    DROP TABLE
    DROP TABLE
    CREATE TABLE
    CREATE TABLE
    INSERT 0 5
    INSERT 0 2
    
     x 
    ---
     3
    (1 row)
    
     x 
    ---
     3
     3
     2
    (3 rows)
    
The output from EXCEPT matches the documented behavior.

The output from EXCEPT ALL makes sense to me, but I think it is at
odds with the documentation: "EXCEPT returns all rows that are in the
result of query1 but not in the result of query2." This general
statement is then modified by the discussion of EXCEPT ALL. That first
sentence does not deal with duplicates in the input tables, and just
discusses set membership. Each occurrence of (3) in query1 is
therefore kept.  For EXCEPT, the duplicates are eliminated (yielding
the output [3]), and EXCEPT ALL should therefore yield [3, 3].  In the
EXCEPT ALL case, both occurrences of (2) in query1 should be
eliminated by the one occurrence in query2. I think this is a fair
interpretation based on the wording.

To match the observed behavior, I think that the description of EXCEPT
ALL needs to be modified to something like this:

EXCEPT ALL returns those rows of query1 in excess of matching rows in
query2, as well as rows of query1 that have no match in query2.

Re: Documentation of EXCEPT ALL may have a bug

From
Alvaro Herrera
Date:
PG Doc comments form wrote:

>     create table t(x int);
>     create table u(x int);
>     
>     insert into t values (1), (2), (2), (3), (3);
>     insert into u values (1), (2);
>     
>     select * from t except all select * from u;

>      x 
>     ---
>      3
>      3
>      2
>     (3 rows)

I find this pretty odd behavior.  Is this not an outright bug?  I
haven't read the SQL standard on this matter[*], but if they define
EXCEPT ALL to work like this, then it seems pretty useless.  (Maybe they
just didn't intend EXCEPT ALL to be useful?)  If this is indeed the
case, maybe we should amend the docs not only to explain this behavior
but also to warn against the construct.

[*] I didn't try terribly hard, but couldn't actually find where the
behavior is defined.  What I have on hand is a draft of SQL:2011 where
this appears to be defined in 7.13 <query expression> but I was unable
to find the rules for set operations.  It refers to 9.12 Grouping
operations but that defines conformance rules only.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Documentation of EXCEPT ALL may have a bug

From
Pantelis Theodosiou
Date:


On Sat, Feb 10, 2018 at 11:59 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
PG Doc comments form wrote:

>     create table t(x int);
>     create table u(x int);
>
>     insert into t values (1), (2), (2), (3), (3);
>     insert into u values (1), (2);
>
>     select * from t except all select * from u;

>      x
>     ---
>      3
>      3
>      2
>     (3 rows)

I find this pretty odd behavior.  Is this not an outright bug?  I
haven't read the SQL standard on this matter[*], but if they define
EXCEPT ALL to work like this, then it seems pretty useless.  (Maybe they
just didn't intend EXCEPT ALL to be useful?)  If this is indeed the
case, maybe we should amend the docs not only to explain this behavior
but also to warn against the construct.

I think that's the way it is defined (but I agree, I don't remember ever seeing a use for it)
 

[*] I didn't try terribly hard, but couldn't actually find where the
behavior is defined.  What I have on hand is a draft of SQL:2011 where
this appears to be defined in 7.13 <query expression> but I was unable
to find the rules for set operations.  It refers to 9.12 Grouping
operations but that defines conformance rules only.

--

I may have a different version but I see:

iii)
T contains the following rows:
1) Let R be a row that is a duplicate of some row in ET1 or of some row in ET2 or both. Let
m be the number of duplicates of R in ET1 and let n be the number of duplicates of R in
ET2, where m ≥ 0 and n ≥ 0.
2) If DISTINCT is specified or implicit, then ....
...

3) If ALL is specified, then
Case:
A) If UNION is specified, then the number of duplicates of R that T contains is (m + n).
B) If EXCEPT is specified, then the number of duplicates of R that T contains is the
maximum of (m – n) and 0 (zero).
C) If INTERSECT is specified, then the number of duplicates of R that T contains is the
minimum of m and n.

Re: Documentation of EXCEPT ALL may have a bug

From
Pantelis Theodosiou
Date:


On Sat, Feb 10, 2018 at 1:02 PM, Pantelis Theodosiou <ypercube@gmail.com> wrote:


On Sat, Feb 10, 2018 at 11:59 AM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
PG Doc comments form wrote:

>     create table t(x int);
>     create table u(x int);
>
>     insert into t values (1), (2), (2), (3), (3);
>     insert into u values (1), (2);
>
>     select * from t except all select * from u;

>      x
>     ---
>      3
>      3
>      2
>     (3 rows)

I find this pretty odd behavior.  Is this not an outright bug?  I
haven't read the SQL standard on this matter[*], but if they define
EXCEPT ALL to work like this, then it seems pretty useless.  (Maybe they
just didn't intend EXCEPT ALL to be useful?)  If this is indeed the
case, maybe we should amend the docs not only to explain this behavior
but also to warn against the construct.

I think that's the way it is defined (but I agree, I don't remember ever seeing a use for it)
 

[*] I didn't try terribly hard, but couldn't actually find where the
behavior is defined.  What I have on hand is a draft of SQL:2011 where
this appears to be defined in 7.13 <query expression> but I was unable
to find the rules for set operations.  It refers to 9.12 Grouping
operations but that defines conformance rules only.

--

I may have a different version but I see:

That is in:
7.13 <query expression>
General Rules 3, Case b:

iii)
T contains the following rows:
1) Let R be a row that is a duplicate of some row in ET1 or of some row in ET2 or both. Let
m be the number of duplicates of R in ET1 and let n be the number of duplicates of R in
ET2, where m ≥ 0 and n ≥ 0.
2) If DISTINCT is specified or implicit, then ....
...

3) If ALL is specified, then
Case:
A) If UNION is specified, then the number of duplicates of R that T contains is (m + n).
B) If EXCEPT is specified, then the number of duplicates of R that T contains is the
maximum of (m – n) and 0 (zero).
C) If INTERSECT is specified, then the number of duplicates of R that T contains is the
minimum of m and n.


Re: Documentation of EXCEPT ALL may have a bug

From
Tom Lane
Date:
Alvaro Herrera <alvherre@alvh.no-ip.org> writes:
> PG Doc comments form wrote:
>> create table t(x int);
>> create table u(x int);
>> 
>> insert into t values (1), (2), (2), (3), (3);
>> insert into u values (1), (2);
>> 
>> select * from t except all select * from u;

>> x 
>> ---
>> 3
>> 3
>> 2
>> (3 rows)

> I find this pretty odd behavior.  Is this not an outright bug?

It's exactly what the standard says to do: if there are M occurrences
of a row value in the LHS, and N occurrences in the RHS, emit
max(M-N, 0) copies of the row.  To my mind that's a reasonable definition
of EXCEPT if you suppose that nominally-identical rows are matched and
discarded one by one, rather than with de-duplication occurring
beforehand.

> [*] I didn't try terribly hard, but couldn't actually find where the
> behavior is defined.

In SQL:2011, it's 7.13 <query expression> general rule 3) b) iii) 3) B),
on page 420 in the draft version I have.

            regards, tom lane