Thread: BUG #6669: unique index w/ multiple columns and NULLs

BUG #6669: unique index w/ multiple columns and NULLs

From
jose.soares@sferacarta.com
Date:
The following bug has been logged on the website:

Bug reference:      6669
Logged by:          jose soares
Email address:      jose.soares@sferacarta.com
PostgreSQL version: 8.4.8
Operating system:   x86_64-pc-linux-gnu, debian
Description:

Hi,

I think I have found an error in pg, or at least an inconsistency; take a look
at this.
I created a unique index on two columns, and pg let me enter repeated values
as NULLs (unknown values).
When I ask pg to tell me whether there are repetitions in this index (with
GROUP BY), the inconsistency becomes apparent.



# create table test(id int, data date, code int);
CREATE TABLE

# create UNIQUE index unica on  test(data,code);
CREATE INDEX
# \d test
     Table "public.test"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |
 data   | date    |
 code   | integer |
Indexes:
    "unica" UNIQUE, btree (data, code)

# insert into test values(1,current_date);
INSERT 0 1
# insert into test values(2,current_date);
INSERT 0 1
# insert into test values(3,current_date);
INSERT 0 1
sicer_forli=# select current_date,code, count(*) from test group by 1,2;
    date    | code | count
------------+------+-------
 31-05-2012 |      |     3
(1 row)

P.S.
Oracle doesn't allow inserting two NULLs in such a column.
I don't know which of the two follows the SQL standard, but in this case
Oracle is not inconsistent.

Re: BUG #6669: unique index w/ multiple columns and NULLs

From
Tom Lane
Date:
jose.soares@sferacarta.com writes:
> I think I have found an error in pg, or at least an inconsistency; take a look
> at this.
> I created a unique index on two columns, and pg let me enter repeated values
> as NULLs (unknown values).

This is entirely correct per SQL standard: unique constraints do not
reject duplicated rows that include nulls.  If you read the standard,
unique constraints are defined in terms of UNIQUE predicates, and a
UNIQUE predicate for a table T is defined thus:

         2) If there are no two rows in T such that the value of each column
            in one row is non-null and is equal to the value of the cor-
            responding column in the other row according to Subclause 8.2,
            "<comparison predicate>", then the result of the <unique predi-
            cate> is true; otherwise, the result of the <unique predicate>
            is false.

(SQL92 section 8.9 <unique predicate>)

This is why a primary key constraint is defined as requiring both UNIQUE
and NOT NULL; you need that to ensure that there are indeed no two
indistinguishable rows.

(Mind you, I'm not here to defend *why* the standard is written that
way.  But that is what it says.)
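
To make the standard's rule concrete, here is a small sketch (the table and
constraint names test2/test3/unica2/unica3 are illustrative, not from the
thread): declaring the indexed columns NOT NULL is what turns the permissive
UNIQUE behavior into the strict, primary-key-like behavior.

```sql
-- Plain UNIQUE: rows whose (data, code) contain a NULL are never
-- "equal" under the <unique predicate>, so the duplicate is accepted.
CREATE TABLE test2 (id int, data date, code int,
                    CONSTRAINT unica2 UNIQUE (data, code));
INSERT INTO test2 VALUES (1, current_date);  -- code defaults to NULL
INSERT INTO test2 VALUES (2, current_date);  -- accepted: NULL code

-- UNIQUE plus NOT NULL (the primary-key combination): every row is
-- fully comparable, so the duplicate is rejected.
CREATE TABLE test3 (id int, data date NOT NULL, code int NOT NULL,
                    CONSTRAINT unica3 UNIQUE (data, code));
INSERT INTO test3 VALUES (1, current_date, 1);
INSERT INTO test3 VALUES (2, current_date, 1);  -- ERROR: duplicate key
```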

> Oracle doesn't allow inserting two NULLs in such a column.

Oracle is not exactly the most standards-compliant implementation
around.  They are well-known to be particularly wrong with respect to
NULLs behavior.

            regards, tom lane

Re: BUG #6669: unique index w/ multiple columns and NULLs

From
jo
Date:
Hi Tom,

Thanks for the explanation about standard SQL.
The goodness of it must be accepted by faith. :-)
I still have a doubt about the result of the GROUP BY clause.
It seems to me that there's an inconsistency between the GROUP BY clause
and the unique index.
The GROUP BY clause considers NULLs as known and equal values,
while the unique index constraint considers NULLs as unknown values,
not equal to each other.
Don't you think there's an inconsistency here?

j



Re: BUG #6669: unique index w/ multiple columns and NULLs

From
"Kevin Grittner"
Date:
jo <jose.soares@sferacarta.com> wrote:

> Thanks for the explanation about standard sql.
> The goodness of it must be accepted by faith. :-)

Not if you have the stamina to fight your way through the standards
documents.  ;-)

> I still have a doubt about the result of the GROUP BY clause.
> It seems to me that there's an inconsistency between the GROUP BY
> clause and the unique index.
> The GROUP BY clause considers NULLs as known and equal values,
> while the unique index constraint considers NULLs as unknown values,
> not equal to each other.
> Don't you think there's an inconsistency here?

I think these behaviors are required by the standard.  The
PostgreSQL community generally feels pretty strongly that when
standard syntax is accepted, standard semantics are provided.

While the standard is often criticized, in this case I think it
makes sense.  The meaning of NULL is traditionally "UNKNOWN or NOT
APPLICABLE".  It would not make sense to have a hard prohibition of
two rows which only *might* be relating to the same object.  In the
"NOT APPLICABLE" case it would make sense, but unfortunately SQL has
no way to distinguish which meaning NULL has.  On the other hand,
aggregates like counts might be very useful -- it is often useful to
know not only how many rows have each of the known values, but how
many are missing a value.

Have you looked at whether an exclusion constraint would serve your
needs here?
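
(Editorially: another common workaround, not mentioned in the thread, is a
unique index over a COALESCE expression, which makes NULL codes collide by
mapping them to a sentinel. The index name and the sentinel value -1 are
assumptions; the sentinel must never occur as a real code.)

```sql
-- Sketch: treat a NULL code as a single sentinel value (-1) so that
-- two rows with the same data and both codes NULL conflict.
CREATE UNIQUE INDEX unica_nulls_equal
    ON test (data, COALESCE(code, -1));
```

Much later PostgreSQL releases (15 and up) also accept UNIQUE NULLS NOT
DISTINCT, which treats NULLs as equal for uniqueness without a sentinel.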

-Kevin

Re: BUG #6669: unique index w/ multiple columns and NULLs

From
Bruce Momjian
Date:
On Fri, Jun 01, 2012 at 08:58:32AM +0200, jo wrote:
> I still have a doubt about the result of the GROUP BY clause.
> It seems to me that there's an inconsistency between the GROUP BY
> clause and the unique index.
> The GROUP BY clause considers NULLs as known and equal values,
> while the unique index constraint considers NULLs as unknown values,
> not equal to each other.
> Don't you think there's an inconsistency here?

Yes, I can see your point.  I think GROUP BY is doing the best it can
with the NULL;  having it consider them as different would lead to long
output.  Also consider that COUNT(*) counts nulls, while COUNT(col) does
not:

    WITH null_test (col) AS
    (
       SELECT 8
       UNION ALL
       SELECT NULL
    )
    SELECT COUNT(*) FROM null_test
    UNION ALL
    SELECT COUNT(col) FROM null_test;

     count
    -------
         2
         1
    (2 rows)

COUNT(*) can't skip nulls because there is no specified column, but why
does COUNT(col) skip nulls --- again, inconsistent.

I think NULL is helpful for unknown values, and required as the output
of omitted INSERT columns and unmatched outer-join columns.  I think the
aggregates then did the best they could.
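
Building on the example above, the two counting rules can be combined to
count the missing values directly, since COUNT(*) counts rows while
COUNT(col) counts non-NULL values:

```sql
WITH null_test (col) AS
(
   SELECT 8
   UNION ALL
   SELECT NULL
)
SELECT COUNT(*)              AS total_rows,     -- counts all rows: 2
       COUNT(col)            AS non_null_rows,  -- skips NULLs: 1
       COUNT(*) - COUNT(col) AS null_rows       -- missing values: 1
FROM null_test;
```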

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BUG #6669: unique index w/ multiple columns and NULLs

From
"Kevin Grittner"
Date:
Bruce Momjian <bruce@momjian.us> wrote:

> COUNT(*) can't skip nulls because there is no specified column,
> but why does COUNT(col) skip nulls --- again, inconsistent.

I disagree -- one is counting rows, the other is counting rows with
a value in that column.  I guess one could criticize the syntax for
specifying that as non-obvious, but it seems pretty reasonable to
me.

-Kevin

Re: BUG #6669: unique index w/ multiple columns and NULLs

From
Bruce Momjian
Date:
On Mon, Jun 04, 2012 at 10:29:22AM -0500, Kevin Grittner wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
>
> > COUNT(*) can't skip nulls because there is no specified column,
> > but why does COUNT(col) skip nulls --- again, inconsistent.
>
> I disagree -- one is counting rows, the other is counting rows with
> a value in that column.  I guess one could criticize the syntax for
> specifying that as non-obvious, but it seems pretty reasonable to
> me.

I get your point about COUNT(*) really counting rows, not values, but
why doesn't GROUP BY then skip nulls?

    WITH null_test (col1, col2) AS
    (
       SELECT 1, null
       UNION ALL
       SELECT null, null
    )
    SELECT COUNT(*), col2 FROM null_test group by col2
    UNION ALL
    SELECT COUNT(col1), col2 FROM null_test group by col2;

     count | col2
    -------+------
         2 |
         1 |
    (2 rows)

Since col2 is NULL in both places, why is it processed?  It looks like
GROUP BY is selecting the NULL rows, and COUNT is then processing them
based on its own rules.

I think the original complaint is that NULL != NULL in a WHERE clause,
but GROUP BY is able to group them together just fine.

Anyway, just thoughts on the topic.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BUG #6669: unique index w/ multiple columns and NULLs

From
"Kevin Grittner"
Date:
Bruce Momjian <bruce@momjian.us> wrote:

> I get your point about COUNT(*) really counting rows, not values,
> but why doesn't GROUP BY then skip nulls?
>
>   WITH null_test (col1, col2) AS
>   (
>      SELECT 1, null
>      UNION ALL
>      SELECT null, null
>   )
>   SELECT COUNT(*), col2 FROM null_test group by col2
>   UNION ALL
>   SELECT COUNT(col1), col2 FROM null_test group by col2;
>
>    count | col2
>   -------+------
>        2 |
>        1 |
>   (2 rows)

I'm having trouble seeing why this result is confusing.  You asked
for counts with a GROUP BY clause.  In such a case, NULLs are a
group, since you might be interested in how many *are* null.  Then
you did a count of all rows and a count of rows where another value
wasn't NULL.  You got the only reasonable result, IMO.  (Well,
unless you argue that a row with no known values should be an error
in the first place, which if I remember correctly is what E.F. Codd
argued for; but that is one point on which the standards committee
didn't go with Codd's position.)

> Looks like GROUP BY is selecting the NULL rows, then COUNT is
> processing them based on its rules.

I would tend to view it that COUNT is processing the rows it was
asked to process in each case, and GROUP BY is grouping them as
requested.

> I think the original complaint is that NULL != NULL in a WHERE
> clause, but GROUP BY is able to group them together just fine.

Whoa!  I think I know what you meant, but that is a dangerously
misleading misstatement.  It is not true that NULL = NULL, but
neither is it true that NULL != NULL.  In fact, you also don't get
TRUE from NOT NULL = NULL.  If you don't know either value, you
can't know that they are equal, and you can't know that they are
unequal.  The results of such comparisons are UNKNOWN.  That doesn't
mean you would always find the count of rows where the value is NULL
uninteresting; hence the IS NOT DISTINCT FROM concept is effectively
used for grouping.
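
The distinction can be seen directly: equality between NULLs yields
unknown, while distinctness treats them as the same value.

```sql
SELECT NULL = NULL                    AS equals,        -- NULL (unknown)
       NULL IS NOT DISTINCT FROM NULL AS not_distinct;  -- true
```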

Performing logical or set operations on data sets with missing
values is an inherently tricky business, but I think that overall
SQL has made reasonable choices on how to do that; my biggest gripe
is that there is no standard way to distinguish between UNKNOWN and
NOT APPLICABLE.  The fuzziest areas seem to me to be related to
that deficiency.  As long as NULL is not abused for such things as
"known to be zero" in an accounting record (which is subtly but
significantly different from "not applicable"), NULL is almost
always (IMO) better than some "magic value".  If you have ever
converted data from a database where names were split into multiple
fields, and NULL was not allowed for middle name, you will probably
agree.

-Kevin

Re: BUG #6669: unique index w/ multiple columns and NULLs

From
Bruce Momjian
Date:
On Mon, Jun 04, 2012 at 11:26:20AM -0500, Kevin Grittner wrote:
> > I think the original complaint is that NULL != NULL in a WHERE
> > clause, but GROUP BY is able to group them together just fine.
>
> Whoa!  I think I know what you meant, but that is a dangerously
> misleading misstatement.  It is not true that NULL = NULL, but
> neither is it true that NULL != NULL.  In fact, you also don't get
> TRUE from NOT NULL = NULL.  If you don't know either value, you
> can't know that they are equal, and you can't know that they are
> unequal.  The results of such comparisons are UNKNOWN.  That doesn't
> mean you would always find the count of rows where the value is NULL
> uninteresting; hence the IS NOT DISTINCT FROM concept is effectively
> used for grouping.

Well, thinking of this from a procedural-language perspective: a new
value comes in, and GROUP BY has to figure out whether it already has a
matching value.  If a NULL comes in, any comparison against it yields
NULL, even a comparison against another NULL, so what is the logic that
allows these NULLs to be placed in the same group?  Every comparison
returns NULL.  I assume IS NOT DISTINCT FROM is the answer.

> Performing logical or set operations on data sets with missing
> values is an inherently tricky business, but I think that overall
> SQL has made reasonable choices on how to do that; my biggest gripe
> is that there is no standard way to distinguish between UNKNOWN and
> NOT APPLICABLE.  [...]

Agreed, we could use another null-type value, though NULLs are already
so confusing that I am sure adding another one caused too many groans in
the room when it was suggested.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +

Re: BUG #6669: unique index w/ multiple columns and NULLs

From
Jeff Davis
Date:
On Mon, 2012-06-04 at 11:56 -0400, Bruce Momjian wrote:
> I get your point about COUNT(*) really counting rows, not values, but
> why doesn't GROUP BY then skip nulls?

A while ago, I came to the conclusion that applying logic to extrapolate
the behavior of NULL is a bad idea:

http://thoughts.davisjeff.com/2009/08/02/what-is-the-deal-with-nulls/

Jose was not wrong about the inconsistency between UNIQUE and GROUP BY.
But the answer is that "we do it that way because the standard says so".
And that's a good reason.

Regards,
    Jeff Davis

Re: BUG #6669: unique index w/ multiple columns and NULLs

From
jo
Date:
Jeff Davis wrote:
> A while ago, I came to the conclusion that applying logic to
> extrapolate the behavior of NULL is a bad idea:
>
> http://thoughts.davisjeff.com/2009/08/02/what-is-the-deal-with-nulls/

The article Jeff pointed to is very insightful.

NULLs, if you know them, avoid them :-(

I agree with Date and Darwen about NULLs:

"Chris Date and Hugh Darwen, the authors of The Third Manifesto, have
suggested that the SQL NULL implementation is inherently flawed and
should be eliminated altogether, pointing to inconsistencies and flaws
in the implementation of SQL NULL-handling (particularly in aggregate
functions) as proof that the entire concept of NULL is flawed and
should be removed from the relational model."
(Wikipedia, "Null (SQL)")

j