Re: BUG #6669: unique index w/ multiple columns and NULLs - Mailing list pgsql-bugs

From Kevin Grittner
Subject Re: BUG #6669: unique index w/ multiple columns and NULLs
Date
Msg-id 4FCC9B5C020000250004801D@gw.wicourts.gov
Whole thread Raw
In response to Re: BUG #6669: unique index w/ multiple columns and NULLs  (Bruce Momjian <bruce@momjian.us>)
Responses Re: BUG #6669: unique index w/ multiple columns and NULLs
List pgsql-bugs
Bruce Momjian <bruce@momjian.us> wrote:

> I get your point about COUNT(*) really counting rows, not values,
> but why doesn't GROUP BY then skip nulls?
>
>   WITH null_test (col1, col2) AS
>   (
>      SELECT 1, null
>      UNION ALL
>      SELECT null, null
>   )
>   SELECT COUNT(*), col2 FROM null_test group by col2
>   UNION ALL
>   SELECT COUNT(col1), col2 FROM null_test group by col2;
>
>    count | col2
>   -------+------
>        2 |
>        1 |
>   (2 rows)

I'm having trouble seeing why this result is confusing.  You asked
for counts with a GROUP BY clause.  In such a case, NULLs are a
group, since you might be interested in how many *are* null.  Then
you did a count of all rows and a count of rows where another value
wasn't NULL.  You got the only reasonable result, IMO.  (Well,
unless you argue that a row with no known values should be an error
in the first place, which if I remember correctly is what E.F. Codd
argued for; but that is one point on which the standards committee
didn't go with Codd's position.)

> Looks like GROUP BY is selecting the NULL rows, then COUNT is
> processing them based on its rules.

I would tend to view it that COUNT is processing the rows it was
asked to process in each case, and GROUP BY is grouping them as
requested.

> I think the original complaint is that NULL != NULL in a WHERE
> clause, but GROUP BY is able to group them together just fine.

Whoa!  I think I know what you meant, but that is a dangerously
misleading misstatement.  It is not true that NULL = NULL, but
neither is it true that NULL != NULL.  In fact, you also don't get
TRUE from NOT NULL = NULL.  If you don't know either value, you
can't know that they are equal, and you can't know that they are
unequal.  The results of such comparisons are UNKNOWN.  That doesn't
mean you would always find the count of rows where the value is NULL
uninteresting; hence the IS NOT DISTINCT FROM concept is effectively
used for grouping.

Performing logical or set operations on data sets with missing
values is an inherently tricky business, but I think that overall
SQL has made reasonable choices on how to do that; my biggest gripe
is that there is no standard way to distinguish between UNKNOWN and
NOT APPLICABLE.  The fuzziest areas seem to me to be related to
that deficiency.  As long as NULL is not abused for such things as
"known to be zero" in an accounting record (which is subtly but
significantly different from "not applicable"), NULL is almost
always (IMO) better than some "magic value".  If you have ever
converted data from a database where names were split into multiple
fields, and NULL was not allowed for middle name, you will probably
agree.

-Kevin

pgsql-bugs by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: BUG #6669: unique index w/ multiple columns and NULLs
Next
From: Bruce Momjian
Date:
Subject: Re: BUG #6669: unique index w/ multiple columns and NULLs