Thread: BUG #6669: unique index w/ multiple columns and NULLs
The following bug has been logged on the website:

Bug reference:      6669
Logged by:          jose soares
Email address:      jose.soares@sferacarta.com
PostgreSQL version: 8.4.8
Operating system:   x86_64-pc-linux-gnu, debian
Description:

Hi,

I think I have found an error in pg, or at least an inconsistency; take a
look at this. I created a unique index on two columns and pg let me enter
repeated values as NULLs (unknown values). When I ask pg to tell me whether
there are repetitions in this index (with GROUP BY), the inconsistency
becomes apparent.

    # create table test(id int, data date, code int);
    CREATE TABLE
    # create UNIQUE index unica on test(data,code);
    CREATE INDEX
    # \d test
          Table "public.test"
     Column |  Type   | Modifiers
    --------+---------+-----------
     id     | integer |
     data   | date    |
     code   | integer |
    Indexes:
        "unica" UNIQUE, btree (data, code)

    # insert into test values(1,current_date);
    INSERT 0 1
    # insert into test values(2,current_date);
    INSERT 0 1
    # insert into test values(3,current_date);
    INSERT 0 1
    # select current_date, code, count(*) from test group by 1,2;
        date    | code | count
    ------------+------+-------
     31-05-2012 |      |     3
    (1 row)

ps: Oracle doesn't allow inserting two NULLs into such columns. I don't
know which of them follows the SQL standard, but in this case Oracle is
not inconsistent.
jose.soares@sferacarta.com writes:
> I think I have found an error in pg, or at least an inconsistency; take a
> look at this. I created a unique index on two columns and pg let me enter
> repeated values as NULLs (unknown values).

This is entirely correct per SQL standard: unique constraints do not
reject duplicated rows that include nulls.  If you read the standard,
unique constraints are defined in terms of UNIQUE predicates, and a
UNIQUE predicate for a table T is defined thus:

         2) If there are no two rows in T such that the value of each
            column in one row is non-null and is equal to the value of
            the corresponding column in the other row according to
            Subclause 8.2, "<comparison predicate>", then the result of
            the <unique predicate> is true; otherwise, the result of
            the <unique predicate> is false.

(SQL92 section 8.9 <unique predicate>)

This is why a primary key constraint is defined as requiring both UNIQUE
and NOT NULL; you need that to ensure that there are indeed no two
indistinguishable rows.

(Mind you, I'm not here to defend *why* the standard is written that
way.  But that is what it says.)

> Oracle doesn't allow inserting two NULLs into such columns.

Oracle is not exactly the most standards-compliant implementation
around.  They are well-known to be particularly wrong with respect to
NULL behavior.

			regards, tom lane
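To make the standard's rule concrete: two rows conflict only when every
indexed column is non-null and equal in both rows, so a row with a NULL
anywhere in its key can never conflict with anything.  A minimal session
sketching this (the NULLS NOT DISTINCT clause at the end is an aside
that assumes PostgreSQL 15 or later; nothing like it existed in 8.4):

    CREATE TABLE t (data date, code int);
    CREATE UNIQUE INDEX u ON t (data, code);

    INSERT INTO t VALUES ('2012-05-31', 1);
    INSERT INTO t VALUES ('2012-05-31', 1);     -- ERROR: duplicate key
    INSERT INTO t VALUES ('2012-05-31', NULL);
    INSERT INTO t VALUES ('2012-05-31', NULL);  -- succeeds: NULLs never compare equal

    -- PostgreSQL 15+ only: opt in to treating NULLs as equal for uniqueness.
    -- This would fail here, because the two NULL rows above now count as
    -- duplicates:
    CREATE UNIQUE INDEX u2 ON t (data, code) NULLS NOT DISTINCT;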
Hi Tom,

Thanks for the explanation about standard SQL.
The goodness of it must be accepted by faith. :-)

I still have a doubt about the result of the GROUP BY clause. It seems to
me that there's an inconsistency between the GROUP BY clause and the
unique index. The GROUP BY clause considers NULLs as known, equal values,
while the unique index constraint considers NULLs as unknown values that
are not equal to one another. Don't you think there's an inconsistency
here?

j

Tom Lane wrote:
> jose.soares@sferacarta.com writes:
>> I think I have found an error in pg, or at least an inconsistency; take
>> a look at this. I created a unique index on two columns and pg let me
>> enter repeated values as NULLs (unknown values).
>
> This is entirely correct per SQL standard: unique constraints do not
> reject duplicated rows that include nulls.  If you read the standard,
> unique constraints are defined in terms of UNIQUE predicates, and a
> UNIQUE predicate for a table T is defined thus:
>
>          2) If there are no two rows in T such that the value of each
>             column in one row is non-null and is equal to the value of
>             the corresponding column in the other row according to
>             Subclause 8.2, "<comparison predicate>", then the result of
>             the <unique predicate> is true; otherwise, the result of
>             the <unique predicate> is false.
>
> (SQL92 section 8.9 <unique predicate>)
>
> This is why a primary key constraint is defined as requiring both UNIQUE
> and NOT NULL; you need that to ensure that there are indeed no two
> indistinguishable rows.
>
> (Mind you, I'm not here to defend *why* the standard is written that
> way.  But that is what it says.)
>
>> Oracle doesn't allow inserting two NULLs into such columns.
>
> Oracle is not exactly the most standards-compliant implementation
> around.  They are well-known to be particularly wrong with respect to
> NULL behavior.
>
>			regards, tom lane
jo <jose.soares@sferacarta.com> wrote:

> Thanks for the explanation about standard SQL.
> The goodness of it must be accepted by faith. :-)

Not if you have the stamina to fight your way through the standards
documents. ;-)

> I still have a doubt about the result of the GROUP BY clause. It seems
> to me that there's an inconsistency between the GROUP BY clause and
> the unique index. The GROUP BY clause considers NULLs as known, equal
> values, while the unique index constraint considers NULLs as unknown
> values that are not equal to one another. Don't you think there's an
> inconsistency here?

I think these behaviors are required by the standard.  The PostgreSQL
community generally feels pretty strongly that when standard syntax is
accepted, standard semantics are provided.

While the standard is often criticized, in this case I think it makes
sense.  The meaning of NULL is traditionally "UNKNOWN or NOT
APPLICABLE".  It would not make sense to have a hard prohibition of two
rows which only *might* be referring to the same object.  In the "NOT
APPLICABLE" case it would make sense, but unfortunately SQL has no way
to distinguish which meaning NULL has.  On the other hand, aggregates
like counts might be very useful -- it is often useful to know not only
how many rows have each of the known values, but how many are missing a
value.

Have you looked at whether an exclusion constraint would serve your
needs here?

-Kevin
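For what it's worth, the usual workaround on versions of that era, if
you wanted NULL treated as one distinct value, was a pair of partial
unique indexes -- a sketch of that well-known technique against Jose's
table, not something spelled out in this thread:

    -- At most one row per (data, code), counting NULL as a single code value:
    CREATE UNIQUE INDEX unica_notnull ON test (data, code) WHERE code IS NOT NULL;
    CREATE UNIQUE INDEX unica_null    ON test (data)       WHERE code IS NULL;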
On Fri, Jun 01, 2012 at 08:58:32AM +0200, jo wrote:
> Hi Tom,
>
> Thanks for the explanation about standard SQL.
> The goodness of it must be accepted by faith. :-)
>
> I still have a doubt about the result of the GROUP BY clause. It seems
> to me that there's an inconsistency between the GROUP BY clause and
> the unique index. The GROUP BY clause considers NULLs as known, equal
> values, while the unique index constraint considers NULLs as unknown
> values that are not equal to one another. Don't you think there's an
> inconsistency here?

Yes, I can see your point.  I think GROUP BY is doing the best it can
with the NULLs; having it consider them as different would lead to long
output.

Also consider that COUNT(*) counts nulls, while COUNT(col) does not:

    WITH null_test (col) AS
    (
        SELECT 8
        UNION ALL
        SELECT NULL
    )
    SELECT COUNT(*) FROM null_test
    UNION ALL
    SELECT COUNT(col) FROM null_test;

     count
    -------
         2
         1
    (2 rows)

COUNT(*) can't skip nulls because there is no specified column, but why
does COUNT(col) skip nulls --- again, inconsistent.

I think NULL is helpful for unknown values, and required as the output
of missing INSERT columns and unjoined outer-join columns.  I think the
aggregates then did the best they could.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
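A side-by-side version of the same counts, with an explicit count of
the NULLs, may make the distinction clearer (an illustration along the
same lines, not from Bruce's message; the FILTER clause assumes
PostgreSQL 9.4 or later, while COUNT(*) - COUNT(col) gives the same
number on any version):

    WITH null_test (col) AS
    (
        SELECT 8
        UNION ALL
        SELECT NULL
    )
    SELECT COUNT(*)                            AS all_rows,
           COUNT(col)                          AS non_null_values,
           COUNT(*) FILTER (WHERE col IS NULL) AS null_values
    FROM null_test;

     all_rows | non_null_values | null_values
    ----------+-----------------+-------------
            2 |               1 |           1
    (1 row)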
Bruce Momjian <bruce@momjian.us> wrote:

> COUNT(*) can't skip nulls because there is no specified column, but
> why does COUNT(col) skip nulls --- again, inconsistent.

I disagree -- one is counting rows, the other is counting rows with a
value in that column.  I guess one could criticize the syntax for
specifying that as non-obvious, but it seems pretty reasonable to me.

-Kevin
On Mon, Jun 04, 2012 at 10:29:22AM -0500, Kevin Grittner wrote:
> Bruce Momjian <bruce@momjian.us> wrote:
>
>> COUNT(*) can't skip nulls because there is no specified column, but
>> why does COUNT(col) skip nulls --- again, inconsistent.
>
> I disagree -- one is counting rows, the other is counting rows with a
> value in that column.  I guess one could criticize the syntax for
> specifying that as non-obvious, but it seems pretty reasonable to me.

I get your point about COUNT(*) really counting rows, not values, but
why doesn't GROUP BY then skip nulls?

    WITH null_test (col1, col2) AS
    (
        SELECT 1, NULL
        UNION ALL
        SELECT NULL, NULL
    )
    SELECT COUNT(*), col2 FROM null_test GROUP BY col2
    UNION ALL
    SELECT COUNT(col1), col2 FROM null_test GROUP BY col2;

     count | col2
    -------+------
         2 |
         1 |
    (2 rows)

Since col2 is NULL in both rows, why is it processed?  Looks like GROUP
BY is selecting the NULL rows, then COUNT is processing them based on
its rules.

I think the original complaint is that NULL != NULL in a WHERE clause,
but GROUP BY is able to group them together just fine.

Anyway, just thoughts on the topic.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
Bruce Momjian <bruce@momjian.us> wrote:

> I get your point about COUNT(*) really counting rows, not values,
> but why doesn't GROUP BY then skip nulls?
>
>     WITH null_test (col1, col2) AS
>     (
>         SELECT 1, NULL
>         UNION ALL
>         SELECT NULL, NULL
>     )
>     SELECT COUNT(*), col2 FROM null_test GROUP BY col2
>     UNION ALL
>     SELECT COUNT(col1), col2 FROM null_test GROUP BY col2;
>
>      count | col2
>     -------+------
>          2 |
>          1 |
>     (2 rows)

I'm having trouble seeing why this result is confusing.  You asked for
counts with a GROUP BY clause.  In such a case, NULLs are a group,
since you might be interested in how many *are* null.  Then you did a
count of all rows and a count of rows where another value wasn't NULL.
You got the only reasonable result, IMO.  (Well, unless you argue that
a row with no known values should be an error in the first place,
which if I remember correctly is what E.F. Codd argued for; but that
is one point on which the standards committee didn't go with Codd's
position.)

> Looks like GROUP BY is selecting the NULL rows, then COUNT is
> processing them based on its rules.

I would tend to view it as COUNT processing the rows it was asked to
process in each case, and GROUP BY grouping them as requested.

> I think the original complaint is that NULL != NULL in a WHERE
> clause, but GROUP BY is able to group them together just fine.

Whoa!  I think I know what you meant, but that is a dangerously
misleading misstatement.  It is not true that NULL = NULL, but neither
is it true that NULL != NULL.  In fact, you also don't get TRUE from
NOT (NULL = NULL).  If you don't know either value, you can't know
that they are equal, and you can't know that they are unequal.  The
results of such comparisons are UNKNOWN.  That doesn't mean you would
always find the count of rows where the value is NULL uninteresting;
hence the IS NOT DISTINCT FROM concept is effectively used for
grouping.

Performing logical or set operations on data sets with missing values
is an inherently tricky business, but I think that overall SQL has
made reasonable choices on how to do that.  My biggest gripe is that
there is no standard way to distinguish between UNKNOWN and NOT
APPLICABLE; the fuzziest areas seem to me to be related to that
deficiency.  As long as NULL is not abused for such things as "known
to be zero" in an accounting record (which is subtly but significantly
different from "not applicable"), NULL is almost always (IMO) better
than some "magic value".  If you have ever converted data from a
database where names were split into multiple fields, and NULL was not
allowed for middle name, you will probably agree.

-Kevin
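A quick session making those three-valued comparisons concrete (a
sketch; IS [NOT] DISTINCT FROM is standard SQL and has long been
supported by PostgreSQL):

    SELECT NULL = NULL                    AS equals,        -- null (unknown)
           NULL <> NULL                   AS not_equals,    -- null (unknown)
           NULL IS NOT DISTINCT FROM NULL AS not_distinct,  -- true: the matching rule GROUP BY effectively uses
           1 IS DISTINCT FROM NULL        AS is_distinct;   -- true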
On Mon, Jun 04, 2012 at 11:26:20AM -0500, Kevin Grittner wrote:
>> I think the original complaint is that NULL != NULL in a WHERE
>> clause, but GROUP BY is able to group them together just fine.
>
> Whoa!  I think I know what you meant, but that is a dangerously
> misleading misstatement.  It is not true that NULL = NULL, but
> neither is it true that NULL != NULL.  In fact, you also don't get
> TRUE from NOT (NULL = NULL).  If you don't know either value, you
> can't know that they are equal, and you can't know that they are
> unequal.  The results of such comparisons are UNKNOWN.  That doesn't
> mean you would always find the count of rows where the value is NULL
> uninteresting; hence the IS NOT DISTINCT FROM concept is effectively
> used for grouping.

Well, thinking of this from a procedural-language perspective: a new
value comes in, and GROUP BY has to figure out whether it already has a
matching group.  If a NULL comes in, anything you compare it to yields
NULL, even another NULL, so what is the logic that allows these NULLs
to be placed in the same group?  Every comparison returns NULL.  I
assume IS NOT DISTINCT FROM is the answer.

> Performing logical or set operations on data sets with missing
> values is an inherently tricky business, but I think that overall
> SQL has made reasonable choices on how to do that.  My biggest gripe
> is that there is no standard way to distinguish between UNKNOWN and
> NOT APPLICABLE; the fuzziest areas seem to me to be related to that
> deficiency.  As long as NULL is not abused for such things as "known
> to be zero" in an accounting record (which is subtly but
> significantly different from "not applicable"), NULL is almost
> always (IMO) better than some "magic value".  If you have ever
> converted data from a database where names were split into multiple
> fields, and NULL was not allowed for middle name, you will probably
> agree.

Agreed, we could use another null-type value, though NULLs are already
so confusing that I am sure suggesting another one would just cause too
many groans in the room.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +
On Mon, 2012-06-04 at 11:56 -0400, Bruce Momjian wrote:
> I get your point about COUNT(*) really counting rows, not values, but
> why doesn't GROUP BY then skip nulls?

A while ago, I came to the conclusion that applying logic to
extrapolate the behavior of NULL is a bad idea:

http://thoughts.davisjeff.com/2009/08/02/what-is-the-deal-with-nulls/

Jose was not wrong about the inconsistency between UNIQUE and GROUP BY.
But the answer is that "we do it that way because the standard says
so".  And that's a good reason.

Regards,
	Jeff Davis
Jeff Davis wrote:
> On Mon, 2012-06-04 at 11:56 -0400, Bruce Momjian wrote:
>> I get your point about COUNT(*) really counting rows, not values, but
>> why doesn't GROUP BY then skip nulls?
>
> A while ago, I came to the conclusion that applying logic to
> extrapolate the behavior of NULL is a bad idea:
>
> http://thoughts.davisjeff.com/2009/08/02/what-is-the-deal-with-nulls/
>
> Jose was not wrong about the inconsistency between UNIQUE and GROUP BY.
> But the answer is that "we do it that way because the standard says
> so".  And that's a good reason.
>
> Regards,
>	Jeff Davis

The article Jeff points to is very insightful.  NULLs: if you know
them, avoid them. :-(

I agree with Date and Darwen about NULLs:

"Chris Date and Hugh Darwen, the authors of The Third Manifesto, have
suggested that the SQL NULL implementation is inherently flawed and
should be eliminated altogether, pointing to inconsistencies and flaws
in the implementation of SQL NULL-handling (particularly in aggregate
functions) as proof that the entire concept of NULL is flawed and
should be removed from the relational model."
(http://en.wikipedia.org/wiki/Null_%28SQL%29)

j