Thread: Weird NOT IN effect with NULL values
When doing a subselect with NOT IN, as in SELECT name FROM customer WHERE customer_id NOT IN ( SELECT customer_id FROM salesorder ); (from Bruce Momjian's book) I get no rows if the result column returned by the subselect contains NULL values. It works as expected if I remove the NULL values from the result set. Is this behaviour correct and if so, why? I am using 7.1 beta 4. Regards, Frank
Mr. Joerdens, > I get no rows if the result column returned by the subselect > contains NULL values. It works as expected if I remove the NULL values > from the result set. Is this behaviour correct and if so, why? I can see how that bug would happen. You may want to forward your e-mail to pgsql-bugs. Regardless, you'll find that you get faster results (as well as avoiding the NULL bug) if you use the following form of the query: SELECT name FROM customer WHERE NOT EXISTS ( SELECT customer_id FROM salesorder WHERE customer_id = customer.customer_id ); Bruce, you may want to consider editing your next edition to include the above modification. WHERE ... NOT IN is a bad idea for any subselect on medium-large tables. -Josh Berkus -- ______AGLIO DATABASE SOLUTIONS___________________________ Josh Berkus Complete informationtechnology josh@agliodbs.com and data management solutions (415) 565-7293 for law firms, small businesses fax 621-2533 and non-profit organizations. San Francisco
> When doing a subselect with NOT IN, as in > > SELECT name > FROM customer > WHERE customer_id NOT IN ( > SELECT customer_id > FROM salesorder > ); > > (from Bruce Momjian's book) > > I get no rows if the result column returned by the subselect > contains NULL values. It works as expected if I remove the NULL values > from the result set. Is this behaviour correct and if so, why? > > I am using 7.1 beta 4. Read more in the book. It covers subqueries with nulls, bottom of pages 96. Not sure about web URL but it is in the subqueries section titled "NOT IN and Subqueries with NULL Values". -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Frank Joerdens writes: > When doing a subselect with NOT IN, as in > > SELECT name > >FROM customer > WHERE customer_id NOT IN ( > SELECT customer_id > >FROM salesorder > ); > > (from Bruce Momjian's book) > > I get no rows if the result column returned by the subselect > contains NULL values. It works as expected if I remove the NULL values > >from the result set. Is this behaviour correct and if so, why? It is correct. customer_id NOT IN (value1, value2, value3, ...) (which is what the subselect would essentially resolve to) is equivalent to NOT (customer_id = value1 OR customer_id = value2 OR customer_id = value3 ...) Say value2 is NULL. Then we have NOT (customer_id = value1 OR customer_id = NULL OR customer_id = value3 ...) NOT (customer_id = value1 OR NULL OR customer_id = value3 ...) NOT (NULL) NULL which means FALSE in a WHERE condition, so no rows are returned. Note that 'xxx = NULL' is different from 'xxx IS NULL'. Also note that NULL is not the same as FALSE in general. -- Peter Eisentraut peter_e@gmx.net http://yi.org/peter-e/
On Thu, 1 Mar 2001, Frank Joerdens wrote: > When doing a subselect with NOT IN, as in > > SELECT name > FROM customer > WHERE customer_id NOT IN ( > SELECT customer_id > FROM salesorder > ); > > (from Bruce Momjian's book) > > I get no rows if the result column returned by the subselect > contains NULL values. It works as expected if I remove the NULL values > from the result set. Is this behaviour correct and if so, why? > > I am using 7.1 beta 4. I believe it may be actually correct. If my reading of the spec is correct (which it possibly is not), customer_id NOT IN (subselect) is effectively, NOT ( customer_id = ANY (subselect) ) and then: Using the rules for ANY, If customer_id=<inner customer_id> for at least one row, IN returns trueso NOT IN returns false. If customer_id=<inner customer_id> is false for every row, IN returnsfalse so NOT IN returns true. Otherwise IN and NOT IN both return unknown. Since customer_id=NULL is unknown, you're getting at least one unknown in the ANY expression so NOT IN doesn't return true, it returns unknown which is not sufficient for making the where clause return the row.
this is kind of weird but it is how it works. You cannot use equality for null... Null does not equal Null Null means no value, since it's not a value it can't equal anything another no value. SELECT name FROM customer WHERE customer_id NOT IN ( SELECT customer_id FROM salesorder ) and customer_id is not null; should work Ken Frank Joerdens wrote: > When doing a subselect with NOT IN, as in > > SELECT name > FROM customer > WHERE customer_id NOT IN ( > SELECT customer_id > FROM salesorder > ); > > (from Bruce Momjian's book) > > I get no rows if the result column returned by the subselect > contains NULL values. It works as expected if I remove the NULL values > from the result set. Is this behaviour correct and if so, why? > > I am using 7.1 beta 4. > > Regards, Frank
Frank Joerdens <frank@joerdens.de> writes: > When doing a subselect with NOT IN, as in > SELECT name > FROM customer > WHERE customer_id NOT IN ( > SELECT customer_id > FROM salesorder > ); > I get no rows if the result column returned by the subselect > contains NULL values. It works as expected if I remove the NULL values > from the result set. Is this behaviour correct and if so, why? This is correct because of SQL's 3-valued boolean logic. SQL92 defines "A NOT IN B" as equivalent to "NOT (A = SOME B)", and the latter has the rules c) If the implied <comparison predicate> is true for at least one row RT in T, then "R <comp op> <some>T" is true. d) If T is empty or if the implied <comparison predicate> is false for every row RT in T, then "R<comp op> <some> T" is false. e) If "R <comp op> <quantifier> T" is neither true nor false, then it is unknown. Now the implied comparison will generate TRUE for the subselect rows that contain a matching customer_id, FALSE for the rows that contain non-matching (but not null) customer_id, and UNKNOWN (null) for the rows that contain nulls. So if you have nulls then case (d) never holds: the result of A = SOME B is either true or unknown. And so the result of NOT IN is either false or unknown, and either way the outer WHERE fails. This can be justified intuitively if you consider that null means "don't know": you can say for sure that the target customer_id IS in the subselect if you find it there, but you can't say for sure that it IS NOT there, because you don't know all the subselect result elements. Bottom line: you probably want to suppress nulls in the subselect... regards, tom lane
> SELECT name > FROM customer > WHERE NOT EXISTS ( > SELECT customer_id > FROM salesorder > WHERE customer_id = customer.customer_id > ); > > Bruce, you may want to consider editing your next edition to include the > above modification. WHERE ... NOT IN is a bad idea for any subselect on > medium-large tables. FAQ item mentions this, and section 8.2 shows eqivalency at the end of the section. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026