Re: "large" IN/NOT IN subqueries result in query returning - Mailing list pgsql-sql

From John McCawley
Subject Re: "large" IN/NOT IN subqueries result in query returning
Date
Msg-id 43B19BBA.8050403@hardgeus.com
Whole thread Raw
In response to "large" IN/NOT IN subqueries result in query returning wrong data  ("George Pavlov" <gpavlov@mynewplace.com>)
List pgsql-sql
At a glance I would guess that NULL values in one or both of your tables 
is the culprit.  NULL values always return false.

Example:

A quick test on my database:

select count(*) FROM tbl_employee;
count
------- 2689

select count(*) FROM tbl_employee WHERE username IS NULL;count
-------   35

So I have 35 null values.

create table tbl_foo (username varchar(32));

insert into tbl_foo (username) values ('poop');

select count(*) FROM tbl_employee WHERE tbl_employee.username NOT IN 
(select tbl_foo.username FROM tbl_foo);count
------- 2654

So I only get 2654 values that are NOT IN the set 'poop'...i.e. the NULL 
values are not included when I use the "NOT IN" my query.  Nulls can be 
confusing.

Hope this helps.

George Pavlov wrote:

>The following looks like a bug to me, but please someone review and tell
>me what I might be missing. Seems that past a certain result set size a
>"[NOT] IN (subquery)" stops behaving as expected and returns 0 matches
>even when there should be matches. No errors are returned, just faulty
>data. The exact threshholds seem to depend on the data, subquery and
>possibly the indexes in place. Nothing in the documentation mentions a
>limit to what "IN subquery" can take so I am quite dismayed (if there is
>a limit I would expect an ERROR/WARNING/something, but not wrong data).
>Read below for details. This has been tried on PG 8.0.4 and 8.1.1 on
>WinXP, and 8.0.4 and 8.0.5 on Linux.
>
>I have two tables:
>  t1 (id, name, a_type)
>  t2 (id, name)
>
>The t1 table is "big" (483211 rows), the name column contains
>duplicates, the a_type has only two values. The t2 table is "small" (40
>rows), the name values are unique. Some, but not all, t2.name values are
>present in t1.name. To be precise, there are 10 t2.name values that do
>not occur in t1.name (this is based on extraneous knowledge). I want to
>find out which of the t2.name values are not ever used in t1.name:
>
>select count(*) from t2 where t2.name not in (
>  select t1.name from t1);
>--> 0
>
>This should return 10, instead it returns 0!!! Adding a LIMIT to the
>subquery and doing some trial and error produces very interesting
>results:
>
>select count(*) from t2 where t2.name not in (
>  select t1.name from t1 limit 261683) 
>--> 13
>select count(*) from t2 where t2.name not in (
>  select t1.name from t1 limit 261684) 
>--> 0
>
>What is so magical about 261683? The JOIN alternative produces the
>correct results no matter what:
>
>select count(*) 
> from t2 left join t1 using (name)
> where t1.name is null 
>--> 10
>
>This pretty much summarizes the issue. Any thoughts greatly appreciated.
>Follow a few variations to show how the threshhold varies.
>
>-- restrict to only one a_type
>  
>


pgsql-sql by date:

Previous
From: Tom Lane
Date:
Subject: Re: "large" IN/NOT IN subqueries result in query returning wrong data
Next
From: "George Pavlov"
Date:
Subject: Re: "large" IN/NOT IN subqueries result in query returning wrong data