Thread: "large" IN/NOT IN subqueries result in query returning wrong data
The following looks like a bug to me, but please someone review and tell me what I might be missing. Seems that past a certain result set size a "[NOT] IN (subquery)" stops behaving as expected and returns 0 matches even when there should be matches. No errors are returned, just faulty data. The exact threshholds seem to depend on the data, subquery and possibly the indexes in place. Nothing in the documentation mentions a limit to what "IN subquery" can take so I am quite dismayed (if there is a limit I would expect an ERROR/WARNING/something, but not wrong data). Read below for details. This has been tried on PG 8.0.4 and 8.1.1 on WinXP, and 8.0.4 and 8.0.5 on Linux. I have two tables: t1 (id, name, a_type) t2 (id, name) The t1 table is "big" (483211 rows), the name column contains duplicates, the a_type has only two values. The t2 table is "small" (40 rows), the name values are unique. Some, but not all, t2.name values are present in t1.name. To be precise, there are 10 t2.name values that do not occur in t1.name (this is based on extraneous knowledge). I want to find out which of the t2.name values are not ever used in t1.name: select count(*) from t2 where t2.name not in ( select t1.name from t1); --> 0 This should return 10, instead it returns 0!!! Adding a LIMIT to the subquery and doing some trial and error produces very interesting results: select count(*) from t2 where t2.name not in ( select t1.name from t1 limit 261683) --> 13 select count(*) from t2 where t2.name not in ( select t1.name from t1 limit 261684) --> 0 What is so magical about 261683? The JOIN alternative produces the correct results no matter what: select count(*) from t2 left join t1 using (name)where t1.name is null --> 10 This pretty much summarizes the issue. Any thoughts greatly appreciated. Follow a few variations to show how the threshhold varies. -- restrict to only one a_type -- select count(*) from t2 where t2.name not in ( select t1.name from t1 where t1.a_type='P') --> 0 select count(*) from t2 where t2.name not in ( select t1.name from t1 where t1.a_type='P' limit 289014) --> 11 select count(*) from t2 where t2.name not in ( select t1.name from t1 where t1.a_type='P' limit 289015) --> 0 select count(*) from t2 left join (select * from t1 where a_type='P') t1 using (name)where t1.name is null --> 11 -- try distinct instead -- select count(*) from t2 where t2.name not in ( select distinct t1.name from t1) --> 0 select count(*) from t2 where t2.name not in ( select distinct t1.name from t1 limit 56001) --> 10 select count(*) from t2 where t2.name not in ( select distinct t1.name from t1 limit 56002) --> 0 -- distinct + a_type restriction (adding the a_type restriction -- here _lowers_ the threshhold) -- select count(*) from t2 where t2.name not in ( select distinct t1.name from t1 where t1.a_type='P') --> 0 select count(*) from t2 where t2.name not in ( select distinct t1.name from t1 where t1.a_type='P' limit 33620) --> 11 select count(*) from t2 where t2.name not in ( select distinct t1.name from t1 where t1.a_type='P' limit 33621) --> 0
"George Pavlov" <gpavlov@mynewplace.com> writes: > select count(*) from t2 where t2.name not in ( > select t1.name from t1 limit 261683) > --> 13 > select count(*) from t2 where t2.name not in ( > select t1.name from t1 limit 261684) > --> 0 > What is so magical about 261683? Most likely, the 261684'th row of t1 has a NULL value of name. Many people find the behavior of NOT IN with nulls unintuitive, but it's per SQL spec ... regards, tom lane
At a glance I would guess that NULL values in one or both of your tables is the culprit. NULL values always return false. Example: A quick test on my database: select count(*) FROM tbl_employee; count ------- 2689 select count(*) FROM tbl_employee WHERE username IS NULL;count ------- 35 So I have 35 null values. create table tbl_foo (username varchar(32)); insert into tbl_foo (username) values ('poop'); select count(*) FROM tbl_employee WHERE tbl_employee.username NOT IN (select tbl_foo.username FROM tbl_foo);count ------- 2654 So I only get 2654 values that are NOT IN the set 'poop'...i.e. the NULL values are not included when I use the "NOT IN" my query. Nulls can be confusing. Hope this helps. George Pavlov wrote: >The following looks like a bug to me, but please someone review and tell >me what I might be missing. Seems that past a certain result set size a >"[NOT] IN (subquery)" stops behaving as expected and returns 0 matches >even when there should be matches. No errors are returned, just faulty >data. The exact threshholds seem to depend on the data, subquery and >possibly the indexes in place. Nothing in the documentation mentions a >limit to what "IN subquery" can take so I am quite dismayed (if there is >a limit I would expect an ERROR/WARNING/something, but not wrong data). >Read below for details. This has been tried on PG 8.0.4 and 8.1.1 on >WinXP, and 8.0.4 and 8.0.5 on Linux. > >I have two tables: > t1 (id, name, a_type) > t2 (id, name) > >The t1 table is "big" (483211 rows), the name column contains >duplicates, the a_type has only two values. The t2 table is "small" (40 >rows), the name values are unique. Some, but not all, t2.name values are >present in t1.name. To be precise, there are 10 t2.name values that do >not occur in t1.name (this is based on extraneous knowledge). I want to >find out which of the t2.name values are not ever used in t1.name: > >select count(*) from t2 where t2.name not in ( > select t1.name from t1); >--> 0 > >This should return 10, instead it returns 0!!! Adding a LIMIT to the >subquery and doing some trial and error produces very interesting >results: > >select count(*) from t2 where t2.name not in ( > select t1.name from t1 limit 261683) >--> 13 >select count(*) from t2 where t2.name not in ( > select t1.name from t1 limit 261684) >--> 0 > >What is so magical about 261683? The JOIN alternative produces the >correct results no matter what: > >select count(*) > from t2 left join t1 using (name) > where t1.name is null >--> 10 > >This pretty much summarizes the issue. Any thoughts greatly appreciated. >Follow a few variations to show how the threshhold varies. > >-- restrict to only one a_type > >
Yes, yes, of course... There are nulls in my t1 table. And, of course, NOT IN can return THREE possible values: not just TRUE or FALSE but also NULL... select distinct (moo.goo not in (null)) from moo; --> null select count(*) from moo where moo.goo not in ('gai', 'pan', null) --> 0, no matter what the goo values are Sorry... Thanks! George
On Tue, Dec 27, 2005 at 02:44:51PM -0500, Tom Lane wrote: > "George Pavlov" <gpavlov@mynewplace.com> writes: > > select count(*) from t2 where t2.name not in ( > > select t1.name from t1 limit 261683) > > --> 13 > > select count(*) from t2 where t2.name not in ( > > select t1.name from t1 limit 261684) > > --> 0 > > > What is so magical about 261683? > > Most likely, the 261684'th row of t1 has a NULL value of name. > Many people find the behavior of NOT IN with nulls unintuitive, > but it's per SQL spec ... > > regards, tom lane In 8.0 we get: elein=# select 1 in (NULL, 1, 2); ?column? ---------- t (1 row) elein=# select 3 not in (NULL, 1, 2); ?column? ---------- (1 row) For consistency, either both should return NULL or both return true/false. For completeness testing, the following are correct. Select NULL in/not in any list returns NULL. elein=# select NULL in (1,2); ?column? ---------- (1 row) elein=# select NULL not in (1,2); ?column? ---------- (1 row) elein=# select NULL in (NULL, 1,2); ?column? ---------- (1 row) elein=# select NULL not in (NULL, 1,2); ?column? ---------- (1 row) elein -------------------------------------------------------------- elein@varlena.com Varlena, LLC www.varlena.com (510)655-2584(o) (510)543-6079(c) PostgreSQL Consulting, Support & Training PostgreSQL General Bits http://www.varlena.com/GeneralBits/ -------------------------------------------------------------- AIM: varlenallc Yahoo: AElein Skype: varlenallc -------------------------------------------------------------- I have always depended on the [QA] of strangers.
elein <elein@varlena.com> writes: > In 8.0 we get: > elein=# select 1 in (NULL, 1, 2); > ?column? > ---------- > t > (1 row) > elein=# select 3 not in (NULL, 1, 2); > ?column? > ---------- > (1 row) > For consistency, either both should return NULL or > both return true/false. The above behavior is correct per spec. Feel free to argue its consistency with the SQL committee ;-) Note that the above are not inverses because you changed the lefthand input. You do get consistent results when you just add or omit NOT: regression=# select 1 not in (NULL, 1, 2);?column? ----------f (1 row) regression=# select 3 in (NULL, 1, 2);?column? ---------- (1 row) regards, tom lane
On Tue, Dec 27, 2005 at 07:25:40PM -0500, Tom Lane wrote: > elein <elein@varlena.com> writes: > > In 8.0 we get: > > > elein=# select 1 in (NULL, 1, 2); > > ?column? > > ---------- > > t > > (1 row) > > > elein=# select 3 not in (NULL, 1, 2); > > ?column? > > ---------- > > > (1 row) > > > For consistency, either both should return NULL or > > both return true/false. > > The above behavior is correct per spec. Feel free to argue its > consistency with the SQL committee ;-) Oh, no! Not the committee! > > Note that the above are not inverses because you changed the lefthand > input. You do get consistent results when you just add or omit NOT: Yes, you are right. I skipped the permutations to get down to the point. > > regression=# select 1 not in (NULL, 1, 2); > ?column? > ---------- > f > (1 row) > > regression=# select 3 in (NULL, 1, 2); > ?column? > ---------- > > (1 row) > > regards, tom lane Thanks for your clarification. ~elein elein@varlena.com
elein <elein@varlena.com> writes: > > Note that the above are not inverses because you changed the lefthand > > input. You do get consistent results when you just add or omit NOT: > Yes, you are right. I skipped the permutations to get down to the point. Remember that NULL means "unknown". So "1 IN (1,2,NULL)" *should* be true because regardless of what that unknown value is it's still obvious that 1 really is in the list. And "3 NOT IN (1,2,NULL)" is unknown because it depends on whether that unknown quantity is 3 or not. IN is the same as "= ANY" so "1 IN (1,2,NULL)" is the same as "1=1 OR 1=2 OR 1=NULL" which is true even though the last of the three is null. -- greg