"large" IN/NOT IN subqueries result in query returning wrong data - Mailing list pgsql-sql

From George Pavlov
Subject "large" IN/NOT IN subqueries result in query returning wrong data
Date
Msg-id CCB89282FCE1024EA3DCE687A96A5164039EBD13@ehost010-6.exch010.intermedia.net
Whole thread Raw
Responses Re: "large" IN/NOT IN subqueries result in query returning wrong data  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: "large" IN/NOT IN subqueries result in query returning  (John McCawley <nospam@hardgeus.com>)
List pgsql-sql
The following looks like a bug to me, but please someone review and tell
me what I might be missing. Seems that past a certain result set size a
"[NOT] IN (subquery)" stops behaving as expected and returns 0 matches
even when there should be matches. No errors are returned, just faulty
data. The exact threshholds seem to depend on the data, subquery and
possibly the indexes in place. Nothing in the documentation mentions a
limit to what "IN subquery" can take so I am quite dismayed (if there is
a limit I would expect an ERROR/WARNING/something, but not wrong data).
Read below for details. This has been tried on PG 8.0.4 and 8.1.1 on
WinXP, and 8.0.4 and 8.0.5 on Linux.

I have two tables: t1 (id, name, a_type) t2 (id, name)

The t1 table is "big" (483211 rows), the name column contains
duplicates, the a_type has only two values. The t2 table is "small" (40
rows), the name values are unique. Some, but not all, t2.name values are
present in t1.name. To be precise, there are 10 t2.name values that do
not occur in t1.name (this is based on extraneous knowledge). I want to
find out which of the t2.name values are not ever used in t1.name:

select count(*) from t2 where t2.name not in ( select t1.name from t1);
--> 0

This should return 10, instead it returns 0!!! Adding a LIMIT to the
subquery and doing some trial and error produces very interesting
results:

select count(*) from t2 where t2.name not in ( select t1.name from t1 limit 261683)
--> 13
select count(*) from t2 where t2.name not in ( select t1.name from t1 limit 261684)
--> 0

What is so magical about 261683? The JOIN alternative produces the
correct results no matter what:

select count(*) from t2 left join t1 using (name)where t1.name is null
--> 10

This pretty much summarizes the issue. Any thoughts greatly appreciated.
Follow a few variations to show how the threshhold varies.

-- restrict to only one a_type
--
select count(*) from t2 where t2.name not in ( select t1.name from t1 where t1.a_type='P')
--> 0
select count(*) from t2 where t2.name not in ( select t1.name from t1 where t1.a_type='P' limit 289014)
--> 11
select count(*) from t2 where t2.name not in ( select t1.name from t1 where t1.a_type='P' limit 289015)
--> 0
select count(*) from t2 left join (select * from t1 where a_type='P') t1 using (name)where t1.name is null
--> 11

-- try distinct instead
--
select count(*) from t2 where t2.name not in ( select distinct t1.name from t1)
--> 0
select count(*) from t2 where t2.name not in ( select distinct t1.name from t1 limit 56001)
--> 10
select count(*) from t2 where t2.name not in ( select distinct t1.name from t1 limit 56002)
--> 0

-- distinct + a_type restriction (adding the a_type restriction
-- here _lowers_ the threshhold)
--
select count(*) from t2 where t2.name not in ( select distinct t1.name from t1 where t1.a_type='P')
--> 0
select count(*) from t2 where t2.name not in ( select distinct t1.name from t1 where t1.a_type='P' limit 33620)
--> 11
select count(*) from t2 where t2.name not in ( select distinct t1.name from t1 where t1.a_type='P' limit 33621)
--> 0



pgsql-sql by date:

Previous
From: "Ken Winter"
Date:
Subject: Re: The Information Schema vs the PG Catalog
Next
From: Tom Lane
Date:
Subject: Re: "large" IN/NOT IN subqueries result in query returning wrong data