Thread: IN or EXISTS
Hi all, I have read things someplace saying not exists was better than not in... or something like that. Not sure if that was for in/exists and not in/not exists, and for a lot of records or not. Here is my setup: My website has a general table, let say 60k rows. Its mostly read-only. Every once and a while we get updated data, so I: create schema upd; create table upd.general(like public.general); Then I dump the new data into upd.general. (This has many table's and steps, I'm simplifying it here). For the last step, I want to: begin; delete from public.general where gid in (select gid from upd.general); insert into public.general select * from upd.general; ... 7 other tables same way ... commit; Most of the time upd.general will be < 500 rows. Every once and a while things get messed up and we just update the entire database, so count(*) upd.general == count(*) public.general. My question is: fast is nice, but safe and less resource intensive is better, so which would I probably like better: delete from public.general where gid in (select gid from upd.general); or -- currently dont have and index, so create index general_pk on upd.general(gid); delete from public.general a where exists(select 1 from upd.general b where a.gid=b.gid); Thanks for any suggestions, -Andy
On 31/08/2011 4:30 AM, Andy Colson wrote: > Hi all, > > I have read things someplace saying not exists was better than not > in... or something like that. Not sure if that was for in/exists and > not in/not exists, and for a lot of records or not. > `EXISTS' may perform faster than `IN', yes. Using `IN' it is necessary to build a list of values then iterate over them to check for a match. By contrast, `EXISTS' may use a simple index lookup or the like to test for the presence of a value. On the other hand, the `IN' subquery is uncorrelated needs only run once, where the `EXISTS' subquery is correlated and has to run once for every outer record. That means that the `IN' list approach can be a lot faster where the subquery in question is relatively time consuming for the number of values it returns. For example, if the `IN' query returns only 5 values and takes 100ms, you're scanning 1 million records in the outer query, and the subquery `EXISTS' version would take 50ms, using `IN' is a no-brainer since 1 million times 50ms will be a lot slower than 1 times 100ms plus the time required to scan 5 elements 1 million times. Another complication is the possible presence of NULL in an IN list. Getting NULLs in `IN' lists is a common source of questions on this list, because people are quite surprised by how it works. EXISTS avoids the NULL handling issue (and in the process demonstrates how woefully inconsistent SQL's handling of NULL really is). Theoretically the query planner could transform: SELECT * from y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE z.y_id IS NOT NULL); into: SELECT * FROM y WHERE EXISTS (SELECT 1 FROM z WHERE z.y_id = y.id) ... or vice versa depending on which it thought would be faster. AFAIK it doesn't currently do this. To be able to do it the planner would need to know how to estimate the cost of scanning an `IN' result list. It'd also need to be able to use constraints on the target table to prove that the result of the `IN' may not contain nulls. To transform the EXISTS version into the IN version where it'd be more efficient, it'd also have to be able to use constraints on the target table to prove that results of a SELECT would be unique without explicit deduplication. All this makes me wonder ... does Pg currently support sorting IN lists and using a binary search? It'd be pretty nice to be able to prove that: SELECT * from y WHERE y.id IN (SELECT z.y_id FROM z); is equvalent to: SELECT * FROM y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE z_id IS NOT NULL) ... and either transform it to an EXISTS test or add an ORDER BY z_id and flag the resultset as sorted so a binary search could be done on it whenever a row hits the IN test. -- Craig Ringer
On 8/30/2011 8:33 PM, Craig Ringer wrote: > On 31/08/2011 4:30 AM, Andy Colson wrote: >> Hi all, >> >> I have read things someplace saying not exists was better than not >> in... or something like that. Not sure if that was for in/exists and >> not in/not exists, and for a lot of records or not. >> > `EXISTS' may perform faster than `IN', yes. Using `IN' it is necessary > to build a list of values then iterate over them to check for a match. > By contrast, `EXISTS' may use a simple index lookup or the like to test > for the presence of a value. > > On the other hand, the `IN' subquery is uncorrelated needs only run > once, where the `EXISTS' subquery is correlated and has to run once for > every outer record. That means that the `IN' list approach can be a lot > faster where the subquery in question is relatively time consuming for > the number of values it returns. For example, if the `IN' query returns > only 5 values and takes 100ms, you're scanning 1 million records in the > outer query, and the subquery `EXISTS' version would take 50ms, using > `IN' is a no-brainer since 1 million times 50ms will be a lot slower > than 1 times 100ms plus the time required to scan 5 elements 1 million > times. > > Another complication is the possible presence of NULL in an IN list. > Getting NULLs in `IN' lists is a common source of questions on this > list, because people are quite surprised by how it works. EXISTS avoids > the NULL handling issue (and in the process demonstrates how woefully > inconsistent SQL's handling of NULL really is). > > Theoretically the query planner could transform: > > SELECT * from y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE > z.y_id IS NOT NULL); > > into: > > SELECT * FROM y WHERE EXISTS (SELECT 1 FROM z WHERE z.y_id = y.id) > > ... or vice versa depending on which it thought would be faster. AFAIK > it doesn't currently do this. To be able to do it the planner would need > to know how to estimate the cost of scanning an `IN' result list. It'd > also need to be able to use constraints on the target table to prove > that the result of the `IN' may not contain nulls. To transform the > EXISTS version into the IN version where it'd be more efficient, it'd > also have to be able to use constraints on the target table to prove > that results of a SELECT would be unique without explicit deduplication. > > All this makes me wonder ... does Pg currently support sorting IN lists > and using a binary search? It'd be pretty nice to be able to prove that: > > SELECT * from y WHERE y.id IN (SELECT z.y_id FROM z); > > is equvalent to: > > SELECT * FROM y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE z_id > IS NOT NULL) > > ... and either transform it to an EXISTS test or add an ORDER BY z_id > and flag the resultset as sorted so a binary search could be done on it > whenever a row hits the IN test. > > -- > Craig Ringer > Yeah... my current code uses IN. Most of my updates are small, so my inner list is 500 integers. It runs fine. What I'm worried about is when I update the entire table, so my inner list is 60k integers. Maybe I'm just worrying for naught. I tested a table with 100k rows, ran both with explain analyzes, and they look the same: Delete (cost=11186.26..20817.60 rows=25911 width=12) (actual time=408.138..408.138 rows=0 loops=1) -> Hash Semi Join (cost=11186.26..20817.60 rows=25911 width=12) (actual time=61.997..182.573 rows=105434 loops=1) Hash Cond: (public.general.gid = upd.general.gid) -> Seq Scan on general (cost=0.00..9113.11 rows=25911 width=10) (actual time=0.004..42.364 rows=105434 loops=1) -> Hash (cost=9868.34..9868.34 rows=105434 width=10) (actual time=61.958..61.958 rows=105434 loops=1) Buckets: 16384 Batches: 1 Memory Usage: 4531kB -> Seq Scan on general (cost=0.00..9868.34 rows=105434 width=10) (actual time=0.003..34.372 rows=105434 loops=1) With or without an index, (even if I ANALYZE it) it still does a table scan and builds a hash. Both IN and EXISTS act the same way. I assume: Buckets: 16384 Batches: 1 Memory Usage: 4531kB That means a total of 4.5 meg of ram was used for the hash, so if my work_mem was lower than that it would swap? (or choose a different plan?) I'll only ever be running one update at a time, so I'm not worried about multiple connections running at once. Anyway, I'll just leave it alone (and stop optimizing things that dont need it) -Andy
On 31 Srpen 2011, 15:59, Andy Colson wrote: > I assume: > Buckets: 16384 Batches: 1 Memory Usage: 4531kB > > That means a total of 4.5 meg of ram was used for the hash, so if my > work_mem was lower than that it would swap? (or choose a different plan?) Why don't you try that? Just set the work_mem to 1MB or so and run the query. I think it'll use the same plan but multiple batches - read just part of the inner table so that the hash table fits into work_mem, scan the outer table etc. The downside is it'd rescan the outer table several times. Tomas
On Wed, 2011-08-31 at 09:33 +0800, Craig Ringer wrote: > On the other hand, the `IN' subquery is uncorrelated needs only run > once, where the `EXISTS' subquery is correlated and has to run once for > every outer record. If the EXISTS looks semantically similar to an IN (aside from NULL semantics), then it can be made into a semijoin. It doesn't require re-executing any part of the plan. I don't think there are any cases where [NOT] IN is an improvement, am I mistaken? > Another complication is the possible presence of NULL in an IN list. > Getting NULLs in `IN' lists is a common source of questions on this > list, because people are quite surprised by how it works. EXISTS avoids > the NULL handling issue (and in the process demonstrates how woefully > inconsistent SQL's handling of NULL really is). Absolutely. The NULL behavior of IN is what makes it hard to optimize, and therefore you should use EXISTS instead if the semantics are suitable. > Theoretically the query planner could transform: > > SELECT * from y WHERE y.id IN (SELECT DISTINCT z.y_id FROM z WHERE > z.y_id IS NOT NULL); > > into: > > SELECT * FROM y WHERE EXISTS (SELECT 1 FROM z WHERE z.y_id = y.id) > > ... or vice versa depending on which it thought would be faster. Although those two queries are semantically the same (I think), a lot of very similar pairs of queries are not equivalent. For instance, if it was a NOT IN you couldn't change that to a NOT EXISTS. Regards, Jeff Davis