Thread: optimizing queries using IN and EXISTS
Hi, I'm surprised at the difference in speed/execution plan between two logically equivalent queries, one using IN, the otherusing EXISTS. (At least I think they are logically equivalent) I've created a small setup that illustrates what I mean. Consider the following tables: CREATE TABLE foo ( id integer NOT NULL, CONSTRAINT foo_pkey PRIMARY KEY (id ) ) CREATE TABLE bar ( foo_ref integer, value character varying, id integer NOT NULL, CONSTRAINT bar_pkey PRIMARY KEY (id ), CONSTRAINT bar_foo_ref_fkey FOREIGN KEY (foo_ref) REFERENCES foo (id) MATCH SIMPLE ON UPDATE NO ACTION ON DELETE NO ACTION ) The following two queries have very different query plans: SELECT * FROM foo WHERE 'text6' IN (SELECT value FROM bar JOIN foo AS foo2 ON bar.foo_ref = foo2.id WHERE foo2.id = foo.id) and SELECT * FROM foo WHERE EXISTS(SELECT 0 FROM bar JOIN foo AS foo2 ON bar.foo_ref = foo2.id WHERE foo2.id = foo.id AND bar.value = 'text6') Whereas the second one uses the indexes to look up the matching bar rows, the first one performs a full table scan on bar. Given that both queries are logically equivalent, I'm wondering why this optimization isn't made. Is there information missingfor the optimizer to make the better decision? Are these queries not equivalent perhaps? An EXPLAIN ANALYZE on the two queries on filled and analyzed tables highlighting the difference: EXPLAIN ANALYZE SELECT * FROM foo WHERE 'text6' IN (SELECT value FROM bar JOIN foo AS foo2 ON bar.foo_ref = foo2.id WHEREfoo2.id = foo.id) "Seq Scan on foo (cost=0.00..3316934.60 rows=5000 width=4) (actual time=6.416..10803.056 rows=1 loops=1)" " Filter: (SubPlan 1)" " SubPlan 1" " -> Nested Loop (cost=0.00..663.29 rows=1 width=8) (actual time=0.667..1.079 rows=1 loops=10000)" " -> Seq Scan on bar (cost=0.00..655.00 rows=1 width=12) (actual time=0.660..1.072 rows=1 loops=10000)" " Filter: (foo_ref = foo.id)" " -> Index Scan using foo_pkey on foo foo2 (cost=0.00..8.28 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=10000)" " Index Cond: (id = foo.id)" "Total runtime: 10803.088 ms" EXPLAIN ANALYZE SELECT * FROM foo WHERE EXISTS(SELECT 0 FROM bar JOIN foo AS foo2 ON bar.foo_ref = foo2.id WHERE foo2.id= foo.id AND bar.value = 'text6') "Nested Loop (cost=16.58..24.88 rows=1 width=4) (actual time=0.032..0.032 rows=1 loops=1)" " -> HashAggregate (cost=16.58..16.59 rows=1 width=8) (actual time=0.029..0.029 rows=1 loops=1)" " -> Nested Loop (cost=0.00..16.58 rows=1 width=8) (actual time=0.025..0.025 rows=1 loops=1)" " -> Index Scan using bar_value_idx on bar (cost=0.00..8.29 rows=1 width=4) (actual time=0.019..0.020 rows=1loops=1)" " Index Cond: ((value)::text = 'text6'::text)" " -> Index Scan using foo_pkey on foo foo2 (cost=0.00..8.28 rows=1 width=4) (actual time=0.002..0.003 rows=1loops=1)" " Index Cond: (id = bar.foo_ref)" " -> Index Scan using foo_pkey on foo (cost=0.00..8.28 rows=1 width=4) (actual time=0.001..0.002 rows=1 loops=1)" " Index Cond: (id = bar.foo_ref)" "Total runtime: 0.064 ms" Hoping someone sheds some light on this and restores my confidence in the optimizer, Nick Hofstede PS: I know the EXIST can also be rewritten to a JOIN SELECT foo.id FROM foo JOIN bar ON bar.foo_ref = foo.id JOIN foo AS foo2 ON bar.foo_ref = foo2.id WHERE foo2.id = foo.id AND bar.value = 'text6' and ultimately to (thanks to foo2.id = foo.id) SELECT foo.id FROM foo JOIN bar ON bar.foo_ref = foo.id WHERE bar.value = 'text6' .. all of wich have an execution plan and performance similar to the EXISTS query. What I'm concerned about is that the first step from IN to EXISTS isn't made (which also precludes all following optimizationsteps) ________________________________ Inventive Designers' Email Disclaimer: http://www.inventivedesigners.com/email-disclaimer
On 18 July 2012 17:10, Nick Hofstede <Nick.Hofstede@inventivegroup.com> wrote: > Hi, > > I'm surprised at the difference in speed/execution plan between two logically equivalent queries, one using IN, the otherusing EXISTS. (At least I think they are logically equivalent) They are not logically equivalent. http://www.postgresql.org/docs/current/static/functions-subquery.html See the notes about NULL under IN. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
I realize there is a case where IN returns NULL and EXISTS returns FALSE (in case there is a matching bar with bar.valueset to NULL) In that case however, this would result in the foo row not being included in the resultset, which is the same outcome inboth cases. NOT IN vs NOT EXISTS is another story, I agree. So even though the subqueries aren't logically equivalent, I still believe the queries in their totality are. Is this reasoning (knowing the NULL will be treated as FALSE by the WHERE) a bridge too far for the optimizer? I retested with bar.value declared as NOT NULL but that doesn't seem to help. Even though this is a bit disappointing, I think it gave me a feel of what the optimizer knows about and takes into considerationand what not ... With kind regards, Nick ________________________________________ Van: Peter Geoghegan [peter@2ndquadrant.com] Verzonden: woensdag 18 juli 2012 20:40 Aan: Nick Hofstede CC: pgsql-performance@postgresql.org Onderwerp: Re: [PERFORM] optimizing queries using IN and EXISTS On 18 July 2012 17:10, Nick Hofstede <Nick.Hofstede@inventivegroup.com> wrote: > Hi, > > I'm surprised at the difference in speed/execution plan between two logically equivalent queries, one using IN, the otherusing EXISTS. (At least I think they are logically equivalent) They are not logically equivalent. http://www.postgresql.org/docs/current/static/functions-subquery.html See the notes about NULL under IN. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. ________________________________ Inventive Designers' Email Disclaimer: http://www.inventivedesigners.com/email-disclaimer
Nick Hofstede <Nick.Hofstede@inventivegroup.com> writes: > I'm surprised at the difference in speed/execution plan between two logically equivalent queries, one using IN, the otherusing EXISTS. (At least I think they are logically equivalent) > SELECT * > FROM foo > WHERE 'text6' IN (SELECT value > FROM bar > JOIN foo AS foo2 > ON bar.foo_ref = foo2.id > WHERE foo2.id = foo.id) Hm. convert_ANY_sublink_to_join() rejects subqueries that contain any Vars of the parent query level, so the reference to foo.id prevents this from being converted to a semijoin. However, it seems like that's overly restrictive. I'm not sure that we could remove the test altogether, but at least outer vars used in WHERE seem safe. In the meantime, you can recast like this: SELECT * FROM foo WHERE ('text6', id) IN (SELECT value, foo2.id FROM bar JOIN foo AS foo2 ON bar.foo_ref = foo2.id) and still get a semijoin plan from an IN-style query. regards, tom lane
Interesting. Thanks for the work-around. Regards, Nick -----Original Message----- From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Tom Lane Sent: donderdag 19 juli 2012 0:36 To: Nick Hofstede Cc: pgsql-performance@postgresql.org Subject: Re: [PERFORM] optimizing queries using IN and EXISTS Nick Hofstede <Nick.Hofstede@inventivegroup.com> writes: > I'm surprised at the difference in speed/execution plan between two > logically equivalent queries, one using IN, the other using EXISTS. > (At least I think they are logically equivalent) > SELECT * > FROM foo > WHERE 'text6' IN (SELECT value > FROM bar > JOIN foo AS foo2 > ON bar.foo_ref = foo2.id > WHERE foo2.id = foo.id) Hm. convert_ANY_sublink_to_join() rejects subqueries that contain any Vars of the parent query level, so the reference tofoo.id prevents this from being converted to a semijoin. However, it seems like that's overly restrictive. I'm not surethat we could remove the test altogether, but at least outer vars used in WHERE seem safe. In the meantime, you can recast like this: SELECT * FROM foo WHERE ('text6', id) IN (SELECT value, foo2.id FROM bar JOIN foo AS foo2 ON bar.foo_ref = foo2.id) and still get a semijoin plan from an IN-style query. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance -- This message has been scanned for viruses and dangerous content by MailScanner, and is believed to be clean. ________________________________ Inventive Designers' Email Disclaimer: http://www.inventivedesigners.com/email-disclaimer