Thread: Performance of subselects
Hi list, if I want to find all records from a table that don't have a matching record in another table there are at least two ways to do it: Using a left outer join or using a subselect. I always thought that the planner would create identical plans for both approaches, but actually they are quite different which leads to a bad performance in one case. I tried the following test case: chschroe=# create table a (id integer primary key); NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "a_pkey" for table "a" CREATE TABLE chschroe=# create table b (id serial not null, fk integer not null, primary key (id, fk)); NOTICE: CREATE TABLE will create implicit sequence "b_id_seq" for serial column "b.id" NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "b_pkey" for table "b" CREATE TABLE chschroe=# insert into a select generate_series(1, 500000); INSERT 0 500000 chschroe=# insert into b(fk) select generate_series(1, 450000); INSERT 0 450000 chschroe=# analyze a; ANALYZE chschroe=# analyze b; ANALYZE chschroe=# explain analyze select * from b where fk not in (select id from a); QUERY PLAN --------------------------------------------------------------------------------------------------------------------------- Seq Scan on b (cost=10645.00..1955718703.00 rows=225000 width=8) (actual time=65378590.167..65378590.167 rows=0 loops=1) Filter: (NOT (subplan)) SubPlan -> Materialize (cost=10645.00..18087.00 rows=500000 width=4) (actual time=0.008..72.326 rows=225000 loops=450000) -> Seq Scan on a (cost=0.00..7703.00 rows=500000 width=4) (actual time=0.008..894.163 rows=450000 loops=1) Total runtime: 65378595.489 ms (6 rows) chschroe=# explain analyze select b.* from b left outer join a on b.fk = a.id where a.id is null; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------- Hash Left Join (cost=16395.00..38041.00 rows=225000 width=8) (actual time=1040.840..1040.840 rows=0 loops=1) Hash Cond: (b.fk = a.id) Filter: (a.id IS NULL) -> Seq Scan on b (cost=0.00..6933.00 rows=450000 width=8) (actual time=0.010..149.508 rows=450000 loops=1) -> Hash (cost=7703.00..7703.00 rows=500000 width=4) (actual time=408.126..408.126 rows=500000 loops=1) -> Seq Scan on a (cost=0.00..7703.00 rows=500000 width=4) (actual time=0.007..166.168 rows=500000 loops=1) Total runtime: 1041.945 ms (7 rows) Is there any difference between the two approaches that explain why the plans are so different? There would be a difference if the subselect could generate null values, but since the id field is a primary key field, it should be implicitly declared not null. Another interesting thing: If table "a" contains only 400,000 rows (instead of 500,000) the query planner decides to use a hashed subplan and performance is fine again: chschroe=# explain analyze select * from b where fk not in (select id from a); QUERY PLAN ------------------------------------------------------------------------------------------------------------------ Seq Scan on b (cost=7163.00..15221.00 rows=225000 width=8) (actual time=472.969..497.096 rows=50000 loops=1) Filter: (NOT (hashed subplan)) SubPlan -> Seq Scan on a (cost=0.00..6163.00 rows=400000 width=4) (actual time=0.010..124.503 rows=400000 loops=1) Total runtime: 509.632 ms (5 rows) Why this different plan? All tests have been performed on a PostgreSQL 8.2.9 server: chschroe=# select version(); version ------------------------------------------------------------------------------------------------------------------ PostgreSQL 8.2.9 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20061115 (prerelease) (SUSE Linux) (1 row) Regards, Christian -- Deriva GmbH Tel.: +49 551 489500-42 Financial IT and Consulting Fax: +49 551 489500-91 Hans-Böckler-Straße 2 http://www.deriva.de D-37079 Göttingen Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer
On Thu, Mar 5, 2009 at 11:25 PM, Christian Schröder <cs@deriva.de> wrote: > Hi list, > if I want to find all records from a table that don't have a matching record > in another table there are at least two ways to do it: Using a left outer > join or using a subselect. I always thought that the planner would create > identical plans for both approaches, but actually they are quite different > which leads to a bad performance in one case. > I tried the following test case: SNIP > All tests have been performed on a PostgreSQL 8.2.9 server: Have you looked at the release notes for 8.2.12? Especially this line: Fix planner misestimation of selectivity when transitive equality is applied to an outer-join clause (Tom) This could result in bad plans for queries like ... from a left join b on a.a1 = b.b1 where a.a1 = 42 ...
=?ISO-8859-1?Q?Christian_Schr=F6der?= <cs@deriva.de> writes: > if I want to find all records from a table that don't have a matching > record in another table there are at least two ways to do it: Using a > left outer join or using a subselect. I always thought that the planner > would create identical plans for both approaches, but actually they are > quite different which leads to a bad performance in one case. No, they're not the same; NOT IN has different semantics for nulls. > Another interesting thing: If table "a" contains only 400,000 rows > (instead of 500,000) the query planner decides to use a hashed subplan > and performance is fine again: You're probably at the threshold where it doesn't think the hashtable would fit in work_mem. regards, tom lane
Tom Lane wrote: > No, they're not the same; NOT IN has different semantics for nulls. > But in this case the column in the subselect has a not-null constraint. Does the planner recognize this constraint? > You're probably at the threshold where it doesn't think the hashtable > would fit in work_mem. > I have read in the docs that the "work_mem" value should be increased carefully because the total memory used can be many times the value of "work_mem". Is there any statistics available about how many concurrent sort or hash operations are running and how much memory they consume? This would help to find out if the value can be changed without running out of memory. Regards, Christian Schröder -- Deriva GmbH Tel.: +49 551 489500-42 Financial IT and Consulting Fax: +49 551 489500-91 Hans-Böckler-Straße 2 http://www.deriva.de D-37079 Göttingen Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer
On Sun, Mar 8, 2009 at 6:37 PM, Christian Schröder <cs@deriva.de> wrote: > Tom Lane wrote: >> >> No, they're not the same; NOT IN has different semantics for nulls. >> > > But in this case the column in the subselect has a not-null constraint. Does > the planner recognize this constraint? not in this case, afaik >> You're probably at the threshold where it doesn't think the hashtable >> would fit in work_mem. work_mem constraints amount of memory allocated per connection, hence you can run out of memory if too many connections try to use too much of it at the same time, that's why it is advisable to set work_mem per connection/query, should the connection/query require more. -- GJ
On Sun, Mar 8, 2009 at 12:47 PM, Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote: > work_mem constraints amount of memory allocated per connection, hence Actually, it's per sort. And there can be > 1 sort per query. > you can run out of memory if too many connections try to use too much > of it at the same time, that's why it is advisable to set work_mem per > connection/query, should the connection/query require more. Definitely.
2009/3/9 Christian Schröder <cs@deriva.de>: > > I understand why this is advisable; however, something inside me hates the > idea to put this kind of database specific stuff inside an application. How > about portability? Why does the application developer have to know about > database internals? He knows sql, that should be sufficient. > I have the (maybe naive) idea of a clear separation of database > administration (including performance tuning) and application development. > Is this idea completely wrong? > and it is a good idea, especially since you might want to consider heterogeneous DB target. Personally , I put views in DB, and wrap it in a nice LIB. That also means, that I can change things in one place. Just make sure, that you have proper test harness around that sort of LIB, and everything should be okay. I've seen projects in the past, where authors were trying to make queries work on mysql/psql - and that just doesn't fly with me. -- GJ
Scott Marlowe wrote: >> you can run out of memory if too many connections try to use too much >> of it at the same time, that's why it is advisable to set work_mem per >> connection/query, should the connection/query require more. >> > Definitely. > I understand why this is advisable; however, something inside me hates the idea to put this kind of database specific stuff inside an application. How about portability? Why does the application developer have to know about database internals? He knows sql, that should be sufficient. I have the (maybe naive) idea of a clear separation of database administration (including performance tuning) and application development. Is this idea completely wrong? Regards, Christian -- Deriva GmbH Tel.: +49 551 489500-42 Financial IT and Consulting Fax: +49 551 489500-91 Hans-Böckler-Straße 2 http://www.deriva.de D-37079 Göttingen Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer
2009/3/9 Christian Schröder <cs@deriva.de>: > Scott Marlowe wrote: >>> >>> you can run out of memory if too many connections try to use too much >>> of it at the same time, that's why it is advisable to set work_mem per >>> connection/query, should the connection/query require more. >>> >> >> Definitely. >> > > I understand why this is advisable; however, something inside me hates the > idea to put this kind of database specific stuff inside an application. How > about portability? Why does the application developer have to know about > database internals? He knows sql, that should be sufficient. > I have the (maybe naive) idea of a clear separation of database > administration (including performance tuning) and application development. > Is this idea completely wrong? You can always use a different account for things that need big work_mem (typically reporting queries etc.) and alter that user to have a different work_mem. That all the dev needs to know is which account to use. You can also set it by database if that's a better fit.
2009/3/6 Christian Schröder <cs@deriva.de>
Hi list,
if I want to find all records from a table that don't have a matching record in another table there are at least two ways to do it: Using a left outer join or using a subselect. I always thought that the planner would create identical plans for both approaches, but actually they are quite different which leads to a bad performance in one case.
Couldn't you also use: SELECT fk FROM b EXCEPT SELECT id FROM a;
On Mon, Mar 09, 2009 at 11:45:38AM +0100, Christian Schröder wrote: > Scott Marlowe wrote: >>> you can run out of memory if too many connections try to use too >>> much of it at the same time, that's why it is advisable to set >>> work_mem per connection/query, should the connection/query require >>> more. >>> >> Definitely. >> > I understand why this is advisable; however, something inside me > hates the idea to put this kind of database specific stuff inside > an application. How about portability? What about it? Portability among DBMS back-ends is an extremely expensive and complicated proposition no matter how you approach it. Unless your main value proposition is that portability--an entity-relationship diagram extraction tool, for example--it's usually not worth even attempting this. > Why does the application developer have to know about database > internals? He knows sql, that should be sufficient. It's not. > I have the (maybe naive) idea of a clear separation of database > administration (including performance tuning) and application > development. Is this idea completely wrong? Yes. Fortunately, knowing this, you can adjust your expectations and your development plan. :) Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate