Thread: Performance of subselects

Performance of subselects

From: Christian Schröder
Hi list,
If I want to find all records from a table that don't have a matching
record in another table, there are at least two ways to do it: using a
left outer join or using a subselect. I always thought that the planner
would create identical plans for both approaches, but they are actually
quite different, which leads to bad performance in one case.
I tried the following test case:

chschroe=# create table a (id integer primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "a_pkey" for table "a"
CREATE TABLE
chschroe=# create table b (id serial not null, fk integer not null, primary key (id, fk));
NOTICE:  CREATE TABLE will create implicit sequence "b_id_seq" for serial column "b.id"
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "b_pkey" for table "b"
CREATE TABLE
chschroe=# insert into a select generate_series(1, 500000);
INSERT 0 500000
chschroe=# insert into b(fk) select generate_series(1, 450000);
INSERT 0 450000
chschroe=# analyze a;
ANALYZE
chschroe=# analyze b;
ANALYZE
chschroe=# explain analyze select * from b where fk not in (select id from a);
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Seq Scan on b  (cost=10645.00..1955718703.00 rows=225000 width=8) (actual time=65378590.167..65378590.167 rows=0 loops=1)
   Filter: (NOT (subplan))
   SubPlan
     ->  Materialize  (cost=10645.00..18087.00 rows=500000 width=4) (actual time=0.008..72.326 rows=225000 loops=450000)
           ->  Seq Scan on a  (cost=0.00..7703.00 rows=500000 width=4) (actual time=0.008..894.163 rows=450000 loops=1)
 Total runtime: 65378595.489 ms
(6 rows)
chschroe=# explain analyze select b.* from b left outer join a on b.fk = a.id where a.id is null;
                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=16395.00..38041.00 rows=225000 width=8) (actual time=1040.840..1040.840 rows=0 loops=1)
   Hash Cond: (b.fk = a.id)
   Filter: (a.id IS NULL)
   ->  Seq Scan on b  (cost=0.00..6933.00 rows=450000 width=8) (actual time=0.010..149.508 rows=450000 loops=1)
   ->  Hash  (cost=7703.00..7703.00 rows=500000 width=4) (actual time=408.126..408.126 rows=500000 loops=1)
         ->  Seq Scan on a  (cost=0.00..7703.00 rows=500000 width=4) (actual time=0.007..166.168 rows=500000 loops=1)
 Total runtime: 1041.945 ms
(7 rows)

Is there any difference between the two approaches that explains why the
plans are so different? There would be a difference if the subselect
could generate null values, but since the id field is a primary key
field, it should be implicitly declared not null.

Another interesting thing: If table "a" contains only 400,000 rows
(instead of 500,000) the query planner decides to use a hashed subplan
and performance is fine again:

chschroe=# explain analyze select * from b where fk not in (select id from a);
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Seq Scan on b  (cost=7163.00..15221.00 rows=225000 width=8) (actual time=472.969..497.096 rows=50000 loops=1)
   Filter: (NOT (hashed subplan))
   SubPlan
     ->  Seq Scan on a  (cost=0.00..6163.00 rows=400000 width=4) (actual time=0.010..124.503 rows=400000 loops=1)
 Total runtime: 509.632 ms
(5 rows)

Why the different plan?

All tests have been performed on a PostgreSQL 8.2.9 server:
chschroe=# select version();
                                                     version
------------------------------------------------------------------------------------------------------------------
 PostgreSQL 8.2.9 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20061115 (prerelease) (SUSE Linux)
(1 row)

Regards,
    Christian

--
Deriva GmbH                         Tel.: +49 551 489500-42
Financial IT and Consulting         Fax:  +49 551 489500-91
Hans-Böckler-Straße 2                  http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer



Re: Performance of subselects

From: Scott Marlowe
On Thu, Mar 5, 2009 at 11:25 PM, Christian Schröder <cs@deriva.de> wrote:
> Hi list,
> If I want to find all records from a table that don't have a matching record
> in another table, there are at least two ways to do it: using a left outer
> join or using a subselect. I always thought that the planner would create
> identical plans for both approaches, but they are actually quite different,
> which leads to bad performance in one case.
> I tried the following test case:

SNIP

> All tests have been performed on a PostgreSQL 8.2.9 server:

Have you looked at the release notes for 8.2.12?  Especially this line:

Fix planner misestimation of selectivity when transitive equality is
applied to an outer-join clause (Tom)

This could result in bad plans for queries like ... from a left join b
on a.a1 = b.b1 where a.a1 = 42 ...

Re: Performance of subselects

From: Tom Lane
Christian Schröder <cs@deriva.de> writes:
> If I want to find all records from a table that don't have a matching
> record in another table, there are at least two ways to do it: using a
> left outer join or using a subselect. I always thought that the planner
> would create identical plans for both approaches, but they are actually
> quite different, which leads to bad performance in one case.

No, they're not the same; NOT IN has different semantics for nulls.
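
For illustration (a minimal sketch with made-up tables t1 and t2): a single
NULL coming out of the subselect means NOT IN can never return true, because
"x <> NULL" is unknown, while the anti-join form still finds the unmatched row:

create table t1 (v integer);
create table t2 (v integer);
insert into t1 values (1), (2);
insert into t2 values (1), (NULL);

-- NOT IN: the NULL in t2 makes the test unknown for every row, so 0 rows come back
select * from t1 where v not in (select v from t2);

-- anti-join: still returns the row with v = 2
select t1.* from t1 left join t2 on t1.v = t2.v where t2.v is null;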

> Another interesting thing: If table "a" contains only 400,000 rows
> (instead of 500,000) the query planner decides to use a hashed subplan
> and performance is fine again:

You're probably at the threshold where it doesn't think the hashtable
would fit in work_mem.
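
For example (an untested sketch; the 64MB figure is arbitrary), bumping
work_mem for the session should be enough to bring the hashed subplan back
for the 500,000-row case:

set work_mem = '64MB';
explain analyze select * from b where fk not in (select id from a);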

            regards, tom lane

Re: Performance of subselects

From: Christian Schröder
Tom Lane wrote:
> No, they're not the same; NOT IN has different semantics for nulls.
>
But in this case the column in the subselect has a not-null constraint.
Does the planner recognize this constraint?
> You're probably at the threshold where it doesn't think the hashtable
> would fit in work_mem.
>
I have read in the docs that the "work_mem" value should be increased
carefully because the total memory used can be many times the value of
"work_mem". Is there any statistics available about how many concurrent
sort or hash operations are running and how much memory they consume?
This would help to find out if the value can be changed without running
out of memory.

Regards,
    Christian Schröder

--
Deriva GmbH                         Tel.: +49 551 489500-42
Financial IT and Consulting         Fax:  +49 551 489500-91
Hans-Böckler-Straße 2                  http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer


Re: Performance of subselects

From: Grzegorz Jaśkiewicz
On Sun, Mar 8, 2009 at 6:37 PM, Christian Schröder <cs@deriva.de> wrote:
> Tom Lane wrote:
>>
>> No, they're not the same; NOT IN has different semantics for nulls.
>>
>
> But in this case the column in the subselect has a not-null constraint. Does
> the planner recognize this constraint?
Not in this case, AFAIK.


>> You're probably at the threshold where it doesn't think the hashtable
>> would fit in work_mem.

work_mem constrains the amount of memory allocated per connection, hence
you can run out of memory if too many connections try to use too much
of it at the same time. That's why it is advisable to set work_mem per
connection/query, should the connection/query require more.
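
For example (a sketch; the value is arbitrary):

-- for the whole session
set work_mem = '128MB';

-- or only for one transaction; SET LOCAL is reverted at commit/rollback
begin;
set local work_mem = '128MB';
-- run the memory-hungry query here
commit;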


--
GJ

Re: Performance of subselects

From: Scott Marlowe
On Sun, Mar 8, 2009 at 12:47 PM, Grzegorz Jaśkiewicz <gryzman@gmail.com> wrote:
> work_mem constrains the amount of memory allocated per connection, hence

Actually, it's per sort. And there can be > 1 sort per query.

> you can run out of memory if too many connections try to use too much
> of it at the same time. That's why it is advisable to set work_mem per
> connection/query, should the connection/query require more.

Definitely.

Re: Performance of subselects

From: Grzegorz Jaśkiewicz
2009/3/9 Christian Schröder <cs@deriva.de>:

>
> I understand why this is advisable; however, something inside me hates the
> idea of putting this kind of database-specific stuff inside an application.
> How about portability? Why does the application developer have to know about
> database internals? He knows SQL; that should be sufficient.
> I have the (maybe naive) idea of a clear separation of database
> administration (including performance tuning) and application development.
> Is this idea completely wrong?
>

And it is a good idea, especially since you might want to target a
heterogeneous set of databases. Personally, I put views in the DB and wrap
them in a nice library. That also means I can change things in one place.
Just make sure you have a proper test harness around that sort of library,
and everything should be okay.
I've seen projects in the past where the authors were trying to make the
same queries work on both MySQL and PostgreSQL, and that just doesn't fly
with me.
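
For example (a sketch reusing the tables from the original post; the view
name is made up), the anti-join can live behind a view so the application
only ever issues a plain SELECT:

create view b_without_a as
    select b.* from b left join a on b.fk = a.id where a.id is null;

-- application / wrapper library side:
select * from b_without_a;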

--
GJ

Re: Performance of subselects

From: Christian Schröder
Scott Marlowe wrote:
>> you can run out of memory if too many connections try to use too much
>> of it at the same time. That's why it is advisable to set work_mem per
>> connection/query, should the connection/query require more.
>>
> Definitely.
>
I understand why this is advisable; however, something inside me hates
the idea of putting this kind of database-specific stuff inside an
application. How about portability? Why does the application developer
have to know about database internals? He knows SQL; that should be
sufficient.
I have the (maybe naive) idea of a clear separation of database
administration (including performance tuning) and application
development. Is this idea completely wrong?

Regards,
    Christian

--
Deriva GmbH                         Tel.: +49 551 489500-42
Financial IT and Consulting         Fax:  +49 551 489500-91
Hans-Böckler-Straße 2                  http://www.deriva.de
D-37079 Göttingen

Deriva CA Certificate: http://www.deriva.de/deriva-ca.cer


Re: Performance of subselects

From: Scott Marlowe
2009/3/9 Christian Schröder <cs@deriva.de>:
> Scott Marlowe wrote:
>>>
>>> you can run out of memory if too many connections try to use too much
>>> of it at the same time. That's why it is advisable to set work_mem per
>>> connection/query, should the connection/query require more.
>>>
>>
>> Definitely.
>>
>
> I understand why this is advisable; however, something inside me hates the
> idea of putting this kind of database-specific stuff inside an application.
> How about portability? Why does the application developer have to know about
> database internals? He knows SQL; that should be sufficient.
> I have the (maybe naive) idea of a clear separation of database
> administration (including performance tuning) and application development.
> Is this idea completely wrong?

You can always use a different account for things that need big
work_mem (typically reporting queries etc.) and alter that user to
have a different work_mem.  That way all the dev needs to know is which
account to use.  You can also set it by database if that's a better
fit.
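
For example (a sketch; the role name, database name, and value are
placeholders):

-- every session opened by the reporting account gets a bigger work_mem
alter role reporting_user set work_mem = '256MB';

-- or attach the setting to a particular database
alter database reports set work_mem = '256MB';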

Re: Performance of subselects

From: Thom Brown
2009/3/6 Christian Schröder <cs@deriva.de>

Hi list,
If I want to find all records from a table that don't have a matching record in another table, there are at least two ways to do it: using a left outer join or using a subselect. I always thought that the planner would create identical plans for both approaches, but they are actually quite different, which leads to bad performance in one case.

Couldn't you also use: SELECT fk FROM b EXCEPT SELECT id FROM a;

Re: Performance of subselects

From: David Fetter
On Mon, Mar 09, 2009 at 11:45:38AM +0100, Christian Schröder wrote:
> Scott Marlowe wrote:
>>> you can run out of memory if too many connections try to use too
>>> much of it at the same time. That's why it is advisable to set
>>> work_mem per connection/query, should the connection/query require
>>> more.
>>>
>> Definitely.
>>
> I understand why this is advisable; however, something inside me
> hates the idea of putting this kind of database-specific stuff inside
> an application. How about portability?

What about it?  Portability among DBMS back-ends is an extremely
expensive and complicated proposition no matter how you approach it.
Unless that portability is your main value proposition (an
entity-relationship diagram extraction tool, for example), it's usually
not worth even attempting.

> Why does the application developer have to know about database
> internals?  He knows SQL; that should be sufficient.

It's not.

> I have the (maybe naive) idea of a clear separation of database
> administration (including performance tuning) and application
> development.  Is this idea completely wrong?

Yes.  Fortunately, knowing this, you can adjust your expectations and
your development plan. :)

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate