Thread: full outer join performance

full outer join performance

From
Ben
Date:
Are full outer joins expected to perform much worse than inner joins?
I'm seeing 2 orders of magnitude difference for an almost identical
query. (Well, as "identical" as you can get, comparing a query with an
outer join to one without.) This is on 8.0.3, recently analyzed. Here
are the explain plans:

music=# explain select
music-#  extractbasenamefrompath(files.path),trms.trm
music-# from files,nodes
music-# full outer join trms on (trms.id = nodes.trm)
music-# where
music-#  nodes.fileid = files.id and
music-#  extractbasenamefrompath(files.path) = 'Joy Division/Substance/';
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Hash Join  (cost=3932.55..11891.05 rows=18 width=117)
   Hash Cond: ("outer".fileid = "inner".id)
   ->  Hash Left Join  (cost=3867.51..11391.65 rows=86827 width=44)
         Hash Cond: ("outer".trm = "inner".id)
         ->  Seq Scan on nodes  (cost=0.00..1557.27 rows=86827 width=8)
         ->  Hash  (cost=2867.21..2867.21 rows=88521 width=44)
               ->  Seq Scan on trms  (cost=0.00..2867.21 rows=88521
width=44)
   ->  Hash  (cost=64.99..64.99 rows=18 width=81)
         ->  Index Scan using basename_idx on files  (cost=0.00..64.99
rows=18 width=81)
               Index Cond: (extractbasenamefrompath(path) = 'Joy
Division/Substance/'::text)
(10 rows)

music=# explain select
music-#  extractbasenamefrompath(files.path),trms.trm
music-# from
music-#  nodes,files,trms
music-# where
music-#  nodes.fileid = files.id and
music-#  nodes.trm = trms.id and
music-#  extractbasenamefrompath(files.path) = 'Joy Division/Substance/';
                                         QUERY PLAN
---------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..202.04 rows=18 width=117)
   ->  Nested Loop  (cost=0.00..119.46 rows=18 width=81)
         ->  Index Scan using basename_idx on files  (cost=0.00..64.99
rows=18 width=81)
               Index Cond: (extractbasenamefrompath(path) = 'Joy
Division/Substance/'::text)
         ->  Index Scan using nodes_fileid_idx on nodes
(cost=0.00..3.01 rows=1 width=8)
               Index Cond: (nodes.fileid = "outer".id)
   ->  Index Scan using trms_pkey on trms  (cost=0.00..4.57 rows=1 width=44)
         Index Cond: ("outer".trm = trms.id)
(8 rows)


Re: full outer join performance

From
Tom Lane
Date:
Ben <bench@silentmedia.com> writes:
> Are full outer joins expected to perform much worse than inner joins?

You're getting burnt by the outer join forcing a bad join order.
There's some related discussion here:
http://www.postgresql.org/docs/8.0/static/explicit-joins.html

            regards, tom lane

Re: full outer join performance

From
Ben
Date:
Hrm, as I understand that page, there's not much that can be done about
the join order for outer joins. At least, not when there's on 3 tables
and 1 outer join involved. Am I missing something?

Tom Lane wrote:

>Ben <bench@silentmedia.com> writes:
>
>
>>Are full outer joins expected to perform much worse than inner joins?
>>
>>
>
>You're getting burnt by the outer join forcing a bad join order.
>There's some related discussion here:
>http://www.postgresql.org/docs/8.0/static/explicit-joins.html
>
>            regards, tom lane
>
>


Re: full outer join performance

From
Tom Lane
Date:
Ben <bench@silentmedia.com> writes:
> Hrm, as I understand that page, there's not much that can be done about
> the join order for outer joins. At least, not when there's on 3 tables
> and 1 outer join involved. Am I missing something?

Without knowing what you want the query to do, it's difficult to say if
it's OK to rearrange the join order.

Frequently it's OK to rearrange the join order even without risking any
change in the query results, but PG doesn't currently have any smarts
about that, so you have to do it manually by changing the query.

            regards, tom lane

Re: full outer join performance

From
Tom Lane
Date:
Scott Marlowe <smarlowe@g2switchworks.com> writes:
> Tom, would that help the planner make better choices for this kind of
> query?

I think the problem was that he had

    select ... from a, b full join c on ... where ...

where table b is big and you only need a few rows from it, so it really
needs to be joined last, but the above forced doing it first.  It wasn't
clear to me why he wanted the full join at all (in fact, you could see
from the plan that the planner had been able to reduce it to a left join
because there were WHERE clauses that'd discard one set of null-extended
rows anyway).  Without knowing that, it's hard to say whether there's
another way to get what he wants.

            regards, tom lane

Re: full outer join performance

From
Scott Marlowe
Date:
On Tue, 2005-09-13 at 13:28, Ben wrote:
> Hrm, as I understand that page, there's not much that can be done about
> the join order for outer joins. At least, not when there's on 3 tables
> and 1 outer join involved. Am I missing something?

You might be able to do some kind of thing like:

select * from (
    select * from table1 <whereclausehere>) as a
    ) left join (
    select * from table2 <whereclausehere) as b
on     (a.somefield=b.somefield2)
where    <anotherwhereclause)

or something.

Tom, would that help the planner make better choices for this kind of
query?

Re: full outer join performance

From
Ben
Date:
I see. I think I'm going to restructure my logic so I won't have to use
the outer join after all, but thanks for the pointers anyway.

Tom Lane wrote:

>Ben <bench@silentmedia.com> writes:
>
>
>>Hrm, as I understand that page, there's not much that can be done about
>>the join order for outer joins. At least, not when there's on 3 tables
>>and 1 outer join involved. Am I missing something?
>>
>>
>
>Without knowing what you want the query to do, it's difficult to say if
>it's OK to rearrange the join order.
>
>Frequently it's OK to rearrange the join order even without risking any
>change in the query results, but PG doesn't currently have any smarts
>about that, so you have to do it manually by changing the query.
>
>            regards, tom lane
>
>


Re: full outer join performance

From
Ben
Date:
Tom Lane wrote:

>I think the problem was that he had
>
>    select ... from a, b full join c on ... where ...
>
>where table b is big and you only need a few rows from it, so it really
>needs to be joined last, but the above forced doing it first.  It wasn't
>clear to me why he wanted the full join at all (in fact, you could see
>from the plan that the planner had been able to reduce it to a left join
>because there were WHERE clauses that'd discard one set of null-extended
>rows anyway).  Without knowing that, it's hard to say whether there's
>another way to get what he wants.
>
>
>
Without the outer join, the output was missing the joined rows from a
and b when there was no matching row in c. (Tables a and c both
referenced b.)

I suppose I could have used a left join instead of a full join, but
either way the results were much slower than an inner join, and I was
able to redo some other logic to let an inner join work just fine.