Thread: Outer join query plans and performance

Outer join query plans and performance

From
Rich Doughty
Date:
I tried on pgsql-general but got no reply. re-posting here as it's
probably the best place to ask

I'm having some significant performance problems with left join. Can
anyone give me any pointers as to why the following 2 query plans are so
different?


EXPLAIN SELECT *
FROM
     tokens.ta_tokens      t  LEFT JOIN
     tokens.ta_tokenhist   h1 ON t.token_id = h1.token_id LEFT JOIN
     tokens.ta_tokenhist   h2 ON t.token_id = h2.token_id
WHERE
     h1.histdate = 'now';


  Nested Loop Left Join  (cost=0.00..68778.43 rows=2215 width=1402)
    ->  Nested Loop  (cost=0.00..55505.62 rows=2215 width=714)
          ->  Index Scan using idx_tokenhist__histdate on ta_tokenhist h1  (cost=0.00..22970.70 rows=5752 width=688)
                Index Cond: (histdate = '2005-10-24 13:28:38.411844'::timestamp without time zone)
          ->  Index Scan using ta_tokens_pkey on ta_tokens t  (cost=0.00..5.64 rows=1 width=26)
                Index Cond: ((t.token_id)::integer = ("outer".token_id)::integer)
    ->  Index Scan using fkx_tokenhist__tokens on ta_tokenhist h2  (cost=0.00..5.98 rows=1 width=688)
          Index Cond: (("outer".token_id)::integer = (h2.token_id)::integer)


Performance is fine for this one and the plan is pretty much as i'd
expect.

This is where i hit a problem.


EXPLAIN SELECT *
FROM
     tokens.ta_tokens      t  LEFT JOIN
     tokens.ta_tokenhist   h1 ON t.token_id = h1.token_id LEFT JOIN
     tokens.ta_tokenhist   h2 ON t.token_id = h2.token_id
WHERE
     h2.histdate = 'now';


  Hash Join  (cost=1249148.59..9000709.22 rows=2215 width=1402)
    Hash Cond: (("outer".token_id)::integer = ("inner".token_id)::integer)
    ->  Hash Left Join  (cost=1225660.51..8181263.40 rows=4045106 width=714)
          Hash Cond: (("outer".token_id)::integer = ("inner".token_id)::integer)
          ->  Seq Scan on ta_tokens t  (cost=0.00..71828.06 rows=4045106 width=26)
          ->  Hash  (cost=281243.21..281243.21 rows=10504921 width=688)
                ->  Seq Scan on ta_tokenhist h1  (cost=0.00..281243.21 rows=10504921 width=688)
    ->  Hash  (cost=22970.70..22970.70 rows=5752 width=688)
          ->  Index Scan using idx_tokenhist__histdate on ta_tokenhist h2  (cost=0.00..22970.70 rows=5752 width=688)
                Index Cond: (histdate = '2005-10-24 13:34:51.371905'::timestamp without time zone)


I would understand if h2 was joined on h1, but it isn't. It only joins
on t. can anyone give any tips on improving the performance of the second
query (aside from changing the join order manually)?


select version();
                                                    version
--------------------------------------------------------------------------------------------------------------
  PostgreSQL 8.0.3 on i486-pc-linux-gnu, compiled by GCC cc (GCC) 4.0.2 20050821 (prerelease) (Debian 4.0.1-6)


Thanks

--

   - Rich Doughty

Re: Outer join query plans and performance

From
Tom Lane
Date:
Rich Doughty <rich@opusvl.com> writes:
> EXPLAIN SELECT *
> FROM
>      tokens.ta_tokens      t  LEFT JOIN
>      tokens.ta_tokenhist   h1 ON t.token_id = h1.token_id LEFT JOIN
>      tokens.ta_tokenhist   h2 ON t.token_id = h2.token_id
> WHERE
>      h1.histdate = 'now';

> EXPLAIN SELECT *
> FROM
>      tokens.ta_tokens      t  LEFT JOIN
>      tokens.ta_tokenhist   h1 ON t.token_id = h1.token_id LEFT JOIN
>      tokens.ta_tokenhist   h2 ON t.token_id = h2.token_id
> WHERE
>      h2.histdate = 'now';

The reason these are different is that the second case constrains only
the last-to-be-joined table, so the full cartesian product of t and h1
has to be formed.  If this wasn't what you had in mind, you might be
able to rearrange the order of the LEFT JOINs, but bear in mind that
in general, changing outer-join ordering changes the results.  (This
is why the planner won't fix it for you.)

            regards, tom lane

Re: Outer join query plans and performance

From
Rich Doughty
Date:
Tom Lane wrote:
> Rich Doughty <rich@opusvl.com> writes:
>
>>EXPLAIN SELECT *
>>FROM
>>     tokens.ta_tokens      t  LEFT JOIN
>>     tokens.ta_tokenhist   h1 ON t.token_id = h1.token_id LEFT JOIN
>>     tokens.ta_tokenhist   h2 ON t.token_id = h2.token_id
>>WHERE
>>     h1.histdate = 'now';
>
>
>>EXPLAIN SELECT *
>>FROM
>>     tokens.ta_tokens      t  LEFT JOIN
>>     tokens.ta_tokenhist   h1 ON t.token_id = h1.token_id LEFT JOIN
>>     tokens.ta_tokenhist   h2 ON t.token_id = h2.token_id
>>WHERE
>>     h2.histdate = 'now';
>
>
> The reason these are different is that the second case constrains only
> the last-to-be-joined table, so the full cartesian product of t and h1
> has to be formed.  If this wasn't what you had in mind, you might be
> able to rearrange the order of the LEFT JOINs, but bear in mind that
> in general, changing outer-join ordering changes the results.  (This
> is why the planner won't fix it for you.)

FWIW mysql 4.1 (and i'm no fan at all of mysql) completes both these queries
in approximately 3 seconds. postgres does the first in 6 seconds and the
second in a lot longer (eventually abandoned).


--

   - Rich Doughty

Re: Outer join query plans and performance

From
Tom Lane
Date:
Rich Doughty <rich@opusvl.com> writes:
> Tom Lane wrote:
>> The reason these are different is that the second case constrains only
>> the last-to-be-joined table, so the full cartesian product of t and h1
>> has to be formed.  If this wasn't what you had in mind, you might be
>> able to rearrange the order of the LEFT JOINs, but bear in mind that
>> in general, changing outer-join ordering changes the results.  (This
>> is why the planner won't fix it for you.)

> FWIW mysql 4.1 (and i'm no fan at all of mysql) completes both these queries
> in approximately 3 seconds.

Does mysql get the correct answer, though?  It's hard to see how they do
this fast unless they (a) are playing fast and loose with the semantics,
or (b) have very substantially more analysis logic for OUTER JOIN semantics
than we do.  Perhaps mysql 5.x is better about this sort of thing, but
for 4.x I'd definitely find theory (a) more plausible than (b).

The cases that would be interesting are those where rearranging the
outer join order actually does change the correct answer --- it may not
in this particular case, I haven't thought hard about it.  It seems
fairly likely to me that they are rearranging the join order here, and
I'm just wondering whether they have the logic needed to verify that
such a transformation is correct.

            regards, tom lane

Re: Outer join query plans and performance

From
"Kevin Grittner"
Date:
In this particular case both outer joins are to the same table, and
the where clause is applied to one or the other, so it's pretty easy
to prove that they should generate identical results.  I'll grant that
this is not generally very useful; but then, simple test cases often
don't look very useful.

We've had mixed results with PostgreSQL and queries with
multiple outer joins when the WHERE clause limits the results
based on columns from the optional tables.  In at least one case
which performs very well, we have enough tables to cause the
"genetic" optimizer to kick in.  (So I suppose there is a chance
that sometimes it won't perform well, although we haven't seen
that happen yet.)

I can't speak to MySQL, but both Sybase and MaxDB handled
such cases accurately, and chose a plan with very fast
execution.  Sybase, however, spent 5 to 10 seconds in the
optimizer finding the sub-second plan.

-Kevin


>>> Tom Lane <tgl@sss.pgh.pa.us>  >>>
Rich Doughty <rich@opusvl.com> writes:
> Tom Lane wrote:
>> The reason these are different is that the second case constrains
only
>> the last-to-be-joined table, so the full cartesian product of t and
h1
>> has to be formed.  If this wasn't what you had in mind, you might be
>> able to rearrange the order of the LEFT JOINs, but bear in mind that
>> in general, changing outer-join ordering changes the results.  (This
>> is why the planner won't fix it for you.)

> FWIW mysql 4.1 (and i'm no fan at all of mysql) completes both these
queries
> in approximately 3 seconds.

Does mysql get the correct answer, though?  It's hard to see how they do
this fast unless they (a) are playing fast and loose with the semantics,
or (b) have very substantially more analysis logic for OUTER JOIN
semantics
than we do.  Perhaps mysql 5.x is better about this sort of thing, but
for 4.x I'd definitely find theory (a) more plausible than (b).

The cases that would be interesting are those where rearranging the
outer join order actually does change the correct answer --- it may not
in this particular case, I haven't thought hard about it.  It seems
fairly likely to me that they are rearranging the join order here, and
I'm just wondering whether they have the logic needed to verify that
such a transformation is correct.

            regards, tom lane


Re: Outer join query plans and performance

From
Rich Doughty
Date:
Tom Lane wrote:
> Rich Doughty <rich@opusvl.com> writes:
>
>>Tom Lane wrote:
>>
>>>The reason these are different is that the second case constrains only
>>>the last-to-be-joined table, so the full cartesian product of t and h1
>>>has to be formed.  If this wasn't what you had in mind, you might be
>>>able to rearrange the order of the LEFT JOINs, but bear in mind that
>>>in general, changing outer-join ordering changes the results.  (This
>>>is why the planner won't fix it for you.)
>
>
>>FWIW mysql 4.1 (and i'm no fan at all of mysql) completes both these queries
>>in approximately 3 seconds.
>
>
> Does mysql get the correct answer, though?  It's hard to see how they do
> this fast unless they (a) are playing fast and loose with the semantics,
> or (b) have very substantially more analysis logic for OUTER JOIN semantics
> than we do.  Perhaps mysql 5.x is better about this sort of thing, but
> for 4.x I'd definitely find theory (a) more plausible than (b).

i would assume so. i'll re-run my testcase later and verify the results of the
two side-by-side.

> The cases that would be interesting are those where rearranging the
> outer join order actually does change the correct answer --- it may not
> in this particular case, I haven't thought hard about it.  It seems
> fairly likely to me that they are rearranging the join order here, and
> I'm just wondering whether they have the logic needed to verify that
> such a transformation is correct.
>
>             regards, tom lane
>


--

   - Rich Doughty