Thread: self join revisited

From:
Rikard Pavelic
Date:

How hard would it be to teach planer to optimize self join?

While this query which demonstrates it is not that common

SELECT count(*)
FROM
    big_table a
    INNER JOIN big_table b ON a.id = b.id;

This type of query (self joining large table) is very common
(at least in our environment because of heavy usage of views).

It would be great if Postgres could rewrite this query

SELECT bt1.id, bt1.total, sq.id, sq.total
FROM
    big_table bt1
    INNER JOIN small_table st1 on st1.big_id = bt1.id
    INNER JOIN
    (
        SELECT bt2.id, st2.total
        FROM
            big_table bt2
            INNER JOIN small_table st2 on st2.big_id = bt2.id
        WHERE
            st2.total > 100
    ) sq ON sq.id = bt1.id
WHERE
    st1.total<200

like this

SELECT bt1.id, bt1.total, bt1.id, st2.total
FROM
    big_table bt1
    INNER JOIN small_table st1 on st1.big_id = bt1.id
    INNER JOIN small_table st2 on st2.big_id = bt1.id AND st2.total > 100
WHERE
    st1.total<200

Regards,
Rikard

From:
Matthew Wakeling
Date:

On Wed, 1 Apr 2009, Rikard Pavelic wrote:
> It would be great if Postgres could rewrite this query
>
> SELECT bt1.id, bt1.total, sq.id, sq.total
> FROM
>     big_table bt1
>     INNER JOIN small_table st1 on st1.big_id = bt1.id
>     INNER JOIN
>     (
>         SELECT bt2.id, st2.total
>         FROM
>             big_table bt2
>             INNER JOIN small_table st2 on st2.big_id = bt2.id
>         WHERE
>             st2.total > 100
>     ) sq ON sq.id = bt1.id
> WHERE
>     st1.total<200
>
> like this
>
> SELECT bt1.id, bt1.total, bt1.id, st2.total
> FROM
>     big_table bt1
>     INNER JOIN small_table st1 on st1.big_id = bt1.id
>     INNER JOIN small_table st2 on st2.big_id = bt1.id AND st2.total > 100
> WHERE
>     st1.total<200

Those queries are only equivalent if big_table.id is unique. However, even
so some benefit could be gained from a self-join algorithm. For instance,
if given some rather evil cleverness, it could be adapted to calculate
overlaps very quickly.

However, a self-join is very similar to a merge join, and the benefit over
a standard merge join would be small.

Matthew

--
"We did a risk management review.  We concluded that there was no risk
 of any management."        -- Hugo Mills <>

From:
Tom Lane
Date:

Rikard Pavelic <> writes:
> It would be great if Postgres could rewrite this query

AFAICS those queries are not going to produce the same results,
so Postgres would be 100% incorrect to rewrite it like that for you.

(If they do produce the same results, it would depend on a bunch
of assumptions you have not stated.)

            regards, tom lane

From:
Rikard Pavelic
Date:

Tom Lane wrote:
> Rikard Pavelic <> writes:
>
>> It would be great if Postgres could rewrite this query
>>
>
> AFAICS those queries are not going to produce the same results,
> so Postgres would be 100% incorrect to rewrite it like that for you.
>
> (If they do produce the same results, it would depend on a bunch
> of assumptions you have not stated.)
>
>             regards, tom lane
>

Can I try again? :)

How hard would it be to teach the planner about preserving uniqueness of
relations in subqueries?
And using that information to remove unnecessary self joins on unique sets?

I can try to rewrite some queries to test it on real data for how much
gain it would provide.

Regards,
Rikard

From:
Robert Haas
Date:

> Can I try again? :)
>
> How hard would it be to teach the planner about preserving uniqueness of
> relations in subqueries?
> And using that information to remove unnecessary self joins on unique sets?
>
> I can try to rewrite some queries to test it on real data for how much
> gain it would provide.

I think join removal is a swell idea.  In fact, I went so far as to
post a patch to implement it.  :-)

http://archives.postgresql.org/message-id/

It's a slightly different problem, because I'm looking at removing
left joins that provably don't change the output set due to a
sufficiently strong uniqueness contraint on the nullable side of the
join, and you're looking at removing self-joins that provably don't
change the output set, which I believe requires establishing that all
the columns of some unique index are constrained to be equal between
the two copies of the table.  But the two problems are very similar in
that you need an efficient way to assess whether there's an adequate
unique constraint, and some of the infrastructure could probably be
shared.

The problem from a coding perspective seems to be how and where to do
the test for unique-ness.  In either the left-join case or the
self-join case, you need to verify that one of the relations involved
has a unique index whose column list is equal to or a superset of the
available merge-joinable clauses (or perhaps hash-joinable clauses). I
ended up putting the logic in sort_inner_and_outer(), but that's
making the decision to drop the join fairly late in the game.  It
would be nice to make it earlier, before we start the dynamic
programming algorithm, especially for self-join removal, where
throwing away the join after it's been constructed involves moving the
quals around.

create_unique_path() also does some interesting stuff in this area,
but I haven't figured out how much of that might be applicable to join
removal.

...Robert