Thread: Improving our clauseless-join heuristics

Improving our clauseless-join heuristics

From
Tom Lane
Date:
I looked into the behavior complained of here:
http://archives.postgresql.org/pgsql-performance/2012-04/msg00093.php

The problem query can be abstracted to
select * from a, b, c, dwhere a.x = b.y and (a.z = c.c or a.z = d.d)

Table a is much larger than the others (in fact, in the given example
c and d are known to be single rows), and there are indexes on the
mentioned columns of a.  In this situation, the best plan is to
cross-join c and d, then use a BitmapOr indexscan to pick out the rows
of a that satisfy the OR condition, and finally join that small number
of rows to b.  The planner will use a cross-join-first plan if we omit
b and the first WHERE clause from the query; but in the query as given,
it fails to discover that plan and falls back on a vastly inferior plan
that involves forming the a/b join first.

The reason for this behavior is the anti-clauseless-join heuristics in
join_search_one_level().  Without b, there are no join clauses available
at join level 2, so the planner is forced to form all three 2-way cross
joins; and then at level 3 it finds out that joining a to c/d works
well.  With b, we find the a/b join has a usable join clause so we form
that join, and then we decide not to make any 2-way clauseless joins.
So the c/d join is never constructed and there is no way to exploit the
desirable indexscan at higher levels.

After some reflection I think that the blame should be pinned on
have_relevant_joinclause(), which is essentially defined as "is there
any join clause that can be evaluated at the join of these two
relations?".  I think it would work better to define it as "is there any
join clause that both these relations participate in?".  In the majority
of real-world queries, join clauses relate exactly two relations, so
that these two definitions are equivalent.  However, when we do have
join clauses involving 3 or more relations, such as the OR clause in
this example, it's evidently useful to consider cross-product joins of
the smaller relations so that the join clause can be applied during the
scan of the largest table.

It would probably not be a good idea to back-patch such a change, since
it might have consequences I can't foresee at the moment.  But I'm
strongly tempted to squeeze it into 9.2.  Thoughts?
        regards, tom lane


Re: Improving our clauseless-join heuristics

From
Robert Haas
Date:
On Fri, Apr 13, 2012 at 10:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I looked into the behavior complained of here:
> http://archives.postgresql.org/pgsql-performance/2012-04/msg00093.php
>
> The problem query can be abstracted to
>
>        select * from a, b, c, d
>        where a.x = b.y and (a.z = c.c or a.z = d.d)
>
> Table a is much larger than the others (in fact, in the given example
> c and d are known to be single rows), and there are indexes on the
> mentioned columns of a.  In this situation, the best plan is to
> cross-join c and d, then use a BitmapOr indexscan to pick out the rows
> of a that satisfy the OR condition, and finally join that small number
> of rows to b.  The planner will use a cross-join-first plan if we omit
> b and the first WHERE clause from the query; but in the query as given,
> it fails to discover that plan and falls back on a vastly inferior plan
> that involves forming the a/b join first.
>
> The reason for this behavior is the anti-clauseless-join heuristics in
> join_search_one_level().  Without b, there are no join clauses available
> at join level 2, so the planner is forced to form all three 2-way cross
> joins; and then at level 3 it finds out that joining a to c/d works
> well.  With b, we find the a/b join has a usable join clause so we form
> that join, and then we decide not to make any 2-way clauseless joins.
> So the c/d join is never constructed and there is no way to exploit the
> desirable indexscan at higher levels.
>
> After some reflection I think that the blame should be pinned on
> have_relevant_joinclause(), which is essentially defined as "is there
> any join clause that can be evaluated at the join of these two
> relations?".  I think it would work better to define it as "is there any
> join clause that both these relations participate in?".  In the majority
> of real-world queries, join clauses relate exactly two relations, so
> that these two definitions are equivalent.  However, when we do have
> join clauses involving 3 or more relations, such as the OR clause in
> this example, it's evidently useful to consider cross-product joins of
> the smaller relations so that the join clause can be applied during the
> scan of the largest table.
>
> It would probably not be a good idea to back-patch such a change, since
> it might have consequences I can't foresee at the moment.  But I'm
> strongly tempted to squeeze it into 9.2.  Thoughts?

I think it's getting a little late in the day to be whacking the
planner around too much, but I have to admit that seems like a pretty
good and safe change to me, so maybe we should go ahead and do it.
I'm a bit worried, though, that with all the planner changes this
release we are going to spend a lot of time tracking down regressions
either in planning time or in plan quality.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Improving our clauseless-join heuristics

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Apr 13, 2012 at 10:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> After some reflection I think that the blame should be pinned on
>> have_relevant_joinclause(), which is essentially defined as "is there
>> any join clause that can be evaluated at the join of these two
>> relations?".  I think it would work better to define it as "is there any
>> join clause that both these relations participate in?".

> I think it's getting a little late in the day to be whacking the
> planner around too much, but I have to admit that seems like a pretty
> good and safe change to me, so maybe we should go ahead and do it.
> I'm a bit worried, though, that with all the planner changes this
> release we are going to spend a lot of time tracking down regressions
> either in planning time or in plan quality.

Could be.  I think though that this fits in pretty naturally with the
parameterized-path changes, since both of them are really directed
towards being able to apply inner indexscans in cases where we could
not before.
        regards, tom lane


Re: Improving our clauseless-join heuristics

From
Stephen Frost
Date:
Robert,

* Robert Haas (robertmhaas@gmail.com) wrote:
> I'm a bit worried, though, that with all the planner changes this
> release we are going to spend a lot of time tracking down regressions
> either in planning time or in plan quality.

I dunno, I agree that there will likely be more regressions, but I
also think people might be happier with more changes in release 'X' than
some changes in X and then some more in Y, both of which end up
destablizing their environment, particularly if we advertise that 'X'
has lots of changes and that people should test it thoroughly..

My feeling is that this would be good to include in 9.2, but I could be
swayed either way- I certainly agree with the sentiment that it's
getting to a pretty late date.
Thanks,
    Stephen

Re: Improving our clauseless-join heuristics

From
Robert Haas
Date:
On Fri, Apr 13, 2012 at 10:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Apr 13, 2012 at 10:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> After some reflection I think that the blame should be pinned on
>>> have_relevant_joinclause(), which is essentially defined as "is there
>>> any join clause that can be evaluated at the join of these two
>>> relations?".  I think it would work better to define it as "is there any
>>> join clause that both these relations participate in?".
>
>> I think it's getting a little late in the day to be whacking the
>> planner around too much, but I have to admit that seems like a pretty
>> good and safe change to me, so maybe we should go ahead and do it.
>> I'm a bit worried, though, that with all the planner changes this
>> release we are going to spend a lot of time tracking down regressions
>> either in planning time or in plan quality.
>
> Could be.  I think though that this fits in pretty naturally with the
> parameterized-path changes, since both of them are really directed
> towards being able to apply inner indexscans in cases where we could
> not before.

Yeah, I had the same thought.  I am more worried that either the
changes we made for index-only scans or the parameterized paths stuff
is going to turn out to introduce nasty, as-yet-unforeseen
regressions.  But at this point we're in for that pain whatever we do
about this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Improving our clauseless-join heuristics

From
Amit Kapila
Date:
I want to clarify small doubt in this regard. 
In function make_rels_by_clause_joins(..), it tries to join the given
relation old_rel with other relations if there exist a join between them. 
What I can understand is, it is because if there exists a join condition its
better to join with that relation.
However if the given relation old_rel is not able to join any relation, then
why can't it try to make cross-join with other relations there itself.

-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Friday, April 13, 2012 8:02 PM
To: pgsql-hackers@postgreSQL.org
Subject: [HACKERS] Improving our clauseless-join heuristics

I looked into the behavior complained of here:
http://archives.postgresql.org/pgsql-performance/2012-04/msg00093.php

The problem query can be abstracted to
select * from a, b, c, dwhere a.x = b.y and (a.z = c.c or a.z = d.d)

Table a is much larger than the others (in fact, in the given example
c and d are known to be single rows), and there are indexes on the
mentioned columns of a.  In this situation, the best plan is to
cross-join c and d, then use a BitmapOr indexscan to pick out the rows
of a that satisfy the OR condition, and finally join that small number
of rows to b.  The planner will use a cross-join-first plan if we omit
b and the first WHERE clause from the query; but in the query as given,
it fails to discover that plan and falls back on a vastly inferior plan
that involves forming the a/b join first.

The reason for this behavior is the anti-clauseless-join heuristics in
join_search_one_level().  Without b, there are no join clauses available
at join level 2, so the planner is forced to form all three 2-way cross
joins; and then at level 3 it finds out that joining a to c/d works
well.  With b, we find the a/b join has a usable join clause so we form
that join, and then we decide not to make any 2-way clauseless joins.
So the c/d join is never constructed and there is no way to exploit the
desirable indexscan at higher levels.

After some reflection I think that the blame should be pinned on
have_relevant_joinclause(), which is essentially defined as "is there
any join clause that can be evaluated at the join of these two
relations?".  I think it would work better to define it as "is there any
join clause that both these relations participate in?".  In the majority
of real-world queries, join clauses relate exactly two relations, so
that these two definitions are equivalent.  However, when we do have
join clauses involving 3 or more relations, such as the OR clause in
this example, it's evidently useful to consider cross-product joins of
the smaller relations so that the join clause can be applied during the
scan of the largest table.

It would probably not be a good idea to back-patch such a change, since
it might have consequences I can't foresee at the moment.  But I'm
strongly tempted to squeeze it into 9.2.  Thoughts?
        regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers



Re: Improving our clauseless-join heuristics

From
Tom Lane
Date:
Amit Kapila <amit.kapila@huawei.com> writes:
> I want to clarify small doubt in this regard. 
> In function make_rels_by_clause_joins(..), it tries to join the given
> relation old_rel with other relations if there exist a join between them. 
> What I can understand is, it is because if there exists a join condition its
> better to join with that relation.
> However if the given relation old_rel is not able to join any relation, then
> why can't it try to make cross-join with other relations there itself.

Hm?  That case is handled by make_rels_by_clauseless_joins.  I suppose
we could refactor to combine those two functions, but what's the point?
If that's not what your question is about, then I don't understand.
        regards, tom lane


Re: Improving our clauseless-join heuristics

From
Amit Kapila
Date:
>> That case is handled by make_rels_by_clauseless_joins

It will be handled by make_rels_by_clauseless_joins() if given rel old_rel
doesn't have any join clause.
However if it has join clause but doesn't able to join with any other rels
like in the example you have provided for relation c, it is not able to join
with other rel d. 
In such cases it can do cross-join with d, because it has not found any
relation to join with.
Doesn't it will address the problem you mentioned?

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us] 
Sent: Monday, April 16, 2012 9:10 AM
To: Amit Kapila
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Improving our clauseless-join heuristics 

Amit Kapila <amit.kapila@huawei.com> writes:
> I want to clarify small doubt in this regard. 
> In function make_rels_by_clause_joins(..), it tries to join the given
> relation old_rel with other relations if there exist a join between them. 
> What I can understand is, it is because if there exists a join condition
its
> better to join with that relation.
> However if the given relation old_rel is not able to join any relation,
then
> why can't it try to make cross-join with other relations there itself.

Hm?  That case is handled by make_rels_by_clauseless_joins.  I suppose
we could refactor to combine those two functions, but what's the point?
If that's not what your question is about, then I don't understand.
        regards, tom lane



Re: Improving our clauseless-join heuristics

From
Tom Lane
Date:
Amit Kapila <amit.kapila@huawei.com> writes:
> That case is handled by make_rels_by_clauseless_joins
> It will be handled by make_rels_by_clauseless_joins() if given rel old_rel
> doesn't have any join clause.
> However if it has join clause but doesn't able to join with any other rels
> like in the example you have provided for relation c, it is not able to join
> with other rel d. 
> In such cases it can do cross-join with d, because it has not found any
> relation to join with.
> Doesn't it will address the problem you mentioned?

Sounds to me like that's going in the wrong direction, ie, joining to
exactly the wrong relations.  If you have to cross-join it's better to
cross-join against relations that are included in one of the join
clauses that does mention the current relation.

Another way to look at this is that if we have
select ... from a,b,c,d where a.x = b.y + c.z

we want to consider a cross-join of b and c, in the hopes that we can do
something useful with the join clause at the next level where it can
join to a.  From b's perspective there is no percentage in joining to d.
On the other hand, when we come to consider d, it will have no join
clauses so we will consider joining it to each other rel in turn.  So
if there is any value in joining d early, we will find that out.

Or at least that's the theory.  Now that I look at it this way, I think
there is a bug here: when we are at level 2, we only consider later rels
in the list for forced clauseless joins.  That would be okay if the
condition were symmetrical, but it isn't.  This makes for a bogus
FROM-list ordering dependency in handling of clauseless joins.  Not
sure how much that matters in the real world, but it still seems wrong.
        regards, tom lane


Re: Improving our clauseless-join heuristics

From
Amit Kapila
Date:
>>Another way to look at this is that if we have

>>    select ... from a,b,c,d where a.x = b.y + c.z

>>we want to consider a cross-join of b and c, in the hopes that we can do
>>something useful with the join clause at the next level where it can
>>join to a.  From b's perspective there is no percentage in joining to d.

For this kind of query, currently (referring 9.0.3 code) also it considers
join of b,c and b,d. 
As there is no join clause between b,c,d so it will go in path of
make_rels_by_clauseless_joins() where it considers join of b,c and b,d.

In this kind of query, if the suggestion by me in below mail is followed,
then it will consider joining a,b a,c a,d at level-2 in function
make_rels_by_clause_joins() which it currently doesn't do which may generate
useless join paths.
However in real-world scenario's this kind of queries where 2 cols of
different tables are
used in one side expression (b.y + c.z) of where clause may be less.

>>On the other hand, when we come to consider d, it will have no join
>>clauses so we will consider joining it to each other rel in turn.  

When it come to consider d, as at level -2 it only consider later rels. So
it should not consider joining with each other rel.

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us] 
Sent: Monday, April 16, 2012 9:57 AM
To: Amit Kapila
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Improving our clauseless-join heuristics 

Amit Kapila <amit.kapila@huawei.com> writes:
> That case is handled by make_rels_by_clauseless_joins
> It will be handled by make_rels_by_clauseless_joins() if given rel old_rel
> doesn't have any join clause.
> However if it has join clause but doesn't able to join with any other rels
> like in the example you have provided for relation c, it is not able to
join
> with other rel d. 
> In such cases it can do cross-join with d, because it has not found any
> relation to join with.
> Doesn't it will address the problem you mentioned?

Sounds to me like that's going in the wrong direction, ie, joining to
exactly the wrong relations.  If you have to cross-join it's better to
cross-join against relations that are included in one of the join
clauses that does mention the current relation.

Another way to look at this is that if we have
select ... from a,b,c,d where a.x = b.y + c.z

we want to consider a cross-join of b and c, in the hopes that we can do
something useful with the join clause at the next level where it can
join to a.  From b's perspective there is no percentage in joining to d.
On the other hand, when we come to consider d, it will have no join
clauses so we will consider joining it to each other rel in turn.  So
if there is any value in joining d early, we will find that out.

Or at least that's the theory.  Now that I look at it this way, I think
there is a bug here: when we are at level 2, we only consider later rels
in the list for forced clauseless joins.  That would be okay if the
condition were symmetrical, but it isn't.  This makes for a bogus
FROM-list ordering dependency in handling of clauseless joins.  Not
sure how much that matters in the real world, but it still seems wrong.
        regards, tom lane



Re: Improving our clauseless-join heuristics

From
Tom Lane
Date:
Amit Kapila <amit.kapila@huawei.com> writes:
> For this kind of query, currently (referring 9.0.3 code) also it considers
> join of b,c and b,d. 
> As there is no join clause between b,c,d so it will go in path of
> make_rels_by_clauseless_joins() where it considers join of b,c and b,d.

> In this kind of query, if the suggestion by me in below mail is followed,
> then it will consider joining a,b a,c a,d at level-2 in function
> make_rels_by_clause_joins() which it currently doesn't do which may generate
> useless join paths.
> However in real-world scenario's this kind of queries where 2 cols of
> different tables are
> used in one side expression (b.y + c.z) of where clause may be less.

> On the other hand, when we come to consider d, it will have no join
> clauses so we will consider joining it to each other rel in turn.  

> When it come to consider d, as at level -2 it only consider later rels. So
> it should not consider joining with each other rel.

I might still be misunderstanding, but I think what you are suggesting
is that in the loop in make_rels_by_clause_joins, if we find that the
old_rel doesn't have a join clause/restriction with the current
other_rel, we check to see whether other_rel has any join clauses at
all, and force the join to occur anyway if it doesn't.

I can't really get excited about doing it that way instead of the
current way.  In the first place, it seems to miss the need to
clauseless-join two relations when neither of them have any join
clauses, for instance plain old "SELECT * FROM a, b". So you still need
something like the make_rels_by_clauseless_joins code path, and it's
not entirely clear how to avoid duplicated work there.  In the second
place, instead of N tests to see whether a rel lacks any join clauses,
we'd now have something approaching O(N^2) such tests, in the typical
case where most base rels are directly joined to only a few other rels.
So it seems to make things slower for little obvious benefit.

In general, queries with join-clause-less rels are pretty uncommon,
so I don't want to introduce extra work into make_rels_by_clause_joins
to handle the case.
        regards, tom lane


Re: Improving our clauseless-join heuristics

From
Amit Kapila
Date:
>>I might still be misunderstanding, but I think what you are suggesting
>>is that in the loop in make_rels_by_clause_joins, if we find that the
>>old_rel doesn't have a join clause/restriction with the current
>>other_rel, we check to see whether other_rel has any join clauses at
>>all, and force the join to occur anyway if it doesn't.

It is on similar lines, but the only difference is that it will try to join 
old_rel with other_rel list incase 
old_rel is not able to join with any of other_rel in the list with proper
join clause between them. 

>>In the second place, instead of N tests to see whether a rel lacks any
join clauses,
>>we'd now have something approaching O(N^2) such tests, in the typical
>>case where most base rels are directly joined to only a few other rels.

In the typical case where most base rels are directly joined to only a few
other rels, it will not go in the path where it has to try new combinations,
as the idea is that if in first pass the old_rel is not able find any
suitable rel to join then only it will go in second pass to try to join with
other_rel list.

>>In the first place, it seems to miss the need to
>>clauseless-join two relations when neither of them have any join
>>clauses, for instance plain old "SELECT * FROM a, b". So you still need
>>something like the make_rels_by_clauseless_joins code path, and it's
>>not entirely clear how to avoid duplicated work there.

I believe this will not be related to clause-less joins because we have
entered the function make_rels_by_clause_joins() for relation which has join
clause and which can be joined to one of the other_rel. So in most cases I
believe it will find one of the members in other_rel list with a proper join
clause. Only in some cases when it is not able to find any of the other_rel
with proper join clause between them, it will go in new path. So in a way if
we see this case handling is different from handling clause-less join case.

The other way like you said "is there any join clause that both these
relations participate in?"

Does this mean that we will evaluate the join list to see if there can be
any valid join between 2 relations. 
Is it possible that in all scenarios or cases we will be able to find the
existence of any valid join which 
can be possible may be not at this level but at next level.

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us] 
Sent: Tuesday, April 17, 2012 2:05 AM
To: Amit Kapila
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Improving our clauseless-join heuristics 

Amit Kapila <amit.kapila@huawei.com> writes:
> For this kind of query, currently (referring 9.0.3 code) also it considers
> join of b,c and b,d. 
> As there is no join clause between b,c,d so it will go in path of
> make_rels_by_clauseless_joins() where it considers join of b,c and b,d.

> In this kind of query, if the suggestion by me in below mail is followed,
> then it will consider joining a,b a,c a,d at level-2 in function
> make_rels_by_clause_joins() which it currently doesn't do which may
generate
> useless join paths.
> However in real-world scenario's this kind of queries where 2 cols of
> different tables are
> used in one side expression (b.y + c.z) of where clause may be less.

> On the other hand, when we come to consider d, it will have no join
> clauses so we will consider joining it to each other rel in turn.  

> When it come to consider d, as at level -2 it only consider later rels. So
> it should not consider joining with each other rel.

I might still be misunderstanding, but I think what you are suggesting
is that in the loop in make_rels_by_clause_joins, if we find that the
old_rel doesn't have a join clause/restriction with the current
other_rel, we check to see whether other_rel has any join clauses at
all, and force the join to occur anyway if it doesn't.

I can't really get excited about doing it that way instead of the
current way.  In the first place, it seems to miss the need to
clauseless-join two relations when neither of them have any join
clauses, for instance plain old "SELECT * FROM a, b". So you still need
something like the make_rels_by_clauseless_joins code path, and it's
not entirely clear how to avoid duplicated work there.  In the second
place, instead of N tests to see whether a rel lacks any join clauses,
we'd now have something approaching O(N^2) such tests, in the typical
case where most base rels are directly joined to only a few other rels.
So it seems to make things slower for little obvious benefit.

In general, queries with join-clause-less rels are pretty uncommon,
so I don't want to introduce extra work into make_rels_by_clause_joins
to handle the case.
        regards, tom lane



Re: Improving our clauseless-join heuristics

From
Tom Lane
Date:
Amit Kapila <amit.kapila@huawei.com> writes:
>> I might still be misunderstanding, but I think what you are suggesting
>> is that in the loop in make_rels_by_clause_joins, if we find that the
>> old_rel doesn't have a join clause/restriction with the current
>> other_rel, we check to see whether other_rel has any join clauses at
>> all, and force the join to occur anyway if it doesn't.

> It is on similar lines, but the only difference is that it will try to join 
> old_rel with other_rel list incase 
> old_rel is not able to join with any of other_rel in the list with proper
> join clause between them. 

I'm afraid I'm still not following you very well.  Perhaps you could
submit a proposed patch?
        regards, tom lane


Re: Improving our clauseless-join heuristics

From
Amit Kapila
Date:
>>I'm afraid I'm still not following you very well.  Perhaps you could
>>submit a proposed patch?

Before that can you please explain in little more detail (if possible with
small example) about the idea you have told in original mail : "is there any
join clause that both these relations participate in?"

I wanted to know the detail about the idea you told to see if what I am
proposing has any merit as compare to your idea.


-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us] 
Sent: Wednesday, April 18, 2012 10:36 AM
To: Amit Kapila
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Improving our clauseless-join heuristics 

Amit Kapila <amit.kapila@huawei.com> writes:
>> I might still be misunderstanding, but I think what you are suggesting
>> is that in the loop in make_rels_by_clause_joins, if we find that the
>> old_rel doesn't have a join clause/restriction with the current
>> other_rel, we check to see whether other_rel has any join clauses at
>> all, and force the join to occur anyway if it doesn't.

> It is on similar lines, but the only difference is that it will try to
join 
> old_rel with other_rel list incase 
> old_rel is not able to join with any of other_rel in the list with proper
> join clause between them. 

I'm afraid I'm still not following you very well.  Perhaps you could
submit a proposed patch?
        regards, tom lane



Re: Improving our clauseless-join heuristics

From
Tom Lane
Date:
Amit Kapila <amit.kapila@huawei.com> writes:
>> I'm afraid I'm still not following you very well.  Perhaps you could
>> submit a proposed patch?

> Before that can you please explain in little more detail (if possible with
> small example) about the idea you have told in original mail : "is there any
> join clause that both these relations participate in?"

Um ... wasn't that well enough explained already?

I think there are basically two cases.  You can have a join clause that
is immediately useful for joining two relations, say
select ... from a,b where a.x = b.y;

This is "immediate" in the sense that you can apply it when joining a
to b, regardless of any other relations involved in the query.

Or you can have a case like
select ... from a,b,c where (a.x + b.y) = c.z;

This clause is not immediately useful for joining any two of the three
relations in the query.  It will be useful when we get to level 3,
particularly so if we chose to join a and b first and there's an index
on c.z.  But we would have had to accept doing a cartesian join of a and
b to arrive at that situation.  In this example, we have no alternative
except to do some cartesian join at level 2 --- but as soon as we add
some more tables and join clauses to the example, we could get
distracted from the possibility that a cartesian join of a and b might
be a good idea.

Given that make_rels_by_joins doesn't (and shouldn't IMO) have any
detailed understanding of the semantics of particular join clauses,
I would not expect it to realize that joining a to b is the most likely
option out of the three possible clauseless joins that are available
at level 2 in this query.  It's going to have to generate all 3, and
then costing at the next level will figure out what's best to do.

However, I think it *does* need to understand that clauses relating
3 or more relations can work like this.  In the code as it stood before
last week, it would actively reject joining a to b if there were any
additional relations in the query.  That's just not right.
        regards, tom lane


Re: Improving our clauseless-join heuristics

From
Amit Kapila
Date:
>> Um ... wasn't that well enough explained already?

Yes, it was well explained and I understood also, but what I wanted to
understand the solution with which you have resolved the problem.

The way I am telling was as below code. 
With this extra paths will get generated, but it will as well consider for
joining c and d in query:
select * from a, b, c, d where a.x = b.y and (a.z = c.c or a.z = d.d)


static void 
make_rels_by_clause_joins(PlannerInfo *root,                                                  RelOptInfo *old_rel,
                                           ListCell *other_rels) 
 
{        ListCell   *l; ++       bool        bIsold_relJointoanyother_rel = false; 
       for_each_cell(l, other_rels)        {                RelOptInfo *other_rel = (RelOptInfo *) lfirst(l); 
               if (!bms_overlap(old_rel->relids, other_rel->relids) &&
(have_relevant_joinclause(root,old_rel, other_rel)
 
||                         have_join_order_restriction(root, old_rel,
other_rel)))                { ++                    bIsold_relJointoanyother_rel = true;                        (void)
make_join_rel(root,old_rel, other_rel);                }        } ++       /*if old_rel is not able to join with any
otherrel than try
 
joining it ++       with other_rels which has join clause.*/ ++        if(bIsold_relJointoanyother_rel == false) ++
 { 
 
++                     for_each_cell(l, other_rels) ++                     { ++                          RelOptInfo
*other_rel= (RelOptInfo *)
 
lfirst(l); 
++                          if (!bms_overlap(old_rel->relids,
other_rel->relids) && ++                              (has_join_restriction(root,
other_rel)||other_rel->joininfo != NIL)) ++                          { 
++                                     (void) make_join_rel(root, old_rel,
other_rel); ++                          } ++                     } ++       }
}
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us] 
Sent: Wednesday, April 18, 2012 11:59 AM
To: Amit Kapila
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Improving our clauseless-join heuristics 

Amit Kapila <amit.kapila@huawei.com> writes:
>> I'm afraid I'm still not following you very well.  Perhaps you could
>> submit a proposed patch?

> Before that can you please explain in little more detail (if possible with
> small example) about the idea you have told in original mail : "is there
any
> join clause that both these relations participate in?"

Um ... wasn't that well enough explained already?

I think there are basically two cases.  You can have a join clause that
is immediately useful for joining two relations, say
select ... from a,b where a.x = b.y;

This is "immediate" in the sense that you can apply it when joining a
to b, regardless of any other relations involved in the query.

Or you can have a case like
select ... from a,b,c where (a.x + b.y) = c.z;

This clause is not immediately useful for joining any two of the three
relations in the query.  It will be useful when we get to level 3,
particularly so if we chose to join a and b first and there's an index
on c.z.  But we would have had to accept doing a cartesian join of a and
b to arrive at that situation.  In this example, we have no alternative
except to do some cartesian join at level 2 --- but as soon as we add
some more tables and join clauses to the example, we could get
distracted from the possibility that a cartesian join of a and b might
be a good idea.

Given that make_rels_by_joins doesn't (and shouldn't IMO) have any
detailed understanding of the semantics of particular join clauses,
I would not expect it to realize that joining a to b is the most likely
option out of the three possible clauseless joins that are available
at level 2 in this query.  It's going to have to generate all 3, and
then costing at the next level will figure out what's best to do.

However, I think it *does* need to understand that clauses relating
3 or more relations can work like this.  In the code as it stood before
last week, it would actively reject joining a to b if there were any
additional relations in the query.  That's just not right.
        regards, tom lane



Re: Improving our clauseless-join heuristics

From
Tom Lane
Date:
Amit Kapila <amit.kapila@huawei.com> writes:
> The way I am telling was as below code. 
> With this extra paths will get generated, but it will as well consider for
> joining c and d in query:
> select * from a, b, c, d where a.x = b.y and (a.z = c.c or a.z = d.d)

I think this would just be dead code as of HEAD.  With the recent
changes in the definition of have_relevant_joinclause, if we get into
make_rels_by_clause_joins there should always be at least one other
relation that the old_rel is considered to join to.  In any case,
I don't see the point of adding more logic to make_rels_by_clause_joins
unless it allows us to take out as much or more work elsewhere.
        regards, tom lane