Thread: Improving our clauseless-join heuristics
I looked into the behavior complained of here: http://archives.postgresql.org/pgsql-performance/2012-04/msg00093.php The problem query can be abstracted to select * from a, b, c, dwhere a.x = b.y and (a.z = c.c or a.z = d.d) Table a is much larger than the others (in fact, in the given example c and d are known to be single rows), and there are indexes on the mentioned columns of a. In this situation, the best plan is to cross-join c and d, then use a BitmapOr indexscan to pick out the rows of a that satisfy the OR condition, and finally join that small number of rows to b. The planner will use a cross-join-first plan if we omit b and the first WHERE clause from the query; but in the query as given, it fails to discover that plan and falls back on a vastly inferior plan that involves forming the a/b join first. The reason for this behavior is the anti-clauseless-join heuristics in join_search_one_level(). Without b, there are no join clauses available at join level 2, so the planner is forced to form all three 2-way cross joins; and then at level 3 it finds out that joining a to c/d works well. With b, we find the a/b join has a usable join clause so we form that join, and then we decide not to make any 2-way clauseless joins. So the c/d join is never constructed and there is no way to exploit the desirable indexscan at higher levels. After some reflection I think that the blame should be pinned on have_relevant_joinclause(), which is essentially defined as "is there any join clause that can be evaluated at the join of these two relations?". I think it would work better to define it as "is there any join clause that both these relations participate in?". In the majority of real-world queries, join clauses relate exactly two relations, so that these two definitions are equivalent. However, when we do have join clauses involving 3 or more relations, such as the OR clause in this example, it's evidently useful to consider cross-product joins of the smaller relations so that the join clause can be applied during the scan of the largest table. It would probably not be a good idea to back-patch such a change, since it might have consequences I can't foresee at the moment. But I'm strongly tempted to squeeze it into 9.2. Thoughts? regards, tom lane
On Fri, Apr 13, 2012 at 10:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I looked into the behavior complained of here: > http://archives.postgresql.org/pgsql-performance/2012-04/msg00093.php > > The problem query can be abstracted to > > select * from a, b, c, d > where a.x = b.y and (a.z = c.c or a.z = d.d) > > Table a is much larger than the others (in fact, in the given example > c and d are known to be single rows), and there are indexes on the > mentioned columns of a. In this situation, the best plan is to > cross-join c and d, then use a BitmapOr indexscan to pick out the rows > of a that satisfy the OR condition, and finally join that small number > of rows to b. The planner will use a cross-join-first plan if we omit > b and the first WHERE clause from the query; but in the query as given, > it fails to discover that plan and falls back on a vastly inferior plan > that involves forming the a/b join first. > > The reason for this behavior is the anti-clauseless-join heuristics in > join_search_one_level(). Without b, there are no join clauses available > at join level 2, so the planner is forced to form all three 2-way cross > joins; and then at level 3 it finds out that joining a to c/d works > well. With b, we find the a/b join has a usable join clause so we form > that join, and then we decide not to make any 2-way clauseless joins. > So the c/d join is never constructed and there is no way to exploit the > desirable indexscan at higher levels. > > After some reflection I think that the blame should be pinned on > have_relevant_joinclause(), which is essentially defined as "is there > any join clause that can be evaluated at the join of these two > relations?". I think it would work better to define it as "is there any > join clause that both these relations participate in?". In the majority > of real-world queries, join clauses relate exactly two relations, so > that these two definitions are equivalent. However, when we do have > join clauses involving 3 or more relations, such as the OR clause in > this example, it's evidently useful to consider cross-product joins of > the smaller relations so that the join clause can be applied during the > scan of the largest table. > > It would probably not be a good idea to back-patch such a change, since > it might have consequences I can't foresee at the moment. But I'm > strongly tempted to squeeze it into 9.2. Thoughts? I think it's getting a little late in the day to be whacking the planner around too much, but I have to admit that seems like a pretty good and safe change to me, so maybe we should go ahead and do it. I'm a bit worried, though, that with all the planner changes this release we are going to spend a lot of time tracking down regressions either in planning time or in plan quality. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Apr 13, 2012 at 10:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> After some reflection I think that the blame should be pinned on >> have_relevant_joinclause(), which is essentially defined as "is there >> any join clause that can be evaluated at the join of these two >> relations?". I think it would work better to define it as "is there any >> join clause that both these relations participate in?". > I think it's getting a little late in the day to be whacking the > planner around too much, but I have to admit that seems like a pretty > good and safe change to me, so maybe we should go ahead and do it. > I'm a bit worried, though, that with all the planner changes this > release we are going to spend a lot of time tracking down regressions > either in planning time or in plan quality. Could be. I think though that this fits in pretty naturally with the parameterized-path changes, since both of them are really directed towards being able to apply inner indexscans in cases where we could not before. regards, tom lane
Robert, * Robert Haas (robertmhaas@gmail.com) wrote: > I'm a bit worried, though, that with all the planner changes this > release we are going to spend a lot of time tracking down regressions > either in planning time or in plan quality. I dunno, I agree that there will likely be more regressions, but I also think people might be happier with more changes in release 'X' than some changes in X and then some more in Y, both of which end up destablizing their environment, particularly if we advertise that 'X' has lots of changes and that people should test it thoroughly.. My feeling is that this would be good to include in 9.2, but I could be swayed either way- I certainly agree with the sentiment that it's getting to a pretty late date. Thanks, Stephen
On Fri, Apr 13, 2012 at 10:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> On Fri, Apr 13, 2012 at 10:32 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> After some reflection I think that the blame should be pinned on >>> have_relevant_joinclause(), which is essentially defined as "is there >>> any join clause that can be evaluated at the join of these two >>> relations?". I think it would work better to define it as "is there any >>> join clause that both these relations participate in?". > >> I think it's getting a little late in the day to be whacking the >> planner around too much, but I have to admit that seems like a pretty >> good and safe change to me, so maybe we should go ahead and do it. >> I'm a bit worried, though, that with all the planner changes this >> release we are going to spend a lot of time tracking down regressions >> either in planning time or in plan quality. > > Could be. I think though that this fits in pretty naturally with the > parameterized-path changes, since both of them are really directed > towards being able to apply inner indexscans in cases where we could > not before. Yeah, I had the same thought. I am more worried that either the changes we made for index-only scans or the parameterized paths stuff is going to turn out to introduce nasty, as-yet-unforeseen regressions. But at this point we're in for that pain whatever we do about this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I want to clarify small doubt in this regard. In function make_rels_by_clause_joins(..), it tries to join the given relation old_rel with other relations if there exist a join between them. What I can understand is, it is because if there exists a join condition its better to join with that relation. However if the given relation old_rel is not able to join any relation, then why can't it try to make cross-join with other relations there itself. -----Original Message----- From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane Sent: Friday, April 13, 2012 8:02 PM To: pgsql-hackers@postgreSQL.org Subject: [HACKERS] Improving our clauseless-join heuristics I looked into the behavior complained of here: http://archives.postgresql.org/pgsql-performance/2012-04/msg00093.php The problem query can be abstracted to select * from a, b, c, dwhere a.x = b.y and (a.z = c.c or a.z = d.d) Table a is much larger than the others (in fact, in the given example c and d are known to be single rows), and there are indexes on the mentioned columns of a. In this situation, the best plan is to cross-join c and d, then use a BitmapOr indexscan to pick out the rows of a that satisfy the OR condition, and finally join that small number of rows to b. The planner will use a cross-join-first plan if we omit b and the first WHERE clause from the query; but in the query as given, it fails to discover that plan and falls back on a vastly inferior plan that involves forming the a/b join first. The reason for this behavior is the anti-clauseless-join heuristics in join_search_one_level(). Without b, there are no join clauses available at join level 2, so the planner is forced to form all three 2-way cross joins; and then at level 3 it finds out that joining a to c/d works well. With b, we find the a/b join has a usable join clause so we form that join, and then we decide not to make any 2-way clauseless joins. So the c/d join is never constructed and there is no way to exploit the desirable indexscan at higher levels. After some reflection I think that the blame should be pinned on have_relevant_joinclause(), which is essentially defined as "is there any join clause that can be evaluated at the join of these two relations?". I think it would work better to define it as "is there any join clause that both these relations participate in?". In the majority of real-world queries, join clauses relate exactly two relations, so that these two definitions are equivalent. However, when we do have join clauses involving 3 or more relations, such as the OR clause in this example, it's evidently useful to consider cross-product joins of the smaller relations so that the join clause can be applied during the scan of the largest table. It would probably not be a good idea to back-patch such a change, since it might have consequences I can't foresee at the moment. But I'm strongly tempted to squeeze it into 9.2. Thoughts? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Amit Kapila <amit.kapila@huawei.com> writes: > I want to clarify small doubt in this regard. > In function make_rels_by_clause_joins(..), it tries to join the given > relation old_rel with other relations if there exist a join between them. > What I can understand is, it is because if there exists a join condition its > better to join with that relation. > However if the given relation old_rel is not able to join any relation, then > why can't it try to make cross-join with other relations there itself. Hm? That case is handled by make_rels_by_clauseless_joins. I suppose we could refactor to combine those two functions, but what's the point? If that's not what your question is about, then I don't understand. regards, tom lane
>> That case is handled by make_rels_by_clauseless_joins It will be handled by make_rels_by_clauseless_joins() if given rel old_rel doesn't have any join clause. However if it has join clause but doesn't able to join with any other rels like in the example you have provided for relation c, it is not able to join with other rel d. In such cases it can do cross-join with d, because it has not found any relation to join with. Doesn't it will address the problem you mentioned? -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Monday, April 16, 2012 9:10 AM To: Amit Kapila Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Improving our clauseless-join heuristics Amit Kapila <amit.kapila@huawei.com> writes: > I want to clarify small doubt in this regard. > In function make_rels_by_clause_joins(..), it tries to join the given > relation old_rel with other relations if there exist a join between them. > What I can understand is, it is because if there exists a join condition its > better to join with that relation. > However if the given relation old_rel is not able to join any relation, then > why can't it try to make cross-join with other relations there itself. Hm? That case is handled by make_rels_by_clauseless_joins. I suppose we could refactor to combine those two functions, but what's the point? If that's not what your question is about, then I don't understand. regards, tom lane
Amit Kapila <amit.kapila@huawei.com> writes: > That case is handled by make_rels_by_clauseless_joins > It will be handled by make_rels_by_clauseless_joins() if given rel old_rel > doesn't have any join clause. > However if it has join clause but doesn't able to join with any other rels > like in the example you have provided for relation c, it is not able to join > with other rel d. > In such cases it can do cross-join with d, because it has not found any > relation to join with. > Doesn't it will address the problem you mentioned? Sounds to me like that's going in the wrong direction, ie, joining to exactly the wrong relations. If you have to cross-join it's better to cross-join against relations that are included in one of the join clauses that does mention the current relation. Another way to look at this is that if we have select ... from a,b,c,d where a.x = b.y + c.z we want to consider a cross-join of b and c, in the hopes that we can do something useful with the join clause at the next level where it can join to a. From b's perspective there is no percentage in joining to d. On the other hand, when we come to consider d, it will have no join clauses so we will consider joining it to each other rel in turn. So if there is any value in joining d early, we will find that out. Or at least that's the theory. Now that I look at it this way, I think there is a bug here: when we are at level 2, we only consider later rels in the list for forced clauseless joins. That would be okay if the condition were symmetrical, but it isn't. This makes for a bogus FROM-list ordering dependency in handling of clauseless joins. Not sure how much that matters in the real world, but it still seems wrong. regards, tom lane
>>Another way to look at this is that if we have >> select ... from a,b,c,d where a.x = b.y + c.z >>we want to consider a cross-join of b and c, in the hopes that we can do >>something useful with the join clause at the next level where it can >>join to a. From b's perspective there is no percentage in joining to d. For this kind of query, currently (referring 9.0.3 code) also it considers join of b,c and b,d. As there is no join clause between b,c,d so it will go in path of make_rels_by_clauseless_joins() where it considers join of b,c and b,d. In this kind of query, if the suggestion by me in below mail is followed, then it will consider joining a,b a,c a,d at level-2 in function make_rels_by_clause_joins() which it currently doesn't do which may generate useless join paths. However in real-world scenario's this kind of queries where 2 cols of different tables are used in one side expression (b.y + c.z) of where clause may be less. >>On the other hand, when we come to consider d, it will have no join >>clauses so we will consider joining it to each other rel in turn. When it come to consider d, as at level -2 it only consider later rels. So it should not consider joining with each other rel. -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Monday, April 16, 2012 9:57 AM To: Amit Kapila Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Improving our clauseless-join heuristics Amit Kapila <amit.kapila@huawei.com> writes: > That case is handled by make_rels_by_clauseless_joins > It will be handled by make_rels_by_clauseless_joins() if given rel old_rel > doesn't have any join clause. > However if it has join clause but doesn't able to join with any other rels > like in the example you have provided for relation c, it is not able to join > with other rel d. > In such cases it can do cross-join with d, because it has not found any > relation to join with. > Doesn't it will address the problem you mentioned? Sounds to me like that's going in the wrong direction, ie, joining to exactly the wrong relations. If you have to cross-join it's better to cross-join against relations that are included in one of the join clauses that does mention the current relation. Another way to look at this is that if we have select ... from a,b,c,d where a.x = b.y + c.z we want to consider a cross-join of b and c, in the hopes that we can do something useful with the join clause at the next level where it can join to a. From b's perspective there is no percentage in joining to d. On the other hand, when we come to consider d, it will have no join clauses so we will consider joining it to each other rel in turn. So if there is any value in joining d early, we will find that out. Or at least that's the theory. Now that I look at it this way, I think there is a bug here: when we are at level 2, we only consider later rels in the list for forced clauseless joins. That would be okay if the condition were symmetrical, but it isn't. This makes for a bogus FROM-list ordering dependency in handling of clauseless joins. Not sure how much that matters in the real world, but it still seems wrong. regards, tom lane
Amit Kapila <amit.kapila@huawei.com> writes: > For this kind of query, currently (referring 9.0.3 code) also it considers > join of b,c and b,d. > As there is no join clause between b,c,d so it will go in path of > make_rels_by_clauseless_joins() where it considers join of b,c and b,d. > In this kind of query, if the suggestion by me in below mail is followed, > then it will consider joining a,b a,c a,d at level-2 in function > make_rels_by_clause_joins() which it currently doesn't do which may generate > useless join paths. > However in real-world scenario's this kind of queries where 2 cols of > different tables are > used in one side expression (b.y + c.z) of where clause may be less. > On the other hand, when we come to consider d, it will have no join > clauses so we will consider joining it to each other rel in turn. > When it come to consider d, as at level -2 it only consider later rels. So > it should not consider joining with each other rel. I might still be misunderstanding, but I think what you are suggesting is that in the loop in make_rels_by_clause_joins, if we find that the old_rel doesn't have a join clause/restriction with the current other_rel, we check to see whether other_rel has any join clauses at all, and force the join to occur anyway if it doesn't. I can't really get excited about doing it that way instead of the current way. In the first place, it seems to miss the need to clauseless-join two relations when neither of them have any join clauses, for instance plain old "SELECT * FROM a, b". So you still need something like the make_rels_by_clauseless_joins code path, and it's not entirely clear how to avoid duplicated work there. In the second place, instead of N tests to see whether a rel lacks any join clauses, we'd now have something approaching O(N^2) such tests, in the typical case where most base rels are directly joined to only a few other rels. So it seems to make things slower for little obvious benefit. In general, queries with join-clause-less rels are pretty uncommon, so I don't want to introduce extra work into make_rels_by_clause_joins to handle the case. regards, tom lane
>>I might still be misunderstanding, but I think what you are suggesting >>is that in the loop in make_rels_by_clause_joins, if we find that the >>old_rel doesn't have a join clause/restriction with the current >>other_rel, we check to see whether other_rel has any join clauses at >>all, and force the join to occur anyway if it doesn't. It is on similar lines, but the only difference is that it will try to join old_rel with other_rel list incase old_rel is not able to join with any of other_rel in the list with proper join clause between them. >>In the second place, instead of N tests to see whether a rel lacks any join clauses, >>we'd now have something approaching O(N^2) such tests, in the typical >>case where most base rels are directly joined to only a few other rels. In the typical case where most base rels are directly joined to only a few other rels, it will not go in the path where it has to try new combinations, as the idea is that if in first pass the old_rel is not able find any suitable rel to join then only it will go in second pass to try to join with other_rel list. >>In the first place, it seems to miss the need to >>clauseless-join two relations when neither of them have any join >>clauses, for instance plain old "SELECT * FROM a, b". So you still need >>something like the make_rels_by_clauseless_joins code path, and it's >>not entirely clear how to avoid duplicated work there. I believe this will not be related to clause-less joins because we have entered the function make_rels_by_clause_joins() for relation which has join clause and which can be joined to one of the other_rel. So in most cases I believe it will find one of the members in other_rel list with a proper join clause. Only in some cases when it is not able to find any of the other_rel with proper join clause between them, it will go in new path. So in a way if we see this case handling is different from handling clause-less join case. The other way like you said "is there any join clause that both these relations participate in?" Does this mean that we will evaluate the join list to see if there can be any valid join between 2 relations. Is it possible that in all scenarios or cases we will be able to find the existence of any valid join which can be possible may be not at this level but at next level. -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Tuesday, April 17, 2012 2:05 AM To: Amit Kapila Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Improving our clauseless-join heuristics Amit Kapila <amit.kapila@huawei.com> writes: > For this kind of query, currently (referring 9.0.3 code) also it considers > join of b,c and b,d. > As there is no join clause between b,c,d so it will go in path of > make_rels_by_clauseless_joins() where it considers join of b,c and b,d. > In this kind of query, if the suggestion by me in below mail is followed, > then it will consider joining a,b a,c a,d at level-2 in function > make_rels_by_clause_joins() which it currently doesn't do which may generate > useless join paths. > However in real-world scenario's this kind of queries where 2 cols of > different tables are > used in one side expression (b.y + c.z) of where clause may be less. > On the other hand, when we come to consider d, it will have no join > clauses so we will consider joining it to each other rel in turn. > When it come to consider d, as at level -2 it only consider later rels. So > it should not consider joining with each other rel. I might still be misunderstanding, but I think what you are suggesting is that in the loop in make_rels_by_clause_joins, if we find that the old_rel doesn't have a join clause/restriction with the current other_rel, we check to see whether other_rel has any join clauses at all, and force the join to occur anyway if it doesn't. I can't really get excited about doing it that way instead of the current way. In the first place, it seems to miss the need to clauseless-join two relations when neither of them have any join clauses, for instance plain old "SELECT * FROM a, b". So you still need something like the make_rels_by_clauseless_joins code path, and it's not entirely clear how to avoid duplicated work there. In the second place, instead of N tests to see whether a rel lacks any join clauses, we'd now have something approaching O(N^2) such tests, in the typical case where most base rels are directly joined to only a few other rels. So it seems to make things slower for little obvious benefit. In general, queries with join-clause-less rels are pretty uncommon, so I don't want to introduce extra work into make_rels_by_clause_joins to handle the case. regards, tom lane
Amit Kapila <amit.kapila@huawei.com> writes: >> I might still be misunderstanding, but I think what you are suggesting >> is that in the loop in make_rels_by_clause_joins, if we find that the >> old_rel doesn't have a join clause/restriction with the current >> other_rel, we check to see whether other_rel has any join clauses at >> all, and force the join to occur anyway if it doesn't. > It is on similar lines, but the only difference is that it will try to join > old_rel with other_rel list incase > old_rel is not able to join with any of other_rel in the list with proper > join clause between them. I'm afraid I'm still not following you very well. Perhaps you could submit a proposed patch? regards, tom lane
>>I'm afraid I'm still not following you very well. Perhaps you could >>submit a proposed patch? Before that can you please explain in little more detail (if possible with small example) about the idea you have told in original mail : "is there any join clause that both these relations participate in?" I wanted to know the detail about the idea you told to see if what I am proposing has any merit as compare to your idea. -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Wednesday, April 18, 2012 10:36 AM To: Amit Kapila Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Improving our clauseless-join heuristics Amit Kapila <amit.kapila@huawei.com> writes: >> I might still be misunderstanding, but I think what you are suggesting >> is that in the loop in make_rels_by_clause_joins, if we find that the >> old_rel doesn't have a join clause/restriction with the current >> other_rel, we check to see whether other_rel has any join clauses at >> all, and force the join to occur anyway if it doesn't. > It is on similar lines, but the only difference is that it will try to join > old_rel with other_rel list incase > old_rel is not able to join with any of other_rel in the list with proper > join clause between them. I'm afraid I'm still not following you very well. Perhaps you could submit a proposed patch? regards, tom lane
Amit Kapila <amit.kapila@huawei.com> writes: >> I'm afraid I'm still not following you very well. Perhaps you could >> submit a proposed patch? > Before that can you please explain in little more detail (if possible with > small example) about the idea you have told in original mail : "is there any > join clause that both these relations participate in?" Um ... wasn't that well enough explained already? I think there are basically two cases. You can have a join clause that is immediately useful for joining two relations, say select ... from a,b where a.x = b.y; This is "immediate" in the sense that you can apply it when joining a to b, regardless of any other relations involved in the query. Or you can have a case like select ... from a,b,c where (a.x + b.y) = c.z; This clause is not immediately useful for joining any two of the three relations in the query. It will be useful when we get to level 3, particularly so if we chose to join a and b first and there's an index on c.z. But we would have had to accept doing a cartesian join of a and b to arrive at that situation. In this example, we have no alternative except to do some cartesian join at level 2 --- but as soon as we add some more tables and join clauses to the example, we could get distracted from the possibility that a cartesian join of a and b might be a good idea. Given that make_rels_by_joins doesn't (and shouldn't IMO) have any detailed understanding of the semantics of particular join clauses, I would not expect it to realize that joining a to b is the most likely option out of the three possible clauseless joins that are available at level 2 in this query. It's going to have to generate all 3, and then costing at the next level will figure out what's best to do. However, I think it *does* need to understand that clauses relating 3 or more relations can work like this. In the code as it stood before last week, it would actively reject joining a to b if there were any additional relations in the query. That's just not right. regards, tom lane
>> Um ... wasn't that well enough explained already? Yes, it was well explained and I understood also, but what I wanted to understand the solution with which you have resolved the problem. The way I am telling was as below code. With this extra paths will get generated, but it will as well consider for joining c and d in query: select * from a, b, c, d where a.x = b.y and (a.z = c.c or a.z = d.d) static void make_rels_by_clause_joins(PlannerInfo *root, RelOptInfo *old_rel, ListCell *other_rels) { ListCell *l; ++ bool bIsold_relJointoanyother_rel = false; for_each_cell(l, other_rels) { RelOptInfo *other_rel = (RelOptInfo *) lfirst(l); if (!bms_overlap(old_rel->relids, other_rel->relids) && (have_relevant_joinclause(root,old_rel, other_rel) || have_join_order_restriction(root, old_rel, other_rel))) { ++ bIsold_relJointoanyother_rel = true; (void) make_join_rel(root,old_rel, other_rel); } } ++ /*if old_rel is not able to join with any otherrel than try joining it ++ with other_rels which has join clause.*/ ++ if(bIsold_relJointoanyother_rel == false) ++ { ++ for_each_cell(l, other_rels) ++ { ++ RelOptInfo *other_rel= (RelOptInfo *) lfirst(l); ++ if (!bms_overlap(old_rel->relids, other_rel->relids) && ++ (has_join_restriction(root, other_rel)||other_rel->joininfo != NIL)) ++ { ++ (void) make_join_rel(root, old_rel, other_rel); ++ } ++ } ++ } } -----Original Message----- From: Tom Lane [mailto:tgl@sss.pgh.pa.us] Sent: Wednesday, April 18, 2012 11:59 AM To: Amit Kapila Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Improving our clauseless-join heuristics Amit Kapila <amit.kapila@huawei.com> writes: >> I'm afraid I'm still not following you very well. Perhaps you could >> submit a proposed patch? > Before that can you please explain in little more detail (if possible with > small example) about the idea you have told in original mail : "is there any > join clause that both these relations participate in?" Um ... wasn't that well enough explained already? I think there are basically two cases. You can have a join clause that is immediately useful for joining two relations, say select ... from a,b where a.x = b.y; This is "immediate" in the sense that you can apply it when joining a to b, regardless of any other relations involved in the query. Or you can have a case like select ... from a,b,c where (a.x + b.y) = c.z; This clause is not immediately useful for joining any two of the three relations in the query. It will be useful when we get to level 3, particularly so if we chose to join a and b first and there's an index on c.z. But we would have had to accept doing a cartesian join of a and b to arrive at that situation. In this example, we have no alternative except to do some cartesian join at level 2 --- but as soon as we add some more tables and join clauses to the example, we could get distracted from the possibility that a cartesian join of a and b might be a good idea. Given that make_rels_by_joins doesn't (and shouldn't IMO) have any detailed understanding of the semantics of particular join clauses, I would not expect it to realize that joining a to b is the most likely option out of the three possible clauseless joins that are available at level 2 in this query. It's going to have to generate all 3, and then costing at the next level will figure out what's best to do. However, I think it *does* need to understand that clauses relating 3 or more relations can work like this. In the code as it stood before last week, it would actively reject joining a to b if there were any additional relations in the query. That's just not right. regards, tom lane
Amit Kapila <amit.kapila@huawei.com> writes: > The way I am telling was as below code. > With this extra paths will get generated, but it will as well consider for > joining c and d in query: > select * from a, b, c, d where a.x = b.y and (a.z = c.c or a.z = d.d) I think this would just be dead code as of HEAD. With the recent changes in the definition of have_relevant_joinclause, if we get into make_rels_by_clause_joins there should always be at least one other relation that the old_rel is considered to join to. In any case, I don't see the point of adding more logic to make_rels_by_clause_joins unless it allows us to take out as much or more work elsewhere. regards, tom lane