Thread: Limiting the number of parameterized indexpaths created
I looked into the complaint of unreasonable planner runtime in bug #7626,
http://archives.postgresql.org/pgsql-bugs/2012-10/msg00232.php

In the given example, the indexed relation "foo" has join clauses with 30 other relations.  The code that I added in commit 3b8968f25232ad09001bf35ab4cc59f5a501193e will try all 2^30 combinations of those rels as possible outer relations for a parameterized indexscan :-(.  So clearly, the idea that we can just try everything and not have some kind of heuristic restriction isn't going to work.

This particular example, and probably a lot of similar examples, does have a principled non-heuristic fix, which comes from the fact that we recognize all the join clauses as belonging to the same equivalence class.  Therefore, using more than one of them in an indexscan is useless.  (The code already does know that and discards the extra clauses, but that happens too far downstream to prevent the exponential search time here.)  The extra function eclass_already_used() added in the attached patch attacks the problem this way, and is sufficient to resolve the bug for the case presented.

However, we still have an issue for cases where the join clauses aren't equivalence clauses.  (For instance, if you just change all the "=" signs to "<" in Brian's example, it still takes forever.)  So we also need some heuristic rule to limit the number of cases considered.

I spent a good deal of time thinking about how the indexscan code might make use of the joinlist data structure (which prevents exponential planning time growth overall by limiting which joins we'll consider) to fix this; the idea being to not consider outer-relation sets that couldn't actually be presented for use because of joinlist restrictions.  But this looked messy, and probably expensive in itself.  Moreover it seemed like practical implementations of the idea would carry the restriction that we could never generate parameterized paths for joinlist subproblems, which would be a real shame.  And we'd be doing a lot of work for what is probably a very unusual corner case, given that I think eclass_already_used() will fix the problem in most real cases.

So what I'm proposing instead, which is implemented in the other half of the attached patch, is that we simply put an arbitrary limit on how many outer-relation sets we'll consider while generating parameterized indexscans.  As written, the patch limits the number of relid sets to 10 times the number of join clauses it's working with, which is small enough to keep the runtime of this test case to something reasonable.  I think that in practice this will still allow useful combination-outer-rel cases to be found.  I wonder though if anyone has a better idea?

			regards, tom lane

diff --git a/src/backend/optimizer/path/indxpath.c b/src/backend/optimizer/path/indxpath.c
index 444ec2a40e4ac2b0ed8435017149c7664869e34c..78aaca4f279d6ba580643c75d82d82bfc88fba38 100644
*** a/src/backend/optimizer/path/indxpath.c
--- b/src/backend/optimizer/path/indxpath.c
*************** static void consider_index_join_outer_re
*** 92,97 ****
--- 92,98 ----
  					   IndexClauseSet *eclauseset,
  					   List **bitindexpaths,
  					   List *indexjoinclauses,
+ 					   int considered_clauses,
  					   List **considered_relids);
  static void get_join_index_paths(PlannerInfo *root, RelOptInfo *rel,
  					 IndexOptInfo *index,
*************** static void get_join_index_paths(Planner
*** 101,106 ****
--- 102,109 ----
  					 List **bitindexpaths,
  					 Relids relids,
  					 List **considered_relids);
+ static bool eclass_already_used(EquivalenceClass *parent_ec, Relids oldrelids,
+ 					List *indexjoinclauses);
  static bool bms_equal_any(Relids relids, List *relids_list);
  static void get_index_paths(PlannerInfo *root, RelOptInfo *rel,
  				IndexOptInfo *index, IndexClauseSet *clauses,
*************** consider_index_join_clauses(PlannerInfo
*** 447,452 ****
--- 450,456 ----
  					   IndexClauseSet *eclauseset,
  					   List **bitindexpaths)
  {
+ 	int			considered_clauses = 0;
  	List	   *considered_relids = NIL;
  	int			indexcol;
  
*************** consider_index_join_clauses(PlannerInfo
*** 460,467 ****
  	 * filter (qpqual); which is where an available clause would end up being
  	 * applied if we omit it from the indexquals.
  	 *
! 	 * This looks expensive, but in practical cases there won't be very many
! 	 * distinct sets of outer rels to consider.
  	 *
  	 * For simplicity in selecting relevant clauses, we represent each set of
  	 * outer rels as a maximum set of clause_relids --- that is, the indexed
--- 464,474 ----
  	 * filter (qpqual); which is where an available clause would end up being
  	 * applied if we omit it from the indexquals.
  	 *
! 	 * This looks expensive, but in most practical cases there won't be very
! 	 * many distinct sets of outer rels to consider.  As a safety valve when
! 	 * that's not true, we use a heuristic: limit the number of outer rel sets
! 	 * considered to a multiple of the number of clauses considered.  (We'll
! 	 * always consider using each individual join clause, though.)
  	 *
  	 * For simplicity in selecting relevant clauses, we represent each set of
  	 * outer rels as a maximum set of clause_relids --- that is, the indexed
*************** consider_index_join_clauses(PlannerInfo
*** 471,486 ****
--- 478,497 ----
  	for (indexcol = 0; indexcol < index->ncolumns; indexcol++)
  	{
  		/* Consider each applicable simple join clause */
+ 		considered_clauses += list_length(jclauseset->indexclauses[indexcol]);
  		consider_index_join_outer_rels(root, rel, index,
  									   rclauseset, jclauseset, eclauseset,
  									   bitindexpaths,
  									   jclauseset->indexclauses[indexcol],
+ 									   considered_clauses,
  									   &considered_relids);
  		/* Consider each applicable eclass join clause */
+ 		considered_clauses += list_length(eclauseset->indexclauses[indexcol]);
  		consider_index_join_outer_rels(root, rel, index,
  									   rclauseset, jclauseset, eclauseset,
  									   bitindexpaths,
  									   eclauseset->indexclauses[indexcol],
+ 									   considered_clauses,
  									   &considered_relids);
  	}
  }
*************** consider_index_join_clauses(PlannerInfo
*** 494,499 ****
--- 505,511 ----
   * 'rel', 'index', 'rclauseset', 'jclauseset', 'eclauseset', and
   *		'bitindexpaths' as above
   * 'indexjoinclauses' is a list of RestrictInfos for join clauses
+  * considered_clauses is the total number of clauses considered so far
   * '*considered_relids' is a list of all relids sets already considered
   */
  static void
*************** consider_index_join_outer_rels(PlannerIn
*** 504,509 ****
--- 516,522 ----
  							   IndexClauseSet *eclauseset,
  							   List **bitindexpaths,
  							   List *indexjoinclauses,
+ 							   int considered_clauses,
  							   List **considered_relids)
  {
  	ListCell   *lc;
*************** consider_index_join_outer_rels(PlannerIn
*** 522,528 ****
  		/*
  		 * Generate the union of this clause's relids set with each
  		 * previously-tried set.  This ensures we try this clause along with
! 		 * every interesting subset of previous clauses.
  		 *
  		 * Note: get_join_index_paths adds entries to *considered_relids, but
  		 * it prepends them to the list, so that we won't visit new entries
--- 535,543 ----
  		/*
  		 * Generate the union of this clause's relids set with each
  		 * previously-tried set.  This ensures we try this clause along with
! 		 * every interesting subset of previous clauses.  However, to avoid
! 		 * exponential growth of planning time when there are many clauses,
! 		 * limit the number of relid sets accepted to 10 * considered_clauses.
  		 *
  		 * Note: get_join_index_paths adds entries to *considered_relids, but
  		 * it prepends them to the list, so that we won't visit new entries
*************** consider_index_join_outer_rels(PlannerIn
*** 543,548 ****
--- 558,584 ----
  			if (bms_subset_compare(clause_relids, oldrelids) != BMS_DIFFERENT)
  				continue;
  
+ 			/*
+ 			 * If this clause was derived from an equivalence class, the
+ 			 * clause list may contain other clauses derived from the same
+ 			 * eclass.  We should not consider that combining this clause with
+ 			 * one of those clauses generates a usefully different
+ 			 * parameterization; so skip if any clause derived from the same
+ 			 * eclass would already have been included by oldrelids.
+ 			 */
+ 			if (rinfo->parent_ec &&
+ 				eclass_already_used(rinfo->parent_ec, oldrelids,
+ 									indexjoinclauses))
+ 				continue;
+ 
+ 			/*
+ 			 * If the number of relid sets considered exceeds our heuristic
+ 			 * limit, stop considering combinations of clauses.  We'll still
+ 			 * consider the current clause alone, though (below this loop).
+ 			 */
+ 			if (list_length(*considered_relids) >= 10 * considered_clauses)
+ 				break;
+ 
  			/* OK, try the union set */
  			get_join_index_paths(root, rel, index,
  								 rclauseset, jclauseset, eclauseset,
*************** get_join_index_paths(PlannerInfo *root,
*** 648,653 ****
--- 684,711 ----
  }
  
  /*
+  * eclass_already_used
+  *		True if any join clause usable with oldrelids was generated from
+  *		the specified equivalence class.
+  */
+ static bool
+ eclass_already_used(EquivalenceClass *parent_ec, Relids oldrelids,
+ 					List *indexjoinclauses)
+ {
+ 	ListCell   *lc;
+ 
+ 	foreach(lc, indexjoinclauses)
+ 	{
+ 		RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);
+ 
+ 		if (rinfo->parent_ec == parent_ec &&
+ 			bms_is_subset(rinfo->clause_relids, oldrelids))
+ 			return true;
+ 	}
+ 	return false;
+ }
+ 
+ /*
   * bms_equal_any
   *		True if relids is bms_equal to any member of relids_list
   *
On 30 October 2012 21:57, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So what I'm proposing instead, which is implemented in the other half of
> the attached patch, is that we simply put an arbitrary limit on how many
> outer-relation sets we'll consider while generating parameterized
> indexscans.  As written, the patch limits the number of relid sets to 10
> times the number of join clauses it's working with, which is small
> enough to keep the runtime of this test case to something reasonable.
> I think that in practice this will still allow useful
> combination-outer-rel cases to be found.  I wonder though if anyone has
> a better idea?

That seems like a useful limit, but it is arbitrary and not controllable by the user.  An example like that is actually fairly common when you put back together a deep class hierarchy with one table per class.

I'm a little surprised the whole thing doesn't collapse anyway when using constants as shown, since aid = 2 AND aid = 3 will return no rows.  But I guess that's not really important.

We should be able to do a better job of recognising some other aspect/symmetry of this situation and then simplifying the problem from that.  Calculating all paths treats the problem as a complete unknown, and we must be able to do better than that.  If we look at the problem from the perspective of how we would handle it if one of the tables was very large, how would that change things?  Can we use a greedy algorithm starting with the largest table?

This is hand-waving, but it is useful to discuss these things.  I'd be happy if you could give a fuller explanation of this issue so we can all learn.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon Riggs <simon@2ndQuadrant.com> writes:
> We should be able to do a better job of recognising some other
> aspect/symmetry of this situation and then simplifying the problem
> from that.  Calculating all paths treats the problem as a complete
> unknown and we must be able to do better than that.  If we look at the
> problem from the perspective of how we would handle it if one of the
> tables was very large, how would that change things?  Can we use a
> greedy algorithm starting with largest table?

Doesn't seem particularly relevant.  The issue is which parameterized paths to generate for the current index of the current table.  It could possibly make sense to try harder if the current table is large than if it's small --- but no matter how large it is, we can't afford to enumerate all 2^N possibilities if we have join clauses linking to N other relations and N is more than a handful.

The reason this is a problem is that you might have indexes, or even individual join clauses, that require multiple other rels to be used effectively.  Suppose we have an index t1ab on t1(a, b) and a query

	WHERE t1.a = t2.x AND t1.b = t3.y

If t1 is large while t2 and t3 are both small, the best plan is to join t2 and t3 first (even if that's a cartesian join) and then use both t2.x and t3.y as parameters for an indexscan on t1ab.  The same observation can apply even for single index columns and single join clauses:

	WHERE t1.a = (t2.x + t3.y)

So the context where this is an issue is that we're considering the specific index t1ab and wondering what parameterized indexscan paths to create using it.  To find a good plan in these examples, we have to create a path that's parameterized by both t2 and t3.  It's not so hard to consider that possibility here, but what if there are 30 different join clauses (linking to 30 other rels) that could be used with the same index?  We can't afford to try all the combinations exhaustively, and even if we could, it's just about certain that most of them aren't worth our time.

Before 9.2, we didn't have this problem in this guise because we didn't generate parameterized paths bottom-up; instead, given that we had already decided to join t2 and t3 for some reason, we would look to see what indexpaths we could make for t1 using values from t2 and/or t3.  However, that approach had big problems of its own: it couldn't find plans that involved pushing parameters down through an intervening join.

Another point here is that no such path can get chosen unless the planner decides to consider joining t2 and t3 to begin with --- and if that's a cartesian join, it will reject the idea out of hand.  In the previous coding there was basically no realistic fix for that; if we let the thing try useless cartesian joins, planning time on large queries will go to hell in a handbasket.  With the new code, we could in principle make note of interesting parameterized paths that require multiple outer relations, and then allow those specific combinations of relations to get cartesian-joined.  I've not had time to pursue this yet but I think it should happen sometime.

So I think bottom-up creation of parameterized paths is better than what we were doing pre-9.2, but we do have to stop it from going crazy when there are too many join clauses for the same index.  Now the $64 question is how hard do we have to try to find the "best" alternative when there are too many.  I think queries in which there are a lot of join clauses touching the same index but they are not related by equivalence classes are probably pretty pathological.  It's questionable whether we should spend a lot of skull sweat on such cases, or for that matter that we'd even recognize an especially good solution if we saw it.

One idea that occurred to me while typing this up is to restrict the maximum number of other-rels that we'll consider together, ie, when we have 30 other rels, don't try to form all 2^30 subsets but only those containing at most K rels, for some pretty small K (say, the number of index columns).  But I'm not really sure it's worth the effort.

			regards, tom lane
On 31 October 2012 19:44, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Before 9.2, we didn't have this problem in this guise because we didn't
> generate parameterized paths bottom-up; instead, given that we had
> already decided to join t2 and t3 for some reason, we would look to see
> what indexpaths we could make for t1 using values from t2 and/or t3.
> However, that approach had big problems of its own: it couldn't find
> plans that involved pushing parameters down through an intervening join.

Given that most join patterns are repeated consistently across multiple SQL statements, is it possible to pre-calculate paths at ANALYZE time, or perhaps cache them (persistently?) when first generated?  That way we just look up the best paths, which would make join planning much faster.

-- 
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 30, 2012 at 5:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I looked into the complaint of unreasonable planner runtime in bug #7626,
> http://archives.postgresql.org/pgsql-bugs/2012-10/msg00232.php
>
> In the given example, the indexed relation "foo" has join clauses with
> 30 other relations.  The code that I added in commit
> 3b8968f25232ad09001bf35ab4cc59f5a501193e will try all 2^30 combinations
> of those rels as possible outer relations for a parameterized indexscan
> :-(.  So clearly, the idea that we can just try everything and not have
> some kind of heuristic restriction isn't going to work.

You know, when I read this, my first thought was ... why is this an exponential relationship instead of a linear one?  Even now, I'm not sure I quite understand that.  With a parameterized path, we get an index scan (or index-only scan) with a.id taking its value from some outer scan, but it can't take its value from more than one outer scan.  Can it?  So what does it mean to parameterize the scan of foo by both ag1 (aid) and ag2 (aid)?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Oct 30, 2012 at 5:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I looked into the complaint of unreasonable planner runtime in bug #7626,
>> http://archives.postgresql.org/pgsql-bugs/2012-10/msg00232.php

> You know, when I read this, my first thought was ... why is this an
> exponential relationship instead of a linear one?

Because it's considering *combinations* of outer relations for a parameterized scan.  For instance consider an index on t(a,b) and a query

	WHERE t.a = x.c1 AND t.b = y.c2

There are three different parameterized paths we could create: one relying on x only, one relying on y only, one relying on both.  The one relying on y only is probably going to suck, if this is a btree index, but at the level we're working at here that's not yet apparent.  The other two are definitely both worthy of consideration, since it might or might not be worth it to join x and y first in order to use both conditions in scanning t.

So in general, given join clauses that reference N different outer relations, you could have as many as 2^N - 1 sets of outer relations that could possibly generate usefully-different parameterized paths.  In practice, since all these clauses must be usable with the same index, there's probably not nearly that many useful combinations --- but again, it's hard to know exactly which ones are winners in advance of doing any cost calculations.

			regards, tom lane
On Mon, Nov 5, 2012 at 2:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Oct 30, 2012 at 5:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> I looked into the complaint of unreasonable planner runtime in bug #7626,
>>> http://archives.postgresql.org/pgsql-bugs/2012-10/msg00232.php
>
>> You know, when I read this, my first thought was ... why is this an
>> exponential relationship instead of a linear one?
>
> Because it's considering *combinations* of outer relations for a
> parameterized scan.  For instance consider an index on t(a,b)
> and a query
> 	WHERE t.a = x.c1 AND t.b = y.c2
> There are three different parameterized paths we could create: one
> relying on x only, one relying on y only, one relying on both.

Sure, but that example is different from the test case provided in the bug report.  I agree that here we need to try paths parameterized by a, b, or both a and b.  Things might blow up multiplicatively, because we have join clauses referencing both t.a and t.b.  But they shouldn't blow up exponentially, because each of t.a and t.b can only be parameterized by ONE thing (I think).

And in the example in the bug report, only one column of the table (foo.id) is mentioned.  foo.id can be driven by ag1.aid OR ag2.aid OR ag3.aid OR ..., but not more than one of those at a time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Nov 5, 2012 at 2:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> There are three different parameterized paths we could create: one
>> relying on x only, one relying on y only, one relying on both.

> Sure, but that example is different from the test case provided in the
> bug report.  I agree that here we need to try paths parameterized by
> a, b, or both a and b.  Things might blow up multiplicatively, because
> we have join clauses referencing both t.a and t.b.  But they shouldn't
> blow up exponentially, because each of t.a and t.b can only be
> parameterized by ONE thing (I think).

Um, no.  This is a useful counterexample:

	WHERE t.a > x.c1 AND t.a < y.c2

With a range constraint like this one, it's possible for the doubly-parameterized path to be quite useful while either singly-parameterized path is basically useless.  And these examples aren't even going into cases you might get with non-btree indexes, where clauses could interact in much more complicated ways.

> And in the example in the bug
> report, only one column of the table (foo.id) is mentioned.  foo.id
> can be driven by ag1.aid OR ag2.aid OR ag3.aid OR ..., but not more
> than one of those at a time.

In the example, we do figure out that the clauses are redundant, but only further downstream.  The code that's got the problem can't really assume such a thing.  As patched, it will indeed limit what it considers to at most one additional clause per index column, once it's hit the heuristic limit --- but it's entirely possible for it to miss useful combinations because of that.

			regards, tom lane
On Mon, Nov 5, 2012 at 2:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Nov 5, 2012 at 2:05 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> There are three different parameterized paths we could create: one
>>> relying on x only, one relying on y only, one relying on both.
>
>> Sure, but that example is different from the test case provided in the
>> bug report.  I agree that here we need to try paths parameterized by
>> a, b, or both a and b.  Things might blow up multiplicatively, because
>> we have join clauses referencing both t.a and t.b.  But they shouldn't
>> blow up exponentially, because each of t.a and t.b can only be
>> parameterized by ONE thing (I think).
>
> Um, no.  This is a useful counterexample:
>
> 	WHERE t.a > x.c1 AND t.a < y.c2
>
> With a range constraint like this one, it's possible for the
> doubly-parameterized path to be quite useful while either
> singly-parameterized path is basically useless.  And these examples
> aren't even going into cases you might get with non-btree indexes,
> where clauses could interact in much more complicated ways.

Well, OK.  So maybe you also need the operator to be the same as well.

>> And in the example in the bug
>> report, only one column of the table (foo.id) is mentioned.  foo.id
>> can be driven by ag1.aid OR ag2.aid OR ag3.aid OR ..., but not more
>> than one of those at a time.
>
> In the example, we do figure out that the clauses are redundant, but
> only further downstream.  The code that's got the problem can't really
> assume such a thing.  As patched, it will indeed limit what it considers
> to at most one additional clause per index column, once it's hit the
> heuristic limit --- but it's entirely possible for it to miss useful
> combinations because of that.

Seems unfortunate, but I don't understand the code well enough to know how to do better.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Nov 5, 2012 at 2:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Um, no.  This is a useful counterexample:
>> 	WHERE t.a > x.c1 AND t.a < y.c2

> Well, OK.  So maybe you also need the operator to be the same as well.

Nope.  A counterexample to that claim is a GIN index on an array column:

	WHERE t.arraycol @> array[1,2,3] AND t.arraycol @> array[4,5,6]

This restriction is equivalent to

	WHERE t.arraycol @> array[1,2,3,4,5,6]

which is substantially more selective than either constraint alone.  If the two RHS arrays are not constants, but are coming from different tables x and y, then we have something isomorphic to the previous example (at least from the perspective of indxpath.c), but it would not be good for indxpath.c to assume that these clauses couldn't be useful together.

We *can* make a simplifying assumption of the kind you suggest when we know that the clauses were all generated from the same equivalence class, because then we have very strong assumptions about what the clauses' semantics are.  (And indeed the patch does take care of that case separately.)  But for the general case of non-equijoin clauses we can't assume very much at all about whether clauses are redundant, at least not without knowledge that indxpath.c hasn't got.

>> ....  As patched, it will indeed limit what it considers
>> to at most one additional clause per index column, once it's hit the
>> heuristic limit --- but it's entirely possible for it to miss useful
>> combinations because of that.

> Seems unfortunate, but I don't understand the code well enough to know
> how to do better.

Me either.  What I will say is that as patched, the code will still find all useful clause combinations as long as there aren't too many other relations involved.  I've not been able to think of another way of restricting the search that doesn't reject possibly-useful combinations even in otherwise-very-simple queries.

			regards, tom lane
On Mon, Nov 5, 2012 at 3:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Nov 5, 2012 at 2:44 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Um, no.  This is a useful counterexample:
>>> 	WHERE t.a > x.c1 AND t.a < y.c2
>
>> Well, OK.  So maybe you also need the operator to be the same as well.
>
> Nope.  A counterexample to that claim is a GIN index on an array column:
>
> 	WHERE t.arraycol @> array[1,2,3] AND t.arraycol @> array[4,5,6]
>
> This restriction is equivalent to
>
> 	WHERE t.arraycol @> array[1,2,3,4,5,6]
>
> which is substantially more selective than either constraint alone.
> If the two RHS arrays are not constants, but are coming from different
> tables x and y, then we have something isomorphic to the previous
> example (at least from the perspective of indxpath.c), but it would
> not be good for indxpath.c to assume that these clauses couldn't be
> useful together.

Neat example.

> We *can* make a simplifying assumption of the kind you suggest when
> we know that the clauses were all generated from the same equivalence
> class, because then we have very strong assumptions about what the
> clauses' semantics are.  (And indeed the patch does take care of that
> case separately.)  But for the general case of non-equijoin clauses
> we can't assume very much at all about whether clauses are redundant,
> at least not without knowledge that indxpath.c hasn't got.

OK.  Fortunately, I don't think we need to care too much about that case, since non-equijoins are pretty rare.  A reasonable heuristic restriction seems fine for that case ... at least until the next problem case shows up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company