Re: Limiting the number of parameterized indexpaths created - Mailing list pgsql-hackers
| From | Tom Lane |
|---|---|
| Subject | Re: Limiting the number of parameterized indexpaths created |
| Date | |
| Msg-id | 15514.1351712659@sss.pgh.pa.us |
| In response to | Re: Limiting the number of parameterized indexpaths created (Simon Riggs <simon@2ndQuadrant.com>) |
| Responses | Re: Limiting the number of parameterized indexpaths created |
| List | pgsql-hackers |
Simon Riggs <simon@2ndQuadrant.com> writes:
> We should be able to do a better job of recognising some other
> aspect/symmetry of this situation and then simplifying the problem
> from that. Calculating all paths treats the problem as a complete
> unknown and we must be able to do better than that. If we look at the
> problem from the perspective of how we would handle it if one of the
> tables was very large, how would that change things? Can we use a
> greedy algorithm starting with largest table?

Doesn't seem particularly relevant. The issue is which parameterized paths to generate for the current index of the current table. It could possibly make sense to try harder if the current table is large than if it's small --- but no matter how large it is, we can't afford to enumerate all 2^N possibilities if we have join clauses linking to N other relations and N is more than a handful.

The reason this is a problem is that you might have indexes, or even individual join clauses, that require multiple other rels to be used effectively. Suppose we have an index t1ab on t1(a, b) and a query

	WHERE t1.a = t2.x AND t1.b = t3.y

If t1 is large while t2 and t3 are both small, the best plan is to join t2 and t3 first (even if that's a cartesian join) and then use both t2.x and t3.y as parameters for an indexscan on t1ab. The same observation can apply even for single index columns and single join clauses:

	WHERE t1.a = (t2.x + t3.y)

So the context where this is an issue is that we're considering the specific index t1ab and wondering what parameterized indexscan paths to create using it. To find a good plan in these examples, we have to create a path that's parameterized by both t2 and t3. It's not so hard to consider that possibility here, but what if there are 30 different join clauses (linking to 30 other rels) that could be used with the same index?
We can't afford to try all the combinations exhaustively, and even if we could, it's just about certain that most of them aren't worth our time.

Before 9.2, we didn't have this problem in this guise, because we didn't generate parameterized paths bottom-up; instead, given that we had already decided to join t2 and t3 for some reason, we would look to see what indexpaths we could make for t1 using values from t2 and/or t3. However, that approach had big problems of its own: it couldn't find plans that involved pushing parameters down through an intervening join.

Another point here is that no such path can get chosen unless the planner decides to consider joining t2 and t3 to begin with --- and if that's a cartesian join, it will reject the idea out of hand. In the previous coding there was basically no realistic fix for that; if we let the thing try useless cartesian joins, planning time on large queries will go to hell in a handbasket. With the new code, we could in principle make note of interesting parameterized paths that require multiple outer relations, and then allow those specific combinations of relations to get cartesian-joined. I've not had time to pursue this yet, but I think it should happen sometime.

So I think bottom-up creation of parameterized paths is better than what we were doing pre-9.2, but we do have to stop it from going crazy when there are too many join clauses for the same index. Now the $64 question is how hard we have to try to find the "best" alternative when there are too many. I think queries in which there are a lot of join clauses touching the same index, but the clauses are not related by equivalence classes, are probably pretty pathological. It's questionable whether we should spend a lot of skull sweat on such cases, or for that matter whether we'd even recognize an especially good solution if we saw it.
One idea that occurred to me while typing this up is to restrict the maximum number of other rels that we'll consider together; ie, when we have 30 other rels, don't try to form all 2^30 subsets, but only those containing at most K rels, for some pretty small K (say, the number of index columns). But I'm not really sure it's worth the effort.

			regards, tom lane