Thread: Review remove {join,from}_collapse_limit, add enable_join_ordering
Hi Robert, Hi all,

The patch applies cleanly and works as intended - no surprise there. After the changes the documentation is at least as easy to understand as before, and the code changes look sensible. Not surprisingly, that is not the area where I expected problems ;-)

For performance testing I replayed query logs from sites I could easily get my hands on (three different, halfway interesting ones). I found no relevant differences on the first site, which makes sense because {from,join}_collapse_limit wasn't reached there anyway.

More interesting are the queries from the two sites that run reporting queries. On the first, simpler schema I saw on average a 30% increase in planning time and a 40% decrease in execution time. Most of the plans stayed the same; only a few changed radically (in both directions). There were no big differences between geqo=on and geqo=off.

The queries on the second reporting schema unfortunately behave differently. It's the one from which I copied the crazy example I attached in the original thread. With geqo=off a good part of the queries in daily use need too much memory to plan sensibly, and with geqo=on some fail outright with "Error: Failed to make a valid plan". I stopped trying to take performance measurements there. Notably, even some plans which were plannable in reasonable time before are now problematic with enable_join_ordering=false!

I agree that those queries are crazy, but I am not sure how many more of them are out there...

So, while I think the changes are in principle a good idea - {from,join}_collapse_limit are somewhat confusing options - I personally do not think geqo is ready for this today, especially as the benefit is relatively small. If I am the only one with access to such complicated queries, that's fine - I am working on that site's query generation and schema anyway. Could some other people with complicated queries check how they work out with these changes? It should be enough to test with a very large {join,from}_collapse_limit (a sample session is sketched below). Kevin?

I also have to admit that I somewhat like the current behaviour, at least in theory: you can have a view with a hand-optimized JOIN order which will not get inlined and/or reordered, use it together with something unoptimized, and the unoptimized part will still be reordered in many cases...

I found it somewhat hard to review a patch where my opinion was biased from the beginning. As Tom listed himself as a reviewer, I will happily (err?) concede to his and your judgement.

Andres
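Checking this on another schema should only need a session along these lines; the limit value and the query are placeholders - any value larger than the number of relations being joined will do:

    -- raise the limits well past the number of joined relations, so the
    -- whole join problem is handed to the planner (or GEQO) in one piece
    SET from_collapse_limit = 1000;
    SET join_collapse_limit = 1000;
    \timing on
    EXPLAIN ANALYZE SELECT ...;     -- your own many-join reporting query

    -- and for comparison, the stock settings
    RESET from_collapse_limit;
    RESET join_collapse_limit;
    EXPLAIN ANALYZE SELECT ...;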
Andres Freund <andres@anarazel.de> writes:
> The queries on the second reporting schema unfortunately are different. Its the
> one were I copied the crazy example I attached in the original thread.
> With geqo=off a good part of the queries used daily use too much memory to plan
> sensibly and geqo=on outright fails with:
> "Error: Failed to make a valid plan"
> on some.

We're not going to be able to fix this unless you show us examples.

> Noticeable even some plans which were plannable in reasonable time before now
> are problematic with enable_join_ordering=false!

And this even more so --- it doesn't make any sense at all.

> So, while I think the changes are principally a good idea, as
> {from,join}_collapse_limit are a bit confusing options, I personally! do not
> think geqo is ready for it today, especially as the benefit is relatively
> small.

In general I think that any such bugs are there anyway and need to be
fixed anyway.

            regards, tom lane
On Thursday 16 July 2009 15:13:02 Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > The queries on the second reporting schema unfortunately are different.
> > Its the one were I copied the crazy example I attached in the original
> > thread. With geqo=off a good part of the queries used daily use too much
> > memory to plan sensibly and geqo=on outright fails with:
> > "Error: Failed to make a valid plan"
> > on some.
> We're not going to be able to fix this unless you show us examples.

In the other thread I attached something similar to the real schema + example
query. Not enough? And why?

> > Noticeable even some plans which were plannable in reasonable time before
> > now are problematic with enable_join_ordering=false!
> And this even more so --- it doesn't make any sense at all.

Why? With a high from_collapse_limit more subqueries get inlined - before
inlining they get planned separately.

> > So, while I think the changes are principally a good idea, as
> > {from,join}_collapse_limit are a bit confusing options, I personally! do
> > not think geqo is ready for it today, especially as the benefit is
> > relatively small.
> In general I think that any such bugs are there anyway and need to be
> fixed anyway.

Understandable.

Andres
On Thursday 16 July 2009 15:18:01 Andres Freund wrote:
> On Thursday 16 July 2009 15:13:02 Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > The queries on the second reporting schema unfortunately are different.
> > > Its the one were I copied the crazy example I attached in the original
> > > thread. With geqo=off a good part of the queries used daily use too
> > > much memory to plan sensibly and geqo=on outright fails with:
> > > "Error: Failed to make a valid plan"
> > > on some.
> >
> > We're not going to be able to fix this unless you show us examples.
>
> In the other thread I attached a similar to the real schema + example
> query. Not enough? And why?

For reference:
http://archives.postgresql.org/message-id/200907091700.43411.andres@anarazel.de

Andres
Andres Freund <andres@anarazel.de> writes:
> On Thursday 16 July 2009 15:13:02 Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> "Error: Failed to make a valid plan"
>> We're not going to be able to fix this unless you show us examples.

> In the other thread I attached a similar to the real schema + example query.
> Not enough? And why?

I tried the example query and couldn't get "Failed to make a valid plan"
out of it ... what settings do you need for that?

However, I do observe that this seems a sufficient counterexample
against the theory that we can just remove the collapse limits and let
GEQO save us on very complex queries. On my machine, the example query
takes about 22 seconds to plan using CVS HEAD w/ all default settings.
If I set both collapse_limit variables to very high values (I used 999),
it takes ... um ... not sure; I gave up waiting after half an hour.
I also tried with geqo_effort reduced to the minimum of 1, but that
didn't produce a plan in reasonable time either (I gave up after ten
minutes). So if we remove the collapse limits, Postgres will completely
fail on this query --- the only way out would be enable_join_ordering =
off, which is hardly likely to produce a decent plan.

Maybe we should leave the collapse_limit logic alone and address
Robert's gripes by just raising the default values a lot (I'm thinking
100 or so). That way there's an escape hatch for anyone who has
pathological queries to deal with --- just dial the settings down.

>>> Noticeable even some plans which were plannable in reasonable time before
>>> now are problematic with enable_join_ordering=false!
>> And this even more so --- it doesn't make any sense at all.

> Why? With a high from_collapse_limit more subqueries get inlined - before
> inlining they get planned separately.

Okay, I misunderstood which two cases you were comparing there.

            regards, tom lane
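If the defaults were raised to something like 100, the escape hatch Tom describes would presumably be an override at session or role level, for example (the role name is purely illustrative):

    -- per session, before running the pathological query
    SET join_collapse_limit = 8;
    SET from_collapse_limit = 8;

    -- or persistently for the role that issues such queries
    ALTER ROLE reporting_user SET join_collapse_limit = 8;
    ALTER ROLE reporting_user SET from_collapse_limit = 8;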
On Thu, Jul 16, 2009 at 4:16 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> However, I do observe that this seems a sufficient counterexample
> against the theory that we can just remove the collapse limits and let
> GEQO save us on very complex queries. On my machine, the example query
> takes about 22 seconds to plan using CVS HEAD w/ all default settings.
> If I set both collapse_limit variables to very high values (I used 999),
> it takes ... um ... not sure; I gave up waiting after half an hour.

What's the point of GEQO if it doesn't guarantee to produce the
optimal plan and *also* doesn't guarantee to produce some plan, any
plan, within some reasonable amount of time? Either we need to fix
that, or else I don't see what it's buying us over our regular planner,
which also might not produce a plan within a reasonable amount of time,
but at least if it does it'll be the right plan.

--
greg
http://mit.edu/~gsstark/resume.pdf
I wrote:
> If I set both collapse_limit variables to very high values (I used 999),
> it takes ... um ... not sure; I gave up waiting after half an hour.
> I also tried with geqo_effort reduced to the minimum of 1, but that
> didn't produce a plan in reasonable time either (I gave up after ten
> minutes).

After I gave up letting the machine be idle to get a fair timing,
I turned on oprofile monitoring. It looks a bit interesting:

samples   %        image name    symbol name
886498    53.8090  postgres      have_relevant_eclass_joinclause
460596    27.9574  postgres      bms_overlap
142764     8.6655  postgres      bms_is_subset
126274     7.6646  postgres      have_join_order_restriction
14205      0.8622  postgres      list_nth_cell
2721       0.1652  postgres      generate_join_implied_equalities
2445       0.1484  libc-2.9.so   memset
2202       0.1337  postgres      have_relevant_joinclause
1678       0.1019  postgres      make_canonical_pathkey
1648       0.1000  postgres      pfree
884        0.0537  postgres      bms_union
762        0.0463  postgres      gimme_tree
660        0.0401  libc-2.9.so   memcpy
571        0.0347  postgres      AllocSetFree
475        0.0288  postgres      AllocSetAlloc
431        0.0262  postgres      has_relevant_eclass_joinclause
389        0.0236  postgres      check_list_invariants
260        0.0158  postgres      join_is_legal
238        0.0144  postgres      bms_copy

So maybe a redesign of the equivalence-class joinclause mechanism is in
order. Still, this is unlikely to fix the fundamental issue that the
time for large join problems grows nonlinearly.

            regards, tom lane
On Thu, Jul 16, 2009 at 04:27:39PM +0100, Greg Stark wrote:
> On Thu, Jul 16, 2009 at 4:16 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> > However, I do observe that this seems a sufficient counterexample
> > against the theory that we can just remove the collapse limits and let
> > GEQO save us on very complex queries. On my machine, the example query
> > takes about 22 seconds to plan using CVS HEAD w/ all default settings.
> > If I set both collapse_limit variables to very high values (I used 999),
> > it takes ... um ... not sure; I gave up waiting after half an hour.
>
> What's the point of GEQO if it doesn't guarantee to produce the
> optimal plana and *also* doesn't guarantee to produce some plan, any
> plan, within some reasonable amount of time? Either we need to fix
> that or else I don't see what it's buying us over our regular planner
> which also might not produce a plan within a reasonable amount of time
> but at least if it does it'll be the right plan.
>

I do agree that we should have an actual time-limit cap for GEQO that
would have it return the best plan found so far when the time runs out.
Then you can at least bound your planning time.

Regards,
Ken
On Thursday 16 July 2009 17:16:31 Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On Thursday 16 July 2009 15:13:02 Tom Lane wrote:
> >> Andres Freund <andres@anarazel.de> writes:
> >>> "Error: Failed to make a valid plan"
> >>
> >> We're not going to be able to fix this unless you show us examples.
> >
> > In the other thread I attached a similar to the real schema + example
> > query. Not enough? And why?
>
> I tried the example query and couldn't get "Failed to make a valid plan"
> out of it ... what settings do you need for that?

It unfortunately depends on settings and luck. This dependence on luck was
the reason why I liked geqo to behave "somewhat" deterministically...

With {join,from}_collapse_limit = 100 it seems to be triggered reliably.
With lower values it seems harder to trigger, with bigger ones it simply
takes too long to even get there.

Efficiency-wise, with geqo and higher limits nearly all of the time is
spent in:

    geqo
      gimme_tree
        have_join_order_restriction
          has_legal_joinclause
            have_relevant_joinclause
              have_relevant_eclass (30% self)
                bms_overlap (50% self)

I am not yet fully understanding geqo, but it looks like there are some
possibilities to improve this. Although such efficiency improvements would
not explain the completely failing plans...

Do you have an idea which kinds of plans benefit most from using geqo? I
had a somewhat hard time finding any query where geqo was substantially
faster than the standard join search. That also somewhat explains why I
saw improvements with 64bit bitmapsets...

> However, I do observe that this seems a sufficient counterexample
> against the theory that we can just remove the collapse limits and let
> GEQO save us on very complex queries. On my machine, the example query
> takes about 22 seconds to plan using CVS HEAD w/ all default settings.
> If I set both collapse_limit variables to very high values (I used 999),
> it takes ... um ... not sure; I gave up waiting after half an hour.
> I also tried with geqo_effort reduced to the minimum of 1, but that
> didn't produce a plan in reasonable time either (I gave up after ten
> minutes). So if we remove the collapse limits, Postgres will completely
> fail on this query --- the only way out would be enable_join_ordering =
> off, which is hardly likely to produce a decent plan.
> Maybe we should leave the collapse_limit logic alone and address
> Robert's gripes by just raising the default values a lot (I'm thinking
> 100 or so). That way there's an escape hatch for anyone who has
> pathological queries to deal with --- just dial the settings down.

Yes, I think that's sensible. I don't know whether there are any queries
out there that benefit from higher limits.

Andres
On Thu, Jul 16, 2009 at 4:32 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> samples   %        image name    symbol name
> 886498    53.8090  postgres      have_relevant_eclass_joinclause
> 460596    27.9574  postgres      bms_overlap
>
> So maybe a redesign of the equivalence-class joinclause mechanism is in
> order. Still, this is unlikely to fix the fundamental issue that the
> time for large join problems grows nonlinearly.

Perhaps it's GEQO's fault that it's using these functions
inappropriately, calling them often to calculate these answers whenever
it needs them instead of looking once for join clauses and then
optimizing based on the results. But I've never actually looked at geqo;
maybe that's inherent in the design?

--
greg
http://mit.edu/~gsstark/resume.pdf
Greg Stark <gsstark@mit.edu> writes:
> On Thu, Jul 16, 2009 at 4:32 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
>> So maybe a redesign of the equivalence-class joinclause mechanism is in
>> order. Still, this is unlikely to fix the fundamental issue that the
>> time for large join problems grows nonlinearly.

> Perhaps it's GEQO's fault that it's using these functions
> inappropriately, calling them often to calculate these answers
> whenever it needs them instead of looking once for join clauses and
> then optimizing based on the results. But I've never actually looked
> at geqo, mabe that's inherent in the design?

geqo isn't doing anything the regular planner wouldn't do under similar
conditions. It might well be that better caching is the answer to this
particular problem, but I don't have time to look closer today.

            regards, tom lane
On Thursday 16 July 2009 17:27:39 Greg Stark wrote:
> On Thu, Jul 16, 2009 at 4:16 PM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> > However, I do observe that this seems a sufficient counterexample
> > against the theory that we can just remove the collapse limits and let
> > GEQO save us on very complex queries. On my machine, the example query
> > takes about 22 seconds to plan using CVS HEAD w/ all default settings.
> > If I set both collapse_limit variables to very high values (I used 999),
> > it takes ... um ... not sure; I gave up waiting after half an hour.
> What's the point of GEQO if it doesn't guarantee to produce the
> optimal plana and *also* doesn't guarantee to produce some plan, any
> plan, within some reasonable amount of time? Either we need to fix
> that or else I don't see what it's buying us over our regular planner
> which also might not produce a plan within a reasonable amount of time
> but at least if it does it'll be the right plan.

Well, I could not find a plan where it errored out with the old limits,
so one could argue it's just not adapted to this. On the other hand, I
also could not find a single case where geqo was relevantly faster with
the default settings, even when it was used. The default settings
currently make it relatively hard to trigger geqo at all.

Andres
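For reference, these are the planner settings under discussion, with the stock defaults of that era (GEQO only takes over once at least geqo_threshold relations have to be join-ordered, and the collapse limits keep most queries below that threshold):

    SHOW geqo;                  -- on
    SHOW geqo_threshold;        -- 12
    SHOW from_collapse_limit;   -- 8
    SHOW join_collapse_limit;   -- 8
    SHOW geqo_effort;           -- 5, valid range 1..10
    SHOW enable_join_ordering;  -- the new GUC added by the patch under review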
Andres Freund <andres@anarazel.de> writes:
> The default settings currently make it relatively hard to trigger geqo at all.

Yes, and that was intentional. One of the implications of what we're
discussing here is that geqo would get used a lot more for "typical
complex queries" (if there is any such thing as a typical one). So
it's fully to be expected that the fallout would be pressure to improve
geqo in various ways.

Given that we are at the start of the development cycle, that prospect
doesn't scare me --- there's plenty of time to fix whatever needs
fixing. However, I am leaning to the feeling that I don't want to be
putting people in a position where they have no alternative but to use
geqo. So adjusting rather than removing the collapse limits is seeming
like a good idea.

            regards, tom lane
Andres Freund <andres@anarazel.de> writes:
> On Thursday 16 July 2009 17:16:31 Tom Lane wrote:
>> I tried the example query and couldn't get "Failed to make a valid plan"
>> out of it ... what settings do you need for that?

> It unfortunately depends on settings and luck. This dependence on luck was the
> reason why I liked geqo to behave "somewhat" deterministically...

> With {join,from}_collapse_limit = 100 it seems to be triggered reliably. With
> lower values it seems harder trigger, with bigger it simply takes too long to
> even get there.

OK, I see it at 100. Would you confirm that what you get is the failure
in random_init_pool (geqo_pool.c) not the identically-phrased message
elsewhere? (If you have VERBOSITY = verbose you should see the error
location info.)

            regards, tom lane
On Thursday 16 July 2009 18:23:06 Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On Thursday 16 July 2009 17:16:31 Tom Lane wrote:
> >> I tried the example query and couldn't get "Failed to make a valid plan"
> >> out of it ... what settings do you need for that?
> >
> > It unfortunately depends on settings and luck. This dependence on luck
> > was the reason why I liked geqo to behave "somewhat" deterministically...
> >
> > With {join,from}_collapse_limit = 100 it seems to be triggered reliably.
> > With lower values it seems harder trigger, with bigger it simply takes
> > too long to even get there.
>
> OK, I see it at 100. Would you confirm that what you get is the failure
> in random_init_pool (geqo_pool.c) not the identically-phrased message
> elsewhere? (If you have VERBOSITY = verbose you should see the error
> location info.)

Yes, it is the one in random_init_pool. I should have seen that; it's not
exactly surprising...

Andres
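For anyone trying to reproduce this, the error location Tom asks about can be seen from psql with verbose error reporting; roughly like this (the query itself is a placeholder for the large reporting query from the other thread):

    \set VERBOSITY verbose
    SET geqo = on;
    SET from_collapse_limit = 100;
    SET join_collapse_limit = 100;
    EXPLAIN SELECT ...;
    -- with verbose output the ERROR is followed by a LOCATION line naming
    -- the reporting function and source file, which is what distinguishes
    -- the random_init_pool (geqo_pool.c) failure from the identically
    -- phrased message elsewhere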
On Thursday 16 July 2009 17:59:58 Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > The default settings currently make it relatively hard to trigger geqo at
> > all.
>
> Yes, and that was intentional. One of the implications of what we're
> discussing here is that geqo would get used a lot more for "typical
> complex queries" (if there is any such thing as a typical one). So
> it's fully to be expected that the fallout would be pressure to improve
> geqo in various ways.
>
> Given that we are at the start of the development cycle, that prospect
> doesn't scare me --- there's plenty of time to fix whatever needs
> fixing. However, I am leaning to the feeling that I don't want to be
> putting people in a position where they have no alternative but to use
> geqo. So adjusting rather than removing the collapse limits is seeming
> like a good idea.

Hm. I see a somewhat more fundamental problem with geqo: I tried several
queries, and I found not a single one where the whole genetic process made
any significant improvement to the 'worth'. It seems that the best variant
out of the initial pool is always either the path chosen in the end, or at
least the cost difference is _really_ low.

Andres
On Thu, Jul 16, 2009 at 12:49 PM, Andres Freund<andres@anarazel.de> wrote:
> On Thursday 16 July 2009 17:59:58 Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>> > The default settings currently make it relatively hard to trigger geqo at
>> > all.
>>
>> Yes, and that was intentional. One of the implications of what we're
>> discussing here is that geqo would get used a lot more for "typical
>> complex queries" (if there is any such thing as a typical one). So
>> it's fully to be expected that the fallout would be pressure to improve
>> geqo in various ways.
>>
>> Given that we are at the start of the development cycle, that prospect
>> doesn't scare me --- there's plenty of time to fix whatever needs
>> fixing. However, I am leaning to the feeling that I don't want to be
>> putting people in a position where they have no alternative but to use
>> geqo. So adjusting rather than removing the collapse limits is seeming
>> like a good idea.
>
> Hm. I see a, a bit more fundamental problem with geqo:
> I tried several queries, and I found not a single one, where the whole
> genetical process did any significant improvments to the 'worth'.
> It seems that always the best variant out of the pool is either the path
> choosen in the end, or at least the cost difference is _really_ low.

Ouch. Did you insert some debugging code to get that information, or
how did you come to that conclusion?

...Robert
On Thu, Jul 16, 2009 at 11:32 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> I wrote:
>> If I set both collapse_limit variables to very high values (I used 999),
>> it takes ... um ... not sure; I gave up waiting after half an hour.
>> I also tried with geqo_effort reduced to the minimum of 1, but that
>> didn't produce a plan in reasonable time either (I gave up after ten
>> minutes).
>
> After I gave up letting the machine be idle to get a fair timing,
> I turned on oprofile monitoring. It looks a bit interesting:

That is interesting, but there's not really enough detail here to see
what is going on. I'm more interested in what the high-level
functions are doing that's causing these guys to be called so many
times. As Greg says, if the planning time curve for GEQO isn't better
than the one for the standard planner, it's the epitome of pointless.

> So maybe a redesign of the equivalence-class joinclause mechanism is in
> order. Still, this is unlikely to fix the fundamental issue that the
> time for large join problems grows nonlinearly.

Nonlinear is one thing, but this looks more like exponential. I
understand that the standard planner is exponential; GEQO should not
be.

...Robert
On Thursday 16 July 2009 19:13:55 Robert Haas wrote:
> On Thu, Jul 16, 2009 at 12:49 PM, Andres Freund<andres@anarazel.de> wrote:
> > On Thursday 16 July 2009 17:59:58 Tom Lane wrote:
> >> Andres Freund <andres@anarazel.de> writes:
> >> > The default settings currently make it relatively hard to trigger geqo
> >> > at all.
> >>
> >> Yes, and that was intentional. One of the implications of what we're
> >> discussing here is that geqo would get used a lot more for "typical
> >> complex queries" (if there is any such thing as a typical one). So
> >> it's fully to be expected that the fallout would be pressure to improve
> >> geqo in various ways.
> >>
> >> Given that we are at the start of the development cycle, that prospect
> >> doesn't scare me --- there's plenty of time to fix whatever needs
> >> fixing. However, I am leaning to the feeling that I don't want to be
> >> putting people in a position where they have no alternative but to use
> >> geqo. So adjusting rather than removing the collapse limits is seeming
> >> like a good idea.
> >
> > Hm. I see a, a bit more fundamental problem with geqo:
> > I tried several queries, and I found not a single one, where the whole
> > genetical process did any significant improvments to the 'worth'.
> > It seems that always the best variant out of the pool is either the path
> > choosen in the end, or at least the cost difference is _really_ low.
>
> Ouch. Did you insert some debugging code to get that information, or
> how did you come to that conclusion?

Yes, I enabled GEQO_DEBUG and added some more debugging output.
Btw, a higher generation count does not change that.

Andres
Robert Haas <robertmhaas@gmail.com> writes:
> On Thu, Jul 16, 2009 at 11:32 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
>> So maybe a redesign of the equivalence-class joinclause mechanism is in
>> order. Still, this is unlikely to fix the fundamental issue that the
>> time for large join problems grows nonlinearly.

> Nonlinear is one thing, but this looks more like exponential. I
> understand that the standard planner is exponential; GEQO should not
> be.

Well, the equivclass code is new as of 8.3. It's possible that this
got broken relatively recently ...

            regards, tom lane
On Thursday 16 July 2009 19:22:30 Robert Haas wrote:
> On Thu, Jul 16, 2009 at 11:32 AM, Tom Lane<tgl@sss.pgh.pa.us> wrote:
> > I wrote:
> >> If I set both collapse_limit variables to very high values (I used 999),
> >> it takes ... um ... not sure; I gave up waiting after half an hour.
> >> I also tried with geqo_effort reduced to the minimum of 1, but that
> >> didn't produce a plan in reasonable time either (I gave up after ten
> >> minutes).
> >
> > After I gave up letting the machine be idle to get a fair timing,
> > I turned on oprofile monitoring. It looks a bit interesting:
>
> That is interesting, but there's not really enough detail here to see
> what is going on. I'm more interested in what the high-level
> functions are doing that's causing these guys to be called so many
> times. As Greg says, if the planning time curve for GEQO isn't better
> than the one for the standard planner, it's the epitome of pointless.

It is not the actual genetic search, I have now found out (or, more
precisely, now that I have read the trace correctly). At the start of the
query GEQO fills a pool with random paths through the search space.
Unfortunately a random path is not very likely to succeed, so it checks
and checks and...

That's why the problem is not visible with a simple join of 100 or so
tables - all paths are valid there...

Andres
On Thu, Jul 16, 2009 at 06:49:08PM +0200, Andres Freund wrote:
> On Thursday 16 July 2009 17:59:58 Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > The default settings currently make it relatively hard to trigger geqo at
> > > all.
> >
> > Yes, and that was intentional. One of the implications of what we're
> > discussing here is that geqo would get used a lot more for "typical
> > complex queries" (if there is any such thing as a typical one). So
> > it's fully to be expected that the fallout would be pressure to improve
> > geqo in various ways.
> >
> > Given that we are at the start of the development cycle, that prospect
> > doesn't scare me --- there's plenty of time to fix whatever needs
> > fixing. However, I am leaning to the feeling that I don't want to be
> > putting people in a position where they have no alternative but to use
> > geqo. So adjusting rather than removing the collapse limits is seeming
> > like a good idea.
>
> Hm. I see a, a bit more fundamental problem with geqo:
> I tried several queries, and I found not a single one, where the whole
> genetical process did any significant improvments to the 'worth'.
> It seems that always the best variant out of the pool is either the path
> choosen in the end, or at least the cost difference is _really_ low.
>
> Andres

Hi Andres,

From some of my reading of the literature on join order optimization via
random sampling, such as what would establish the initial GEQO pool, there
is a very good possibility of having a "pretty good" plan in the first
pool, especially for our larger initial pool sizes of 100-1000. And in
fact, the final plan has a good chance of being of approximately the same
cost as a member of the initial pool. Uniform sampling alone can give you
a close-to-optimum plan 80% of the time with an initial sample size of
100, and using biased sampling raises that to 99% or better.

Regards,
Ken