Thread: Avoiding bad prepared-statement plans.
I've been discussing this with Josh, Heikki, and Peter E. over the past few weeks.

As Peter observed years ago, prepared statements can perform badly because their plans are overly generic. Also, statistics change and sometimes plans should change with them. It would be nice if we could avoid users having to worry about these things.

I have some ideas that I'm willing to work on, if people agree that they're useful. These are simple changes; the goal is to avoid pathological performance at minimal cost, not to make prepared statements faster than parameterized ones for everyone. The ideas interact in various ways.

= Projected-cost threshold =

If a prepared statement takes parameters, and the generic plan has a high projected cost, re-plan each EXECUTE individually with all its parameter values bound. It may or may not help, but unless the planner is vastly over-pessimistic, re-planning isn't going to dominate execution time for these cases anyway.

= Actual-cost threshold =

Also stop using the generic plan if the statement takes a long time to run in practice. Statistics may have gone bad. It could also be a one-off due to a load peak or something, but that's handled by:

= Plan refresh =

Periodically re-plan prepared statements on EXECUTE. This is also a chance for queries that were being re-planned every time to go back to a generic plan.

Does that sound useful? Maybe it's too much engineering for little gain compared to re-planning every EXECUTE. OTOH it doesn't look that much harder than going that way either. And maybe there's some extra planning effort that might be worthwhile for a reusable plan but not for an individual query.

Jeroen
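As a concrete illustration of the kind of statement the projected-cost threshold is aimed at, here is a minimal SQL sketch. The table and index are hypothetical, and it assumes a text_pattern_ops index (or a C-locale database) so that a literal prefix pattern can be turned into an index range scan, which a generic plan never can:

  CREATE TABLE items (code text);
  CREATE INDEX items_code_idx ON items (code text_pattern_ops);

  PREPARE find_items(text) AS
      SELECT * FROM items WHERE code LIKE $1;

  -- The generic plan knows nothing about the pattern, so every
  -- EXECUTE falls back to a sequential scan:
  EXPLAIN EXECUTE find_items('ABC%');

  -- Planned with the literal value, the same query can use an
  -- index range scan on the prefix:
  EXPLAIN SELECT * FROM items WHERE code LIKE 'ABC%';

The generic plan in the first EXPLAIN carries a high projected cost, which is exactly what would trigger per-EXECUTE re-planning under the proposal above.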
Yeb Havinga wrote: >> I've been discussing this with Josh, Heikki, and Peter E. over the >> past few weeks. > Is this searchable in the archives? I'm interested in ideas discussed. No, sorry. These were face-to-face discussions at linux.conf.au and FOSDEM. >> If a prepared statement takes parameters, and the generic plan has a >> high projected cost, re-plan each EXECUTE individually with all its >> parameter values bound. It may or may not help, but unless the >> planner is vastly over-pessimistic, re-planning isn't going to >> dominate execution time for these cases anyway. > This sounds like a really nice to have feature. Maybe it'd also be > possible to skip replanning between executes if the current bound values > are 'indexwise-equivalent' to the values used at previous planning, i.e. > nothing in the statistics indicates that execution cost would be (much) > different. Are there more ways to cut down on planning time? Obviously > some plannedstatement/plannerinfo structures could be kept, but maybe > it'd also be possible to plan only that part of the join tree where the > params are used in a scan/join qual. I think we should be careful not to over-think this. Planning isn't *that* costly, so apply Amdahl's Law liberally. I'm proposing some easy things we could do without adding much overhead or maintenance burden; I've been assuming that getting intimate with the planner would risk those advantages. Jeroen
On 09/02/10 12:08, Jeroen Vermeulen wrote: > > = Projected-cost threshold = [snip - this is the simple bit. Sounds very sensible. ] > = Actual-cost threshold = > > Also stop using the generic plan if the statement takes a long time to > run in practice. Do you mean: 1. Rollback the current query and start again 2. Mark the plan as a bad one and plan again next execute If you can figure out how to do #1 then you could probably do it for all queries, but I'm guessing it's far from simple to implement. > = Plan refresh = > > Periodically re-plan prepared statements on EXECUTE. This is also a > chance for queries that were being re-planned every time to go back to a > generic plan. Presumably some score based on update stats and vacuum activity etc. The good side of all these ideas is good indeed. The bad side is plan instability. Someone somewhere will have a generic plan that turns out better than the specific plan (due to bad stats or config settings or just planner limitations). The question is (I guess): How many more winners will there be than losers? -- Richard Huxton Archonet Ltd
Jeroen Vermeulen wrote: > I've been discussing this with Josh, Heikki, and Peter E. over the > past few weeks. Is this searchable in the archives? I'm interested in ideas discussed. > If a prepared statement takes parameters, and the generic plan has a > high projected cost, re-plan each EXECUTE individually with all its > parameter values bound. It may or may not help, but unless the > planner is vastly over-pessimistic, re-planning isn't going to > dominate execution time for these cases anyway. This sounds like a really nice to have feature. Maybe it'd also be possible to skip replanning between executes if the current bound values are 'indexwise-equivalent' to the values used at previous planning, i.e. nothing in the statistics indicates that execution cost would be (much) different. Are there more ways to cut down on planning time? Obviously some plannedstatement/plannerinfo structures could be kept, but maybe it'd also be possible to plan only that part of the join tree where the params are used in a scan/join qual. regards, Yeb Havinga
On Tuesday 09 February 2010 13:08:54 Jeroen Vermeulen wrote: > I've been discussing this with Josh, Heikki, and Peter E. over the past > few weeks. > > As Peter observed years ago, prepared statements can perform badly > because their plans are overly generic. Also, statistics change and > sometimes plans should change with them. It would be nice if we could > avoid users having to worry about these things. > > I have some ideas that I'm willing to work on, if people agree that > they're useful. These are simple changes; the goal is to avoid > pathological performance at minimal cost, not to make prepared > statements faster than parameterized ones for everyone. The ideas > interact in various ways. > > > = Projected-cost threshold = > > If a prepared statement takes parameters, and the generic plan has a > high projected cost, re-plan each EXECUTE individually with all its > parameter values bound. It may or may not help, but unless the planner > is vastly over-pessimistic, re-planning isn't going to dominate > execution time for these cases anyway. > > = Actual-cost threshold = > > Also stop using the generic plan if the statement takes a long time to > run in practice. Statistics may have gone bad. It could also be a > one-off due to a load peak or something, but that's handled by: That is not that easy. It means that you have to use savepoints enclosing each and every execution of a prepared statement because the query could have side effects. Which wouldn't be terribly efficient... Andres
Richard Huxton wrote: >> = Actual-cost threshold = >> >> Also stop using the generic plan if the statement takes a long time to >> run in practice. > > Do you mean: > 1. Rollback the current query and start again > 2. Mark the plan as a bad one and plan again next execute > > If you can figure out how to do #1 then you could probably do it for all > queries, but I'm guessing it's far from simple to implement. I'm talking about #2. As a matter of fact #1 did come up in one of those discussions, but how do you know you're not killing the query juuust before it'd done, and then maybe executing a different plan that's no better? >> = Plan refresh = >> >> Periodically re-plan prepared statements on EXECUTE. This is also a >> chance for queries that were being re-planned every time to go back to a >> generic plan. > > Presumably some score based on update stats and vacuum activity etc. I was thinking of something very simple: re-do whatever we'd do if the statement were only being prepared at that point. > The good side of all these ideas is good indeed. The bad side is plan > instability. Someone somewhere will have a generic plan that turns out > better than the specific plan (due to bad stats or config settings or > just planner limitations). The question is (I guess): How many more > winners will there be than losers? That's a good and surprising point, and therefore I'd like to draw attention away to a different point. :-) Yes, there will be losers in the sense that people may have optimized their use of prepared statements to whatever the current planner does. Maybe somebody out there even deliberately uses them to trick the planner into a different plan. But that is always going to happen; we're aiming for better plans, not for giving more detailed control over them. If you really can't take a change, don't upgrade. The competing point is: people out there may currently be forgoing prepared statements entirely because of erratic performance. To those people, if we can help them, it's like having a new feature. Jeroen
Andres Freund wrote: >> = Actual-cost threshold = >> >> Also stop using the generic plan if the statement takes a long time to >> run in practice. Statistics may have gone bad. It could also be a >> one-off due to a load peak or something, but that's handled by: > That is not that easy. It means that you have to use savepoints enclosing each > and every execution of a prepared statement because the query could have > side effects. Which wouldn't be terribly efficient... This is not within an execution of the statement, but across executions. So the next execution learns from the previous result. So I'm not talking about aborting the ongoing execution. Sorry for being unclear. Jeroen
On 09/02/10 14:25, Jeroen Vermeulen wrote: > Richard Huxton wrote: > >>> = Actual-cost threshold = >>> >>> Also stop using the generic plan if the statement takes a long time to >>> run in practice. >> >> Do you mean: >> 1. Rollback the current query and start again >> 2. Mark the plan as a bad one and plan again next execute >> >> If you can figure out how to do #1 then you could probably do it for >> all queries, but I'm guessing it's far from simple to implement. > > I'm talking about #2. As a matter of fact #1 did come up in one of those > discussions, but how do you know you're not killing the query juuust > before it'd done, and then maybe executing a different plan that's no > better? Ah, you'd need to be smarter when planning and also remember the expected rows from each node. That way if your (index driven) inner node was expecting 3 rows you could mark it to force a cancellation if it returns (say) 30 or more. You'd allow more slack in later processing and less slack earlier on where a bad estimate can explode the final number of rows. Or, there is always the case where we reverse-search an index to find the last 10 messages in a group say, but the particular group in question hasn't had a comment for months, so you trawl half the table. People regularly get bitten by that, and there's not much to be done about it. If we could abort when it looks like we're in worst-case rather than best-case scenarios then it would be one less thing for users to worry about. >>> = Plan refresh = >>> >>> Periodically re-plan prepared statements on EXECUTE. This is also a >>> chance for queries that were being re-planned every time to go back to a >>> generic plan. >> >> Presumably some score based on update stats and vacuum activity etc. > > I was thinking of something very simple: re-do whatever we'd do if the > statement were only being prepared at that point. Yes, I thought so, the scoring was for *when* to decide to cancel the old plan. I suppose total query-time would be another way to decide this plan needs reworking. >> The good side of all these ideas is good indeed. The bad side is plan >> instability. Someone somewhere will have a generic plan that turns out >> better than the specific plan (due to bad stats or config settings or >> just planner limitations). The question is (I guess): How many more >> winners will there be than losers? > > That's a good and surprising point, and therefore I'd like to draw > attention away to a different point. :-) > > Yes, there will be losers in the sense that people may have optimized > their use of prepared statements to whatever the current planner does. > Maybe somebody out there even deliberately uses them to trick the > planner into a different plan. But that is always going to happen; we're > aiming for better plans, not for giving more detailed control over them. > If you really can't take a change, don't upgrade. > > The competing point is: people out there may currently be forgoing > prepared statements entirely because of erratic performance. To those > people, if we can help them, it's like having a new feature. Oh, I'm persuaded, but that doesn't really get you anywhere :-) -- Richard Huxton Archonet Ltd
On 02/09/2010 08:46 AM, Jeroen Vermeulen wrote: >> This sounds like a really nice to have feature. Maybe it'd also be >> possible to skip replanning between executes if the current bound >> values are 'indexwise-equivalent' to the values used at previous >> planning, i.e. nothing in the statistics indicates that execution >> cost would be (much) different. Are there more ways to cut down on >> planning time? Obviously some plannedstatement/plannerinfo structures >> could be kept, but maybe it'd also be possible to plan only that part >> of the join tree where the params are used in a scan/join qual. > > I think we should be careful not to over-think this. Planning isn't > *that* costly, so apply Amdahl's Law liberally. I'm proposing some > easy things we could do without adding much overhead or maintenance > burden; I've been assuming that getting intimate with the planner > would risk those advantages. In a current commercial app we have that uses JDBC and prepared plans for just about everything, it regularly ends up with execution times of 30+ milliseconds when a complete plan + execute would take less than 1 millisecond. PostgreSQL planning is pretty fast. In terms of not over thinking things - I think I would even prefer an option that said "always re-plan prepared statements" as a starting point. If it happened to become smarter over time, such that it would have invalidation criteria that would trigger a re-plan, that would be awesome, but in terms of what would help me *today* - being able to convert prepared plans into just a means to use place holders would help me today on certain real applications in production use right now. Cheers, mark -- Mark Mielke<mark@mielke.cc>
On Tue, Feb 9, 2010 at 7:08 AM, Jeroen Vermeulen <jtv@xs4all.nl> wrote: > = Projected-cost threshold = > > If a prepared statement takes parameters, and the generic plan has a high > projected cost, re-plan each EXECUTE individually with all its parameter > values bound. It may or may not help, but unless the planner is vastly > over-pessimistic, re-planning isn't going to dominate execution time for > these cases anyway. How high is high? > = Actual-cost threshold = > > Also stop using the generic plan if the statement takes a long time to run > in practice. Statistics may have gone bad. It could also be a one-off due > to a load peak or something, but that's handled by: > > = Plan refresh = > > Periodically re-plan prepared statements on EXECUTE. This is also a chance > for queries that were being re-planned every time to go back to a generic > plan. The most common problem here seems to be that (some?) MCVs need different treatment than non-MCVs, so I don't think periodically replanning is going to help very much. What might help is something like plan twice, once assuming you have the most common MCV and once assuming you have a non-MCV. If the two plans are same, you're probably safe. Or if you can somehow determine that one of the plans will still be pretty fast in the other case, you can just use that plan across the board. Otherwise, you have to somehow vary the plan based on the actual parameter value. ...Robert
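A minimal sketch of the MCV situation described above, using a hypothetical table whose value distribution is an assumption made for illustration (one account holding the vast majority of rows):

  CREATE TABLE orders (account_id int, total numeric);
  CREATE INDEX orders_account_idx ON orders (account_id);
  -- assume account_id = 1 covers ~90% of the rows, so ANALYZE
  -- records it as a most-common value (MCV)

  -- with the MCV, the planner prefers a sequential scan:
  EXPLAIN SELECT * FROM orders WHERE account_id = 1;

  -- with a rare value, it prefers the index:
  EXPLAIN SELECT * FROM orders WHERE account_id = 12345;

  -- a prepared statement has to settle on one generic plan for both,
  -- based on an average selectivity estimate:
  PREPARE get_orders(int) AS SELECT * FROM orders WHERE account_id = $1;
  EXPLAIN EXECUTE get_orders(1);

Planning once for the MCV case and once for a non-MCV case, as suggested, would show whether the two plans actually diverge for a given statement.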
Robert Haas <robertmhaas@gmail.com> writes: > On Tue, Feb 9, 2010 at 7:08 AM, Jeroen Vermeulen <jtv@xs4all.nl> wrote: >> Periodically re-plan prepared statements on EXECUTE. This is also a chance >> for queries that were being re-planned every time to go back to a generic >> plan. > The most common problem here seems to be that (some?) MCVs need > different treatment than non-MCVs, so I don't think periodically > replanning is going to help very much. It won't help at all. The only reason for replanning is if something about the schema or the statistics change, and we already have got automatic cached-plan invalidation in both those cases. If you replan simply because some time has elapsed, you'll just get exactly the same plan. The only case that I think still has any merit is where you get a significantly better plan with known parameter values than without. The projected-cost threshold might be a reasonable approach for attacking that, ie, if estimated cost of generic plan exceeds X then take the time to build a custom plan instead. I'm not sure that really will fix the problem, but it would be a very simple change to make to see how much it helps people. regards, tom lane
Jeroen Vermeulen <jtv@xs4all.nl> writes: > I think we should be careful not to over-think this. Planning isn't *that* > costly, so apply Amdahl's Law liberally. I'm proposing some easy things we > could do without adding much overhead or maintenance burden; I've been > assuming that getting intimate with the planner would risk those advantages. On a project where performance is a must (5ms per query is about all you can ask) I have queries for which planning is 40+ ms and execute 2 to 5 ms (dataset fits in RAM, by design). I'm then abusing pgbouncer so that the PREPARE is shared by a lot of clients, all the ones landing into the session (transaction pooling). See preprepare to get a better idea, even if we're yet to run it (it's being used in production elsewhere, though, I've been told). http://preprepare.projects.postgresql.org/ Regards, -- dim
On Tue, 9 Feb 2010, Mark Mielke wrote: > In a current commercial app we have that uses JDBC and prepared plans for > just about everything, it regularly ends up with execution times of 30+ > milliseconds when a complete plan + execute would take less than 1 > millisecond. > > PostgreSQL planning is pretty fast. In terms of not over thinking things - I > think I would even prefer an option that said "always re-plan prepared > statements" as a starting point. If it happened to become smarter over time, > such that it would have invalidation criteria that would trigger a re-plan, > that would be awesome, but in terms of what would help me *today* - being > able to convert prepared plans into just a means to use place holders would > help me today on certain real applications in production use right now. > The JDBC driver has two methods of disabling permanently planned prepared statements: 1) Use the version two frontend/backend protocol via adding protocolVersion=2 to your URL. This interpolates all parameters into the query on the client side. 2) Execute PreparedStatements using the unnamed statement rather than a named statement via adding prepareThreshold=0 to your URL. A named statement is expected to be re-used for later execution and ignores the passed parameters for planning purposes. An unnamed statement may be re-used, but it doesn't expect to be. The unnamed statement uses the passed parameters for planning purposes, but still cannot make certain optimizations based on the parameter values because it may be re-executed again later with different parameters. For example a LIKE query with a parameter value of 'ABC%' cannot be transformed into a range query because the next execution may use a different parameter value for which the transform is not valid. By default the driver switches to using a named statement after the same PreparedStatement object is executed five times. http://jdbc.postgresql.org/documentation/84/connect.html#connection-parameters http://jdbc.postgresql.org/documentation/84/server-prepare.html Kris Jurka
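For reference, the two approaches Kris describes correspond to connection URLs along these lines (host and database name are placeholders):

  jdbc:postgresql://localhost/mydb?protocolVersion=2
  jdbc:postgresql://localhost/mydb?prepareThreshold=0

With prepareThreshold=0 the driver keeps using the unnamed statement, so each Bind plans with the actual parameter values; the default of 5 means it switches to a named, generically planned statement after the fifth execution of the same PreparedStatement object.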
Kris Jurka wrote: > > The JDBC driver has two methods of disabling permanently planned prepared > statements: > > 1) Use the version two frontend/backend protocol via adding > protocolVersion=2 to your URL. This interpolates all parameters into > the query on the client side. > > 2) Execute PreparedStatements using the unnamed statement rather than a > named statement via adding prepareThreshold=0 to your URL. A named > statement is expected to be re-used for later execution and ignores the > passed parameters for planning purposes. An unnamed statement may be > re-used, but it doesn't expect to be. The unnamed statement uses the > passed parameters for planning purposes, but still cannot make certain > optimizations based on the parameter values because it may be > re-executed again later with different parameters. For example a LIKE > query with a parameter value of 'ABC%' cannot be transformed into a range > query because the next execution may use a different parameter value for > which the transform is not valid. By default the driver switches to using > a named statement after the same PreparedStatement object is executed five > times. > > http://jdbc.postgresql.org/documentation/84/connect.html#connection-parameters > http://jdbc.postgresql.org/documentation/84/server-prepare.html Can someone explain to me why we only do "delayed binding" for unnamed prepared queries? Why do we not allow this option for named protocol prepared queries and SQL prepared queries? Here is what our documentation has in the protocols section: The unnamed prepared statement is likewise planned during Parse processing if the Parse message defines no parameters. But if there are parameters, query planning occurs during Bind processing instead. This allows the planner to make use of the actual values of the parameters provided in the Bind message when planning the query. and here is someone who is having problems with the generic plans we create: http://www.odecee.com.au/blogs/?p=134 Can we not document this better? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Tue, Feb 9, 2010 at 7:08 AM, Jeroen Vermeulen <jtv@xs4all.nl> wrote: > >> Periodically re-plan prepared statements on EXECUTE. This is also a chance > >> for queries that were being re-planned every time to go back to a generic > >> plan. > > > The most common problem here seems to be that (some?) MCVs need > > different treatment than non-MCVs, so I don't think periodically > > replanning is going to help very much. > > It won't help at all. The only reason for replanning is if something > about the schema or the statistics change, and we already have got > automatic cached-plan invalidation in both those cases. If you replan > simply because some time has elapsed, you'll just get exactly the > same plan. > > The only case that I think still has any merit is where you get a > significantly better plan with known parameter values than without. > The projected-cost threshold might be a reasonable approach for > attacking that, ie, if estimated cost of generic plan exceeds X > then take the time to build a custom plan instead. I'm not sure that > really will fix the problem, but it would be a very simple change to > make to see how much it helps people. Ideally we would do late binding (bind on the first supplied parameters, like we do for unnamed protocol prepared queries now), and then replan if the statistics for later parameters significantly differ from the ones used for the initial planning. -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Bruce Momjian <bruce@momjian.us> writes: > Can someone explain to me why we only do "delayed binding" for unnamed > prepared queries? It was a way of shoehorning in some driver control over the behavior without the protocol bump that would be involved in adding an actual option to Parse messages. regards, tom lane
> The only case that I think still has any merit is where you get a > significantly better plan with known parameter values than without. > The projected-cost threshold might be a reasonable approach for > attacking that, ie, if estimated cost of generic plan exceeds X > then take the time to build a custom plan instead. I'm not sure that > really will fix the problem, but it would be a very simple change to > make to see how much it helps people. > > regards, tom lane It will definitely help with partitioned tables. It's a very common case: raw data taken from hardware is stored in a single table first, and later we start to make partitions for each month/week/day. The feature could improve performance transparently to client apps. regards, Dmitry
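A minimal sketch of the partitioned-table case, using 8.4-style inheritance partitioning; the table names and date ranges are hypothetical, and constraint_exclusion is assumed to be at its default setting of 'partition':

  CREATE TABLE measurements (logdate date, value int);
  CREATE TABLE measurements_2010_01
      (CHECK (logdate >= '2010-01-01' AND logdate < '2010-02-01'))
      INHERITS (measurements);
  CREATE TABLE measurements_2010_02
      (CHECK (logdate >= '2010-02-01' AND logdate < '2010-03-01'))
      INHERITS (measurements);

  -- with a literal date, constraint exclusion prunes to one child table:
  EXPLAIN SELECT * FROM measurements WHERE logdate = '2010-02-15';

  -- with a parameter, the generic plan has to scan every partition:
  PREPARE recent_measurements(date) AS
      SELECT * FROM measurements WHERE logdate = $1;
  EXPLAIN EXECUTE recent_measurements('2010-02-15');

Re-planning with the bound value would let constraint exclusion kick in again, which is why a projected-cost threshold helps here: the all-partitions generic plan looks expensive.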
Hi Robert,

On Tue, Feb 9, 2010 at 17:43, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 9, 2010 at 7:08 AM, Jeroen Vermeulen <jtv@xs4all.nl> wrote:
> > = Projected-cost threshold =
> >
> > If a prepared statement takes parameters, and the generic plan has a high
> > projected cost, re-plan each EXECUTE individually with all its parameter
> > values bound. It may or may not help, but unless the planner is vastly
> > over-pessimistic, re-planning isn't going to dominate execution time for
> > these cases anyway.
>
> How high is high?

Perhaps this could be based on a (configurable?) ratio of observed planning time and projected execution time. I mean, if planning it the first time took 30 ms and projected execution time is 1 ms, then by all means NEVER re-plan. But if planning the first time took 1 ms and resulted in a projected execution time of 50 ms, then it's relatively cheap to re-plan every time (cost increase per execution is 1/50 = 2%), and the potential gains are much greater (taking a chunk out of 50 ms adds up quickly).

Cheers,
Bart
2010/2/11 Bart Samwel <bart@samwel.tk>: > Hi Robert, > > On Tue, Feb 9, 2010 at 17:43, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Tue, Feb 9, 2010 at 7:08 AM, Jeroen Vermeulen <jtv@xs4all.nl> wrote: >> > = Projected-cost threshold = >> > >> > If a prepared statement takes parameters, and the generic plan has a >> > high >> > projected cost, re-plan each EXECUTE individually with all its parameter >> > values bound. It may or may not help, but unless the planner is vastly >> > over-pessimistic, re-planning isn't going to dominate execution time for >> > these cases anyway. >> >> How high is high? > > Perhaps this could be based on a (configurable?) ratio of observed planning > time and projected execution time. I mean, if planning it the first time > took 30 ms and projected execution time is 1 ms, then by all means NEVER > re-plan. But if planning the first time took 1 ms and resulted in a > projected execution time of 50 ms, then it's relatively cheap to re-plan > every time (cost increase per execution is 1/50 = 2%), and the potential > gains are much greater (taking a chunk out of 50 ms adds up quickly). It could be a good idea. I don't belive to sophisticate methods. There can be a very simply solution. The could be a "limit" for price. More expensive queries can be replaned every time when the price will be over limit. Regards Pavel Stehule > > Cheers, > Bart >
On Thu, Feb 11, 2010 at 13:25, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> 2010/2/11 Bart Samwel <bart@samwel.tk>:
>> Perhaps this could be based on a (configurable?) ratio of observed planning
>> time and projected execution time. I mean, if planning it the first time
>> took 30 ms and projected execution time is 1 ms, then by all means NEVER
>> re-plan. But if planning the first time took 1 ms and resulted in a
>> projected execution time of 50 ms, then it's relatively cheap to re-plan
>> every time (cost increase per execution is 1/50 = 2%), and the potential
>> gains are much greater (taking a chunk out of 50 ms adds up quickly).
>
> It could be a good idea. I don't belive to sophisticate methods. There
> can be a very simply solution. The could be a "limit" for price. More
> expensive queries can be replaned every time when the price will be
> over limit.

I guess the required complexity depends on how variable planning costs are. If planning is typically <= 2 ms, then a hard limit on estimated price is useful and can be set as low as (the equivalent of) 15 ms. However, if planning costs can be 50 ms, then the lowest reasonable "fixed" limit is quite a bit larger than that -- and that does not solve the problem reported earlier in this thread, where a query takes 30 ms using a generic plan and 1 ms using a specialized plan.

Anyhow, I have no clue how much time the planner takes. Can anybody provide any statistics in that regard?

Cheers,
Bart
On Thu, Feb 11, 2010 at 7:39 AM, Bart Samwel <bart@samwel.tk> wrote: > On Thu, Feb 11, 2010 at 13:25, Pavel Stehule <pavel.stehule@gmail.com> > wrote: >> >> 2010/2/11 Bart Samwel <bart@samwel.tk>: >> > Perhaps this could be based on a (configurable?) ratio of observed >> > planning >> > time and projected execution time. I mean, if planning it the first time >> > took 30 ms and projected execution time is 1 ms, then by all means NEVER >> > re-plan. But if planning the first time took 1 ms and resulted in a >> > projected execution time of 50 ms, then it's relatively cheap to re-plan >> > every time (cost increase per execution is 1/50 = 2%), and the potential >> > gains are much greater (taking a chunk out of 50 ms adds up quickly). >> >> >> It could be a good idea. I don't belive to sophisticate methods. There >> can be a very simply solution. The could be a "limit" for price. More >> expensive queries can be replaned every time when the price will be >> over limit. > > I guess the required complexity depends on how variable planning costs are. > If planning is typically <= 2 ms, then a hard limit on estimated price is > useful and can be set as low as (the equivalent of) 15 ms. However, if > planning costs can be 50 ms, then the lowest reasonable "fixed" limit is > quite a bit larger than that -- and that does not solve the problem reported > earlier in this thread, where a query takes 30 ms using a generic plan and 1 > ms using a specialized plan. > > Anyhow, I have no clue how much time the planner takes. Can anybody provide > any statistics in that regard? It depends a great deal on the query, which is one of the things that makes implementing this rather challenging. ...Robert
On Thu, Feb 11, 2010 at 13:41, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Feb 11, 2010 at 7:39 AM, Bart Samwel <bart@samwel.tk> wrote:
>> Anyhow, I have no clue how much time the planner takes. Can anybody provide
>> any statistics in that regard?
>
> It depends a great deal on the query, which is one of the things that
> makes implementing this rather challenging.

But I guess you can probably expect it to be on the same order for the same query in generic form and with filled-in parameters? Because that's the underlying assumption of the "ratio" criterion -- that re-planning with filled-in parameters takes about as much time as the initial planning run took.

Cheers,
Bart
On Thu, Feb 11, 2010 at 7:48 AM, Bart Samwel <bart@samwel.tk> wrote: > On Thu, Feb 11, 2010 at 13:41, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Thu, Feb 11, 2010 at 7:39 AM, Bart Samwel <bart@samwel.tk> wrote: >> > Anyhow, I have no clue how much time the planner takes. Can anybody >> > provide >> > any statistics in that regard? >> >> It depends a great deal on the query, which is one of the things that >> makes implementing this rather challenging. > > But I guess you can probably expect it to be on the same order for the same > query in generic form and with filled-in parameters? I think so.... but I wouldn't bet the farm on it without testing. > Because that's the > underlying assumption of the "ratio" criterion -- that re-planning with > filled-in parameters takes about as much time as the initial planning run > took. We only want to replan when replanning is relatively cheap compared to execution, so the other assumption is that the planning-to-execution ratio is more or less constant. Whether that's sufficiently true to make the proposed system useful and reliable is not clear to me. ...Robert
Bart Samwel wrote: > Perhaps this could be based on a (configurable?) ratio of observed > planning time and projected execution time. I mean, if planning it the > first time took 30 ms and projected execution time is 1 ms, then by > all means NEVER re-plan. IMHO looking at ms is bad for this 'possible replan' decision. The only comparable numbers invariant to system load are the planner's costs (not in ms but unitless) and maybe the actual number of processed tuples, but never actual ms. Regards, Yeb Havinga
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Feb 11, 2010 at 7:48 AM, Bart Samwel <bart@samwel.tk> wrote: >> Because that's the >> underlying assumption of the "ratio" criterion -- that re-planning with >> filled-in parameters takes about as much time as the initial planning run >> took. > We only want to replan when replanning is relatively cheap compared to > execution, Well, no, consider the situation where planning takes 50 ms, the generic plan costs 100ms to execute, but a parameter-specific plan would take 1ms to execute. Planning is very expensive compared to execution but it's still a win to do it. The problem that we face is that we don't have any very good way to tell whether a fresh planning attempt is likely to yield a plan significantly better than the generic plan. I can think of some heuristics --- for example if the query contains LIKE with a parameterized pattern or a partitioned table --- but that doesn't seem like a particularly nice road to travel. A possible scheme is to try it and keep track of whether we ever actually do get a better plan. If, after N attempts, none of the custom plans were ever more than X% cheaper than the generic one, then give up and stop attempting to produce custom plans. Tuning the variables might be challenging though. regards, tom lane
2010/2/11 Tom Lane <tgl@sss.pgh.pa.us>: > Robert Haas <robertmhaas@gmail.com> writes: >> On Thu, Feb 11, 2010 at 7:48 AM, Bart Samwel <bart@samwel.tk> wrote: >>> Because that's the >>> underlying assumption of the "ratio" criterion -- that re-planning with >>> filled-in parameters takes about as much time as the initial planning run >>> took. > >> We only want to replan when replanning is relatively cheap compared to >> execution, > > Well, no, consider the situation where planning takes 50 ms, the generic > plan costs 100ms to execute, but a parameter-specific plan would take 1ms > to execute. Planning is very expensive compared to execution but it's > still a win to do it. > > The problem that we face is that we don't have any very good way to tell > whether a fresh planning attempt is likely to yield a plan significantly > better than the generic plan. I can think of some heuristics --- for > example if the query contains LIKE with a parameterized pattern or a > partitioned table --- but that doesn't seem like a particularly nice > road to travel. > > A possible scheme is to try it and keep track of whether we ever > actually do get a better plan. If, after N attempts, none of the custom > plans were ever more than X% cheaper than the generic one, then give up > and stop attempting to produce custom plans. Tuning the variables might > be challenging though. I afraid so every heuristic is bad. Problem is identification of bad generic plan. And nobody ensure, so non generic plan will be better than generic. Still I thing we need some way for lazy prepared statements - plan is generated everytime with known parameters. Other idea: some special debug/test mod, where pg store generic plan for every prepared statement, and still generate specific plan. When the prices are different, then pg produces a warning. This can be slower, but can identify problematic queries. It could be implemented as contrib module - some like autoexplain. regards Pavel > > regards, tom lane >
Pavel Stehule wrote: > > The problem that we face is that we don't have any very good way to tell > > whether a fresh planning attempt is likely to yield a plan significantly > > better than the generic plan. I can think of some heuristics --- for > > example if the query contains LIKE with a parameterized pattern or a > > partitioned table --- but that doesn't seem like a particularly nice > > road to travel. > > > > A possible scheme is to try it and keep track of whether we ever > > actually do get a better plan. If, after N attempts, none of the custom > > plans were ever more than X% cheaper than the generic one, then give up > > and stop attempting to produce custom plans. Tuning the variables might > > be challenging though. > > I afraid so every heuristic is bad. Problem is identification of bad > generic plan. And nobody ensure, so non generic plan will be better > than generic. Still I thing we need some way for lazy prepared > statements - plan is generated everytime with known parameters.

Yea, this opens a whole host of questions for me:

1. Why do we only do bind-level planning for anonymous wire-level queries?

2. I realize we did anonymous-only because that was the only way we had in the protocol to _signal_ bind-time planning, but didn't we think of this when we were implementing the wire-level protocol?

3. Do we have no place to add this cleanly without a protocol version bump?

4. Why don't we just always do planning at first bind time? When is that worse than using generic values?

5. Why have we not added an option for SQL-level prepare to do this?

6. When do our generic column costs perform significantly worse than having specific constants? I assume unique columns are fine with generic constants.

7. Why is there no option to do parameterized-queries which replan every time?

This just seems like an area that has been neglected, or maybe I am missing something and our current setup is acceptable. We have done a lot of work to generate acceptable optimizer statistics, but we are not using them for a significant part of our user base, particularly JDBC. We do have a TODO item, but it has gotten little attention:

    Allow finer control over the caching of prepared query plans

    Currently anonymous (un-named) queries prepared via the libpq API are planned at bind time using the supplied parameters --- allow SQL PREPARE to do the same. Also, allow control over replanning prepared queries either manually or automatically when statistics for execute parameters differ dramatically from those used during planning.

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Tom Lane wrote: > Well, no, consider the situation where planning takes 50 ms, the generic > plan costs 100ms to execute, but a parameter-specific plan would take 1ms > to execute. Planning is very expensive compared to execution but it's > still a win to do it. I think that's a fun and worthwhile problem. But my limited personal aim right now is a stopgap for pathological cases. I'd like to pick the low-hanging fruit; actually squeezing the fat out of prepared statements is a job I wouldn't get around to completing. Sorry for mixing metaphors. Here's what I like about the really slow plans. (Now why does that sound so strange?) We don't know if re-planning will help, but we do know that (1) it won't hurt much relative to execution time, so we really don't _care_; and (2) there is lots of potential for improvement, so catching just one execution that can be much faster might pay for all the extra time spent re-planning. Where do we draw the line between costly and pathological? I still like Bart's idea of a fixed ratio to planning time, because it reflects a self-tuning sense of proportion. Sure, planning time can vary a lot but we're talking about an order-of-magnitude difference, not an exact 19:21 optimum. We can be sloppy and still expect to win. AFAIC a statement could go to "re-planning mode" if the shortest execution time for the generic plan takes at least 10x longer than the longest planning time. That gives us a decent shot at finding statements where re-planning is a safe bet. A parameter that we or the user would have to tweak would just be a fragile approximation of that. > A possible scheme is to try it and keep track of whether we ever > actually do get a better plan. If, after N attempts, none of the custom > plans were ever more than X% cheaper than the generic one, then give up > and stop attempting to produce custom plans. Tuning the variables might > be challenging though. A simple stopgap implementation may also be a useful experimentation platform for refinements. It shouldn't be too complex to rip out when something better comes along. Jeroen
On Mon, Feb 15, 2010 at 2:11 PM, Bruce Momjian <bruce@momjian.us> wrote: > Pavel Stehule wrote: >> > The problem that we face is that we don't have any very good way to tell >> > whether a fresh planning attempt is likely to yield a plan significantly >> > better than the generic plan. I can think of some heuristics --- for >> > example if the query contains LIKE with a parameterized pattern or a >> > partitioned table --- but that doesn't seem like a particularly nice >> > road to travel. >> > >> > A possible scheme is to try it and keep track of whether we ever >> > actually do get a better plan. If, after N attempts, none of the custom >> > plans were ever more than X% cheaper than the generic one, then give up >> > and stop attempting to produce custom plans. Tuning the variables might >> > be challenging though. >> >> I afraid so every heuristic is bad. Problem is identification of bad >> generic plan. And nobody ensure, so non generic plan will be better >> than generic. Still I thing we need some way for lazy prepared >> statements - plan is generated everytime with known parameters. > > Yea, this opens a whole host of questions for me: > > 1. Why do we only do bind-level planning for anonymous wire-level queries? > > 2. I realize we did anonymous-only because that was the only way we had > in the protocol to _signal_ bind-time planning, but didn't we think of > this when we were implementing the wire-level protocol? > > 3. Do we have no place to add this cleanly without a protocol version > bump? > > 4. Why don't we just always do planning at first bind time? When is > that worse than using generic values? > > 5. Why have we not added an option for SQL-level prepare to do this? > > 6. When do our generic column costs perform significantly worse than having > specific constants? I assume unique columns are fine with generic > constants. > > 7. Why is there no option to do parameterized-queries which replan every > time? > > This just seems like an area that has been neglected, or maybe I am > missing something and our current setup is acceptable. No, our current setup is not acceptable, and your questions are all right on target. I have been hoping that someone would take an interest in this problem for years. An option to replan on every execution would be a very, very fine thing. IMHO, there should also be a way to signal to PL/pgsql that you want this behavior for a particular query, short of wrapping it using EXECUTE, which is clunky and also forces a re-parse on every execution. ...Robert
Robert Haas wrote: > > 7. Why is there no option to do parameterized-queries which replan every > > time? > > > > This just seems like an area that has been neglected, or maybe I am > > missing something and our current setup is acceptable. > > No, our current setup is not acceptable, and your questions are all > right on target. I have been hoping that someone would take an > interest in this problem for years. An option to replan on every > execution would be a very, very fine thing. IMHO, there should also > be a way to signal to PL/pgsql that you want this behavior for a > particular query, short of wrapping it using EXECUTE, which is clunky > and also forces a re-parse on every execution. I was hoping I was wrong. :-( -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
On Mon, Feb 15, 2010 at 7:51 PM, Jeroen Vermeulen <jtv@xs4all.nl> wrote: > AFAIC a statement could go to "re-planning mode" if the shortest execution > time for the generic plan takes at least 10x longer than the longest > planning time. That gives us a decent shot at finding statements where > re-planning is a safe bet. A parameter that we or the user would have to > tweak would just be a fragile approximation of that. So in principle I agree with this idea. I think a conservative value for the constant would be more like 100x though. If I told you we had an easy way to speed all your queries up by 10% by caching queries but were just choosing not to then I think you would be unhappy. Whereas if I told you we were spending 1% of the run-time planning queries I think most people would not be concerned. There's a second problem though. We don't actually know how long any given query is going to take to plan or execute. We could just remember how long it took to plan and execute last time or how long it took to plan last time and the average execution time since we cached that plan. Perhaps we should track the stddev of the execution plan, or the max execution time of the plan? Ie there are still unanswered questions about the precise heuristic to use but I bet we can come up with something reasonable. -- greg
On Mon, Feb 15, 2010 at 7:11 PM, Bruce Momjian <bruce@momjian.us> wrote: > 1. Why do we only do bind-level planning for anonymous wire-level queries? > > 2. I realize we did anonymous-only because that was the only way we had > in the protocol to _signal_ bind-time planning, but didn't we think of > this when we were implementing the wire-level protocol? Is there any other difference between anonymous and non-anonymous queries? If this is the only major difference do we need to separate them? Is there any particular reason a driver would need two prepared queries if they're both just going to be planned at execution time? Incidentally, can you have two active anonymous portals at the same time? > 4. Why don't we just always do planning at first bind time? When is > that worse than using generic values? > > 6. When do our generic columns costs significantly worse than having > specific constants? I assume unique columns are fine with generic > constants. Well using parameters will always have a better chance of producing a better plan but that's not the only factor people consider important. For a lot of users *predictability* is more important than absolute performance. If my web server could run 10% faster that might be nice but if it's capable of keeping up at its current speed it's not terribly important. But if it means it crashes once a day because some particular combination of parameters causes a bad plan to be used for a specific user that's a bad trade-off. -- greg
> > Well using parameters will always have a better chance of producing a > better plan but that's not the only factor people consider important. > For a lot of users *predictability* is more important than absolute > performance. If my web server could run 10% faster that might be nice > but if it's capable of keeping up at its current speed it's not > terribly important. But if it means it crashes once a day because some > particular combination of parameters causes a bad plan to be used for > a specific user that's a bad trade-off. > +1 Pavel > -- > greg >
Greg Stark wrote: > On Mon, Feb 15, 2010 at 7:11 PM, Bruce Momjian <bruce@momjian.us> wrote: > > 1. Why do we only do bind-level planning for anonymous wire-level queries? > > > > 2. I realize we did anonymous-only because that was the only way we had > > in the protocol to _signal_ bind-time planning, but didn't we think of > > this when we were implementing the wire-level protocol? > > Is there any other difference between anonymous and non-anonymous > queries? If this is the only major difference do we need to separate > them? Is there any particular reason a driver would need two prepared > queries if they're both just going to be planned at execution time? Well, anonymous prepared queries are replanned for _every_ bind, so I don't see a huge value in allowing multiple unnamed queries, except you have to re-send the old query to prepare if you need to reuse it. In fact, this behavior was not totally clear so I updated the documentation a little with the attached patch. > Incidentally, can you have two active anonymous portals at the same time? No, the first one is deleted when the second is created, i.e., our docs have: An unnamed prepared statement lasts only until the next Parse statement specifying the unnamed statement as destination is issued. (Note that a simple Query message also destroys the unnamed statement.) -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. + Index: doc/src/sgml/protocol.sgml =================================================================== RCS file: /cvsroot/pgsql/doc/src/sgml/protocol.sgml,v retrieving revision 1.78 diff -c -c -r1.78 protocol.sgml *** doc/src/sgml/protocol.sgml 3 Feb 2010 09:47:19 -0000 1.78 --- doc/src/sgml/protocol.sgml 16 Feb 2010 20:11:41 -0000 *************** *** 737,745 **** <para> The unnamed prepared statement is likewise planned during Parse processing if the Parse message defines no parameters. But if there are parameters, ! query planning occurs during Bind processing instead. This allows the ! planner to make use of the actual values of the parameters provided in ! the Bind message when planning the query. </para> <note> --- 737,745 ---- <para> The unnamed prepared statement is likewise planned during Parse processing if the Parse message defines no parameters. But if there are parameters, ! query planning occurs every time Bind parameters are supplied. This allows the ! planner to make use of the actual values of the parameters provided by ! each Bind message, rather than use generic estimates. </para> <note> Index: src/backend/tcop/postgres.c =================================================================== RCS file: /cvsroot/pgsql/src/backend/tcop/postgres.c,v retrieving revision 1.588 diff -c -c -r1.588 postgres.c *** src/backend/tcop/postgres.c 13 Feb 2010 01:32:19 -0000 1.588 --- src/backend/tcop/postgres.c 16 Feb 2010 20:11:45 -0000 *************** *** 1469,1475 **** } else { ! /* special-case the unnamed statement */ psrc = unnamed_stmt_psrc; if (!psrc) ereport(ERROR, --- 1469,1475 ---- } else { ! /* Unnamed statements are re-prepared for every bind */ psrc = unnamed_stmt_psrc; if (!psrc) ereport(ERROR,
On Tue, Feb 16, 2010 at 8:17 PM, Bruce Momjian <bruce@momjian.us> wrote: >> Incidentally, can you have two active anonymous portals at the same time? > > No, the first one is deleted when the second is created, i.e., our docs > have: > > An unnamed prepared statement lasts only until the next Parse statement > specifying the unnamed statement as destination is issued. (Note that a > simple Query message also destroys the unnamed statement.) I had to reread it myself but I think you've confused portals with prepared statements. You can have an unnamed prepared statement -- which is planned at Bind time -- but execute it with a named portal allowing you to keep it active while you execute a second query. If you have code like: $sth = $dbh->execute('select * from tab'); while ($sth->fetch) { $dbh->execute('insert into tab2'); } The first execute needs to prepare and execute the first statement. It doesn't need to keep a named prepared statement handle around because it'll never be able to re-execute it anyways. But the resulting portal does need to be a named portal because otherwise the driver will be up a creek when it comes to the second execute. The case that's fairly awkward at the moment -- but still not impossible to handle -- is when the driver sees a prepare and bind but no execute for a while. Coding like: $sth1 = $dbh->prepare('select * from tab where id = ?'); $sth2 = $dbh->prepare('select * from tab2 where id = ?'); $sth1->bind(1); $sth2->bind(2); $sth1->execute(); $sth2->execute(); In that case the driver is kind of stuck. It can't use the unnamed prepared statement when the prepare() calls are done. If it wants the plan-at-bind semantics then It would have to create a "fake" prepared statement which it doesn't actually send the prepare message for until the bind arrives. -- greg
On Tue, 16 Feb 2010 15:22:00 +0100, Greg Stark <gsstark@mit.edu> wrote: > There's a second problem though. We don't actually know how long any > given query is going to take to plan or execute. We could just > remember how long it took to plan and execute last time or how long it > took to plan last time and the average execution time since we cached > that plan. Perhaps we should track the stddev of the execution plan, > or the max execution time of the plan? Ie there are still unanswered > questions about the precise heuristic to use but I bet we can come up > with something reasonable. This could be an occasion to implement plan caching... Web 2.0 = AJAX means less need for heavy webpage reloads with (usually) lots of queries, and more small simple queries like selects returning 1 or a few rows every time the user clicks on something. See benchmark here : (PG 8.4.2, MYSQL 5.1.37) http://purity.bobfuck.net/posts/postgres/2010-02-Prep/ If prepared statements are used, MySQL is not faster for "small, simple selects"... However, when not using prepared statements, most of the postmaster CPU time is spent parsing & planning. Problem with prepared statements is they're a chore to use in web apps, especially PHP, since after grabbing a connection from the pool, you don't know if it has prepared plans in it or not. The postmaster could keep a hash of already prepared plans, using the $-parameterized query as a hash key, and when it receives a parse+bind message, look up in this cache and fetch plans for the query, avoiding planning entirely. This could be done by the connection pooler too, but it doesn't have the information to decide whether it's wise to cache a plan or not. Of course all the subtlety is in determining if the plan is reusable with other parameters...

- after planning and executing the query, only cache it if the plan time is a significant part of the query time (as said previously).
- only simple queries should be automatically cached like this
- perhaps some measure of "plan volatility" ?

For the examples I give in the link above, it's quite easy at least in 2 of the cases : searching UNIQUE columns can't return more than 1 row, so volatility is zero. It only depends on the table size.
"Pierre C" <lists@peufeu.com> writes: > Problem with prepared statements is they're a chore to use in web apps, > especially PHP, since after grabbing a connection from the pool, you don't > know if it has prepared plans in it or not. Have you met preprepare yet? http://preprepare.projects.postgresql.org/README.html http://packages.debian.org/source/sid/preprepare Regards, -- dim
On Thu, 18 Feb 2010 16:09:42 +0100, Dimitri Fontaine <dfontaine@hi-media.com> wrote: > "Pierre C" <lists@peufeu.com> writes: >> Problem with prepared statements is they're a chore to use in web apps, >> especially PHP, since after grabbing a connection from the pool, you >> don't >> know if it has prepared plans in it or not. > > Have you met preprepare yet? > > http://preprepare.projects.postgresql.org/README.html > http://packages.debian.org/source/sid/preprepare > > Regards, Hey, this thing is nice. How hard would it be to put a hook in pg so that, instead of raising an error and cancelling the txn when EXECUTing a statement that is not prepared, it would call a user function (of the user's choice) which would, if possible, prepare said statement, or if not, raise the error?
"Pierre C" <lists@peufeu.com> writes: > On Thu, 18 Feb 2010 16:09:42 +0100, Dimitri Fontaine > <dfontaine@hi-media.com> wrote: >> http://preprepare.projects.postgresql.org/README.html >> http://packages.debian.org/source/sid/preprepare > > Hey, this thing is nice. Thanks :) > How hard would it be to put a hook in pg so that, instead of raising an > error and cancelling the txn when EXECUTing a statement that is not > prepared, it would call a user function (of the user's choice) which would, > if possible, prepare said statement, or if not, raise the error ? What about catching the error in the application and INSERT'ing into the current preprepare.relation table? The aim would be to do that in dev or in pre-prod environments, then copy the table content in production. Notice you can also SET preprepare.relation TO 'any.table_or_view', so that you can have more than one table for different applications or parts. I wonder how far you'd get using a VIEW instead and calling a user function there, but arranging preprepare to support a function instead is indeed possible. I'll try and see about it this evening, maybe tomorrow afternoon. Regards, -- dim
> What about catching the error in the application and INSERT'ing into the > current preprepare.relation table? The aim would be to do that in dev or > in pre-prod environments, then copy the table content in production. Yep, but it's a bit awkward and time-consuming, and not quite suited to ORM-generated requests since you've got to generate all the plan names, when the SQL query itself would be the most convenient "unique identifier"... A cool hack would be something like this:

    pg_execute( "SELECT ...", arguments... )

By inserting a hook which calls a user-specified function on a non-existing plan instead of raising an error, this could work. However, this wouldn't work as-is since the plan name must be <= NAMEDATALEN, but you get the idea ;)
On Feb 18, 2010, at 2:19 PM, Pierre C wrote: > >> What about catching the error in the application and INSERT'ing >> into the >> current preprepare.relation table? The aim would be to do that in >> dev or >> in pre-prod environments, then copy the table content in production. > > Yep, but it's a bit awkward and time-consuming, and not quite suited > to ORM-generated requests since you got to generate all the plan > names, when the SQL query itself would be the most convenient > "unique identifier"... > > A cool hack would be something like that : > > pg_execute( "SELECT ...", arguments... ) > > By inserting a hook which calls a user-specified function on non- > existing plan instead of raising an error, this could work. > However, this wouldn't work as-is since the plan name must be <= > NAMEDATALEN, but you get the idea ;) How about the SHA1 hash of the query? Hey, it works for git... :-) Regards, David -- David Christensen End Point Corporation david@endpoint.com
On Thu, Feb 18, 2010 at 08:31:05PM -0600, David Christensen wrote: > > On Feb 18, 2010, at 2:19 PM, Pierre C wrote: > >> >>> What about catching the error in the application and INSERT'ing into the >>> current preprepare.relation table? The aim would be to do that in dev or >>> in pre-prod environments, then copy the table content in production. >> >> Yep, but it's a bit awkward and time-consuming, and not quite suited to >> ORM-generated requests since you got to generate all the plan names, when >> the SQL query itself would be the most convenient "unique identifier"... >> >> A cool hack would be something like that : >> >> pg_execute( "SELECT ...", arguments... ) >> >> By inserting a hook which calls a user-specified function on non-existing >> plan instead of raising an error, this could work. >> However, this wouldn't work as-is since the plan name must be <= >> NAMEDATALEN, but you get the idea ;) > > How about the SHA1 hash of the query? Hey, it works for git... :-) > > Regards, > > David > -- > David Christensen > End Point Corporation > david@endpoint.com > > Hi David, Not to beat our own drum, but we already include a hashing function that can be used for this purpose and is much faster than SHA-1. We would want to use all 64 bits for this use instead of just the 32 bits we currently use for the internal DB hashing. Here is an article comparing the Jenkins' Hash (the one we use) and SHA-1: http://home.comcast.net/~bretm/hash/ Regards, Ken
"Pierre C" <lists@peufeu.com> writes: > Yep, but it's a bit awkward and time-consuming, and not quite suited to > ORM-generated requests since you got to generate all the plan names, when > the SQL query itself would be the most convenient "unique > identifier"... The SHA1 proposal seems better to me. Now you still have to fill the table with statements and names, and I stand on the opinion that having a development-only layer in the ORM to do just that is better. That's supposing you don't generate so many different normalized queries that you can't generate them all in dev or from the unit tests… > A cool hack would be something like that : > > pg_execute( "SELECT ...", arguments... ) > > By inserting a hook which calls a user-specified function on non-existing > plan instead of raising an error, this could work. This I'm not helping with. It's definitely not the same skills and time requirements as offering a simple function based alternative to the table lookup in preprepare, in the "load them all" phase. > However, this wouldn't work as-is since the plan name must be <= > NAMEDATALEN, but you get the idea ;) SHA1 or even MD5 would do, the later having the advantage of being already supported by PostgreSQL core. Regards, -- dim
Greg Stark wrote: > So in principle I agree with this idea. I think a conservative value > for the constant would be more like 100x though. If I told you we had > an easy way to speed all your queries up by 10% by caching queries but > were just choosing not to then I think you would be unhappy. Whereas > if I told you we were spending 1% of the run-time planning queries I > think most people would not be concerned. Makes sense. The main thing is that there be an order-of-magnitude difference to hide the potential extra planning cost in. If that includes a switched SSL connection, 10% of execution is probably reasonable because it's a much smaller portion of overall response time--but on a local connection it's a liability. > There's a second problem though. We don't actually know how long any > given query is going to take to plan or execute. We could just > remember how long it took to plan and execute last time or how long it > took to plan last time and the average execution time since we cached > that plan. Perhaps we should track the stddev of the execution plan, > or the max execution time of the plan? Ie there are still unanswered > questions about the precise heuristic to use but I bet we can come up > with something reasonable. I may have cut this out of my original email for brevity... my impression is that the planner's estimate is likely to err on the side of scalability, not best-case response time; and that this is more likely to happen than an optimistic plan going bad at runtime. If that is true, then the cost estimate is at least a useful predictor of statements that deserve re-planning. If it's not true (or for cases where it's not true), actual execution time would be a useful back-up at the cost of an occasional slow execution. Yeb points out a devil in the details though: the cost estimate is unitless. We'd have to have some orders-of-magnitude notion of how the estimates fit into the picture of real performance. Jeroen
On Wed, Feb 17, 2010 at 5:52 PM, Jeroen Vermeulen <jtv@xs4all.nl> wrote: > I may have cut this out of my original email for brevity... my impression is > that the planner's estimate is likely to err on the side of scalability, not > best-case response time; and that this is more likely to happen than an > optimistic plan going bad at runtime. Interestingly, most of the mistakes that I have seen are in the opposite direction. > Yeb points out a devil in the details though: the cost estimate is unitless. > We'd have to have some orders-of-magnitude notion of how the estimates fit > into the picture of real performance. I'm not sure to what extent you can assume that the cost is proportional to the execution time. I seem to remember someone (Peter?) arguing that they're not related by any fixed ratio, partly because things like page costs vs. CPU costs didn't match physical reality, and that in fact some attempts to gather empirically better values for things like random_page_cost and seq_page_cost actually ended up making the plans worse rather than better. It would be nice to see some research in this area... ...Robert
Robert Haas wrote: > On Wed, Feb 17, 2010 at 5:52 PM, Jeroen Vermeulen <jtv@xs4all.nl> wrote: >> I may have cut this out of my original email for brevity... my impression is >> that the planner's estimate is likely to err on the side of scalability, not >> best-case response time; and that this is more likely to happen than an >> optimistic plan going bad at runtime. > > Interestingly, most of the mistakes that I have seen are in the > opposite direction. I guess there's not much we can do about those, except decide after running that it's worth optimizing for specific values. >> Yeb points out a devil in the details though: the cost estimate is unitless. >> We'd have to have some orders-of-magnitude notion of how the estimates fit >> into the picture of real performance. > > I'm not sure to what extent you can assume that the cost is > proportional to the execution time. I seem to remember someone > (Peter?) arguing that they're not related by any fixed ratio, partly > because things like page costs vs. cpu costs didn't match physical > reality, and that in fact some attempts to gather better empirically > better values for things like random_page_cost and seq_page_cost > actually ended up making the plans worse rather than better. It would > be nice to see some research in this area... Getting representative workloads and machine configurations may make that hard. :/ But all we really want is a check for really obscene costs, as an extra stopgap so we don't have to wait for the thing to execute before we decide it's too costly. Surely there must be some line we can draw. Jeroen
On Thu, Feb 25, 2010 at 9:48 PM, Jeroen Vermeulen <jtv@xs4all.nl> wrote: > Robert Haas wrote: >> >> On Wed, Feb 17, 2010 at 5:52 PM, Jeroen Vermeulen <jtv@xs4all.nl> wrote: >>> >>> I may have cut this out of my original email for brevity... my impression >>> is >>> that the planner's estimate is likely to err on the side of scalability, >>> not >>> best-case response time; and that this is more likely to happen than an >>> optimistic plan going bad at runtime. >> >> Interestingly, most of the mistakes that I have seen are in the >> opposite direction. > > I guess there's not much we can do about those, except decide after running > that it's worth optimizing for specific values. > > >>> Yeb points out a devil in the details though: the cost estimate is >>> unitless. >>> We'd have to have some orders-of-magnitude notion of how the estimates >>> fit >>> into the picture of real performance. >> >> I'm not sure to what extent you can assume that the cost is >> proportional to the execution time. I seem to remember someone >> (Peter?) arguing that they're not related by any fixed ratio, partly >> because things like page costs vs. cpu costs didn't match physical >> reality, and that in fact some attempts to gather better empirically >> better values for things like random_page_cost and seq_page_cost >> actually ended up making the plans worse rather than better. It would >> be nice to see some research in this area... > > Getting representative workloads and machine configurations may make that > hard. :/ > > But all we really want is a check for really obscene costs, as an extra > stopgap so we don't have to wait for the thing to execute before we decide > it's too costly. Surely there must be some line we can draw. I actually think there isn't any clean line. Obscene is in the eye of the beholder. Frankly, I think this discussion is getting off into the weeds. It would be nice, perhaps, to have a feature that will detect when the generic plan is the suxxor and attempt to find a better one, but that's really, really hard for a whole bunch of reasons. Bruce's suggestion that we should provide some user control over whether we plan at bind time or execute time seems likely to be (1) considerably simpler to implement, (2) considerably easier to get consensus on, and (3) capable of giving 90% of the benefit for an only slightly higher inconvenience factor. ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > I actually think there isn't any clean line. Obscene is in the eye of > the beholder. Frankly, I think this discussion is getting off into > the weeds. It would be nice, perhaps, to have a feature that will > detect when the generic plan is the suxxor and attempt to find a > better one, but that's really, really hard for a whole bunch of > reasons. Bruce's suggestion that we should provide some user control > over whether we plan at bind time or execute time seems likely to be > (1) considerably simpler to implement, (2) considerably easier to get > consensus on, and (3) capable of giving 90% of the benefit for an only > higher inconvenience factor. It's not going to be easier to implement. Yeah, it would be easy to provide a global switch via a GUC setting, but that's not going to be helpful, because this is the sort of thing that really needs to be managed per-query. Almost any nontrivial application is going to have some queries that really need the custom plan and many that don't. If people just turn the GUC on we might as well throw away the plan caching mechanism altogether. But putting support for a per-query level of control into the protocol (and then every client library) as well as every PL is going to be painful to implement, and even more painful to use. I still like the idea of automatically replanning with the known parameter values, and noting whether the result plan was estimated to be noticeably cheaper than the generic plan, and giving up on generating custom plans if we didn't observe any such win over N tries. regards, tom lane
On Thu, Feb 25, 2010 at 10:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> I actually think there isn't any clean line. Obscene is in the eye of >> the beholder. Frankly, I think this discussion is getting off into >> the weeds. It would be nice, perhaps, to have a feature that will >> detect when the generic plan is the suxxor and attempt to find a >> better one, but that's really, really hard for a whole bunch of >> reasons. Bruce's suggestion that we should provide some user control >> over whether we plan at bind time or execute time seems likely to be >> (1) considerably simpler to implement, (2) considerably easier to get >> consensus on, and (3) capable of giving 90% of the benefit for an only >> higher inconvenience factor. > > It's not going to be easier to implement. Yeah, it would be easy to > provide a global switch via a GUC setting, but that's not going to be > helpful, because this is the sort of thing that really needs to be > managed per-query. Almost any nontrivial application is going to have > some queries that really need the custom plan and many that don't. > If people just turn the GUC on we might as well throw away the plan > caching mechanism altogether. I agree. A GUC is a really bad idea. > But putting support for a per-query level > of control into the protocol (and then every client library) as well as > every PL is going to be painful to implement, and even more painful to > use. I suppose I should have learned by now not to argue with you over technical points, but I don't see why this should be painful. I mean, it'll be a lot of work and it'll in the end touch a lot of different parts of the code, but work != pain, and I don't see any reason why the problem can't be attacked incrementally. I'm also deeply unconvinced that any other solution will be as satisfactory. > I still like the idea of automatically replanning with the known > parameter values, and noting whether the result plan was estimated to be > noticeably cheaper than the generic plan, and giving up on generating > custom plans if we didn't observe any such win over N tries. Isn't part of the problem here precisely that the cost estimates for the generic plan might not be too accurate? The only instances of this problem I've run across are the ones where MCVs need a different treatment, and the problem isn't necessarily that the new estimate is cheaper so much as that the old estimate isn't going to turn out as predicted. Also, there's no guarantee that the distribution of values tried will be random - there's the case where non-MCVs are tried for the first N times and then an MCV is tried on try N+1. ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > On Thu, Feb 25, 2010 at 10:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I still like the idea of automatically replanning with the known >> parameter values, and noting whether the result plan was estimated to be >> noticeably cheaper than the generic plan, and giving up on generating >> custom plans if we didn't observe any such win over N tries. > Isn't part of the problem here precisely that the cost estimates for > the generic plan might not be too accurate? No, the estimates for the generic plan are typically fine *in themselves*; they only look bad when you compare them to what you can do with knowledge of specific parameter values. An example is that the default selectivity estimate for a range query (WHERE x > something AND x < somethingelse) is 0.005. In a large number of real cases, the actual selectivity is way smaller, and you can determine that if you know the actual comparison constants. But it's tough to argue for decreasing the default guess --- it's already small enough that you could get screwed badly in the other direction if you queried a wide range. There may be some cases where the generic plan is wrongly estimated to be cheaper than a custom plan that's actually better, but I haven't seen many. If that were happening a lot then people would be reporting that the advice to force a replan via EXECUTE or whatever doesn't help. I don't think that there is any body of evidence at all that would justify undertaking extremely expensive development of an extremely painful-to-use feature to deal with that type of case. > Also, there's no guarantee that the distribution of values > tried will be random - there's the case where non-MCVs are tried for > the first N times and then a non-MCV is tried on try N+1. Sure, there are always going to be cases where you lose. Pushing the responsibility onto the user doesn't really fix that though. It's not apparent to me that users are in that much better position than we are to determine when a custom plan is helpful. BTW, if it wasn't clear, I would be in favor of allowing the cutoff N to be adjustable, as well as the cost ratio that's considered to constitute a win. So there would be some wiggle room to deal with that type of situation. regards, tom lane
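A small illustration of the range-query case (table and column names invented; note that EXPLAIN EXECUTE shows the generic plan here): the prepared form has to fall back on the default range selectivity, while the literal form can consult the column histogram for the actual constants, so the row estimates, and possibly the chosen plan, differ.

    PREPARE rq(int, int) AS
        SELECT * FROM measurements WHERE val > $1 AND val < $2;
    EXPLAIN EXECUTE rq(100, 110);       -- generic plan: default selectivity for the range
    EXPLAIN SELECT * FROM measurements
             WHERE val > 100 AND val < 110;  -- estimate taken from the actual histogram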
On Thu, Feb 25, 2010 at 20:40, Tom Lane <tgl@sss.pgh.pa.us> wrote: > It's not going to be easier to implement. Yeah, it would be easy to > provide a global switch via a GUC setting, but that's not going to be > helpful, because this is the sort of thing that really needs to be > managed per-query. Almost any nontrivial application is going to have > some queries that really need the custom plan and many that don't. > If people just turn the GUC on we might as well throw away the plan > caching mechanism altogether. But putting support for a per-query level > of control into the protocol (and then every client library) as well as > every PL is going to be painful to implement, and even more painful to > use. Not to mention you can already do this more or less client side with a nice driver. For example with DBD::Pg I can say:

    $sth = $dbh->prepare('select * from foo where x = ?',
                         {'pg_server_prepare' => 1});

to get a prepared plan (it is also the default). If for a particular query I know that I will get a better plan without prepare, I can just change that 1 to a 0. Or I can set it globally via:

    $dbh->{'pg_server_prepare'} = 0;

In other words I'm not quite sure what this would buy us.
On Thu, Feb 25, 2010 at 21:28, Alex Hunsaker <badalex@gmail.com> wrote: > Not to mention you can already do this more or less client side with a > nice driver. > [ uninformed noise ... ] I did seem to miss the part where everyone thinks this is a crock... But I don't remember seeing numbers on parse time or how much bandwidth this would potentially save. People seem to think it would be a big savings for just those 2 reasons? Or did I miss some other benefit?
Alex Hunsaker <badalex@gmail.com> writes: > I did seem to miss the part where everyone thinks this is a crock... > But I don't remember seeing numbers on parse time or how much > bandwidth this would potentially save. People seem to think it would > be a big savings for just those 2 reasons? Or did I miss some other > benefit? Uh, no, this isn't about saving either parse time or bandwidth. The discussion is about when to expend more planning time in hopes of getting better plans. regards, tom lane
On Thu, Feb 25, 2010 at 22:11, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alex Hunsaker <badalex@gmail.com> writes: > Uh, no, this isn't about saving either parse time or bandwidth. > The discussion is about when to expend more planning time in hopes > of getting better plans. This is what I'm tripping over: > > Bruce's suggestion that we should provide some user control over whether we plan at bind time or execute time Let me see if I can sum up what I was trying to say:

Prepared plans + bind plan (what we have now):
  Use: when you have a query that takes a long time to plan
  Problems: if you use parameters you might not get a good plan
  Solution: if you have stable parameters don't pass them as such, inline them
  Better: if we could figure out how, we could make better plans on the fly and use them

[ aka if you have a good driver you can easily control if it's a prepared statement or not without changing how you quote or inline your SQL ]

Prepared plans + exec plan (new guc/ protocol thing):
  Use: not quite sure
  Problems: slow because it would replan every time
  Solutions: use a prepared plan with the appropriate things not parametrized...?

[ aka we already have this, it's called don't use a prepared statement ]

What's the benefit of prepare + exec plan vs just inlining? What do you save? Some parse time? Some bandwidth? Yeah, if it's a smart knob that does things like "Oh, I see I have an MCV for that param, I'll give you a better plan". OK, but the knob no longer means plan at execute time. It's more in the realm of what Tom is suggesting IMHO. Anyway I feel like I'm probably just violently agreeing.
My preference is to deal with the specific value vs generic value issue. For this issue, it can affect performance even if PREPARE/EXECUTE is executed exactly once. In the last case I saw, a certain query was executing once every second, and with a specific value it would take < 1 ms, and with a generic value it would take > 50 ms. That's 5% system load for one CPU core to do nothing. After analysis, it was clearly a "common value" vs "not common value" problem. For this particular table, it stored an integer, but only used two values across something like 100k rows. The query was for a third value that did not exist. The difference was a sequential scan vs an index lookup. I do not know whether the application was doing PREPARE/EXECUTE each time, or whether it was doing PREPARE once in advance and then EXECUTE each time after that, but I don't think it matters, either, as I think both cases deserve attention, and the problem is the same in both cases. Even one generic plan run costs 50+ times the cost of both planning and execution. Re-planning a generic plan with another generic plan may generate zero benefit, with a measurable cost. More on this after... All the points about ms seem invalid to me. There are many reasons why ms could increase, and many of them have nothing to do with plan efficiency. Again, re-planning due to a high ms, or a high ratio of ms, does not indicate that re-planning will improve the success of the plan. The planning process does not measure ms or predict ms. My idea of an optimal system is as follows:

1) Prepare gathers and caches data about the tables involved in the query, including column statistics that are likely to be required during the planning process, but prepare does not run the planning process.

2) Execute runs the planning process re-using data cached by prepare, and then executes the plan.

3) Advanced: Execute may cache the selected plan for re-use only if it can identify a set of criteria that would allow the selected plan to be tested and invalidated if the parameter nature has changed such that a re-planning would likely choose another plan. Execute may cache multiple plans against a prepared statement, provided that each cached plan identifies invalidation criteria.

4) Even more Advanced: Prepare may identify the elements of the plan that will always be the same, no matter what parameter is specified, and cache these results for substitution into the planning phase when execute is run. (Effectively lifting the planning from execute to prepare, but only where it makes obvious [= cheap to detect] sense)

This treats the whole statement planning and execution as a pipeline, lengthening the pipeline, and adjusting some of the pipeline elements from prepare to execute. It has the benefit of having fast prepare/execute whether execute is invoked only once or many times. The effect is that all statements are specifically planned, but specific plans are re-used wherever possible. To support the case of changing data, I think the analyze process should be able to force invalidation of cached plans, and force the cached column statistics for prepared statements to be invalidated and re-queried on demand, or push new statistics directly into the prepared statements. It makes no sense (to me) to re-plan for the same parameters until an analyze is done, so this tells me that analyze is the event that should cause the re-plan to occur.
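A rough reproduction of the scenario just described (all names invented; the exact plan choice still depends on statistics and cost settings): an integer column holding only two distinct values across ~100k rows, queried for a third value that does not exist.

    CREATE TABLE events (id serial PRIMARY KEY, status integer);
    INSERT INTO events (status)
         SELECT CASE WHEN g % 2 = 0 THEN 1 ELSE 2 END
           FROM generate_series(1, 100000) g;
    CREATE INDEX events_status_idx ON events (status);
    ANALYZE events;

    PREPARE find_by_status(integer) AS
        SELECT * FROM events WHERE status = $1;
    EXPLAIN EXECUTE find_by_status(3);               -- generic plan: typically a seq scan
    EXPLAIN SELECT * FROM events WHERE status = 3;   -- value-specific: index scan, ~0 rows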
I think anything less than the above will increase the performance of some queries while degrading the performance of other queries. It might be possible to guess which queries are more valuable to people than others, and hard code solutions for these specific queries, but hard coding solutions will probably always be a "lowest hanging fruit" solution. After writing this, I'm pretty sure that implementation of the above into PostgreSQL would be difficult, and it could be a valid concern that the investment is not worth the benefit at this time. It's a tough problem. My $0.01 CDN. :-) Cheers, mark
Mark Mielke wrote: > Re-planning a generic plan with another generic plan may generate zero > benefit, with a measurable cost. More on this after... Nobody's talking about doing that any more. I proposed it initially because I didn't know about changes that made it unnecessary. > All the points about ms seem invalid to me. There are many reason why ms > could increase, and many of them have nothing to do with plan > efficiency. Again, re-planning due to a high ms, or a high ratio of ms, > does not indicate that re-planning will improve the success of the plan. > The planning process does not measure ms or predict ms. That's true, but missing some very basic points about the idea: one, if we can tell that a query is going to be expensive, then the cost of re-planning it is marginal. Two, if we can tell that a query is going to be expensive, then we stand a lot to gain if re-planning turns out to be useful. It follows that we can afford to re-plan on the off-chance, without anything more than a vague orders-of-magnitude idea of what "expensive" means. What Tom said validates a big assumption I've been making: that we do in fact have a decent shot at telling in advance that a query is going to be expensive. Which means we have a decent shot at stopping your 100ms query from taking seconds just because you prepared it and are missing out on that tiny partial index. That would be worth the extra planning time at a 1% hit rate, and there's not much downside if we don't reach that. > My idea of an optimal system is as follows: > > 1) Prepare gathers and caches data about the tables involved in the > query, including column statistics that are likely to be required during > the planning process, but prepare does not running the planning process. It sounds to me like you're in the process of inventing another planning process. Developer time aside, how much CPU time can you afford to throw at this? I don't see any reason to argue over what would be optimal when so much information is still missing. It just makes the problem look harder than it is. To me, our best shot at getting something useful is to stay simple and defensive. After that, if there is still a need, we'll have code to help us gather more data and figure out how to make it better. Nothing wrong with the lowest-hanging fruit. Jeroen
On 02/26/2010 05:20 AM, Jeroen Vermeulen wrote: > Mark Mielke wrote: > >> All the points about ms seem invalid to me. There are many reason why >> ms could increase, and many of them have nothing to do with plan >> efficiency. Again, re-planning due to a high ms, or a high ratio of >> ms, does not indicate that re-planning will improve the success of >> the plan. The planning process does not measure ms or predict ms. > > That's true, but missing some very basic points about the idea: one, > if we can tell that a query is going to be expensive, then the cost of > re-planning it is marginal. Two, if we can tell that a query is going > to be expensive, then we stand a lot to gain if re-planning turns out > to be useful. It follows that we can afford to re-plan on the > off-chance, without anything more than a vague orders-of-magnitude > idea of what "expensive" means. > > What Tom said validates a big assumption I've been making: that we do > in fact have a decent shot at telling in advance that a query is going > to be expensive. Which means we have a decent shot at stopping your > 100ms query from taking seconds just because you prepared it and are > missing out on that tiny partial index. That would be worth the extra > planning time at a 1% hit rate, and there's not much downside if we > don't reach that. You trimmed most of my concerns. :-) Problems: 1) If I do a PREPARE/EXECUTE, the above lengthens the process from 1 generic planning plus 1 generic plan execute to 1 generic planning, 1 specific planning, and 1 specific plan execution. This is still overall longer than a regular statement and it still may be longer than the original generic plan on its own. The hope is that the analysis is somehow detecting the scenario where a generic plan makes no sense, but the criteria is not about whether the generic plan actually does make sense - the criteria is "can the customer afford to wait longer for us to second guess ourselves?" It's a guess. As a guess, it means sometimes it will be right, and sometimes it will be wrong. 2) Only the "order of magnitude" (by estimate) plans will benefit. If you set the number to 100X, then most plans won't benefit. If you set it to less than 100X, you increase the chance of guessing wrong in other cases. In any case, there is still no guarantee that a specific plan will be faster, so even in the 100X case, the overall results could be slower - it's just that you've decided the customer can afford to wait longer. >> My idea of an optimal system is as follows: >> >> 1) Prepare gathers and caches data about the tables involved in the >> query, including column statistics that are likely to be required >> during the planning process, but prepare does not running the >> planning process. > > It sounds to me like you're in the process of inventing another > planning process. Developer time aside, how much CPU time can you > afford to throw at this? I already said I don't think PostgreSQL could easily evolve here. However, I wanted to point out that the problem may be architectural. As for developer time and CPU time, that's not really relevant. If PREPARE/EXECUTE could be reliably sped up, than the savings is probably measure in millions of dollars or more, as it is widely used by many applications throughout the day on hundreds of thousands of computers. Oh, you mean is it worth scratching my itch? :-) Not really. 
I was thinking about it yesterday and decided that such a major change might just as easily result in a new database engine, and I didn't want to go there. Still, if some clever person agrees with me that it is an architecture problem, and that PostgreSQL could benefit from a clean "from scratch" caching mechanism for statements (note that what I described could probably be extended to support automatic prepare of every statement, and matching of query to prepared statement based on text, similar to MySQL query caching), and can come up with a way to do this using the existing architecture - that would be great. Or, they can tell me "too hard" as you are. That's fine too... :-) > I don't see any reason to argue over what would be optimal when so > much information is still missing. It just makes the problem look > harder than it is. To me, our best shot at getting something useful > is to stay simple and defensive. After that, if there is still a > need, we'll have code to help us gather more data and figure out how > to make it better. Nothing wrong with the lowest-hanging fruit. What information is missing? PREPARE sucks in many known situations. It is a documented fact. :-) Will "guessing" at when the user can afford to wait longer improve the situation? Maybe or often, but not always. Cheers, mark
Alex Hunsaker <badalex@gmail.com> writes: > Let me see if I can sum up what I was trying to say: > [ this can be solved by using or avoiding prepared statements ] Not really. The place where that argument really fails is inside server-side functions: you don't get to use query submission protocol there. But even for client-submitted queries, it's often the case that there is only one convenient way to do it given a particular software stack on the client side. If there is more than one way they're usually radically different notationally, and some of them might be vulnerable to SQL injection. The comparable thing in plpgsql is EXECUTE versus direct execution of a query: they're very different notationally, so unpleasant to convert between, and it's easy to create a security hole when interpolating parameters in EXECUTE. I think this is basically a planner problem and should be fixed in the planner, not by expecting users to make significant changes in application logic in order to create an indirect effect. regards, tom lane
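To make the notational gap concrete, here is a sketch of the two forms inside a PL/pgsql function (function, table, and column names invented). The EXECUTE ... USING form, available since 8.4, at least avoids the injection hazard of interpolating the value into the query string:

    CREATE FUNCTION count_widgets(p_status integer) RETURNS bigint AS $$
    DECLARE
        n bigint;
    BEGIN
        -- direct execution: the statement is planned once with the
        -- parameter generic and the plan is cached for the session
        SELECT count(*) INTO n FROM widgets WHERE status = p_status;

        -- dynamic execution: replanned on every call, with the value
        -- passed safely rather than interpolated into the string
        EXECUTE 'SELECT count(*) FROM widgets WHERE status = $1'
           INTO n USING p_status;

        RETURN n;
    END;
    $$ LANGUAGE plpgsql;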
On Fri, Feb 26, 2010 at 08:07, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alex Hunsaker <badalex@gmail.com> writes: >> Let me see if I can sum up what I was trying to say: >> [ this can be solved by using or avoiding prepared statements ] > > Not really. The place where that argument really fails is inside > server-side functions: you don't get to use query submission protocol > there. | You look around, it's dark... You notice someone coming towards you. He tosses you a torch. You kindly thank him. Ahh. ISTM re-planning those every time is entirely the wrong knob. > If there is more than one way > they're usually radically different notationally, and some of them > might be vulnerable to SQL injection. Good point. However, this is why I was talking about client-side drivers: they can provide an API to do this correctly and conveniently. The fact that some have not might hint that it's not really worth it most of the time? > I think this is basically a planner problem and should be fixed in the > planner, not by expecting users to make significant changes in > application logic in order to create an indirect effect. Well, of course you do. We all know how you feel about planner hints :) [ Tongue-in-cheek ] Anyway, I completely agree -- it's a much more elegant and workable solution.
Mark Mielke <mark@mark.mielke.cc> writes: > Will "guessing" at when the user can afford to wait longer improve the > situation? Maybe or often, but not always. There is no way to eliminate "guessing". The entire point here is that we don't know whether generating a custom plan will provide a win over not doing so, until after we've done it (and already taken the planning time hit). We could possibly put in heuristic tests based on the query's use of certain features; but that's still guessing, and would take nonzero time in itself. I concur with Jeroen's feeling that going for a simple approach first is the way to attack this. We could design and build something vastly more complex, then find out that it doesn't actually work much better. Also, I think there is a lot of confusion here over two different issues: generic plan versus parameter-specific plan, and bad planner estimates leading to a wrong plan choice. While the latter is certainly an issue sometimes, there is no reason to believe that it affects prepared statements worse than non-prepared ones. So I think that designing a fix for prepared statements on the assumption that you can't trust the planner's estimates is solving the wrong problem. regards, tom lane
On Fri, Feb 26, 2010 at 1:29 AM, Alex Hunsaker <badalex@gmail.com> wrote: > Prepared plans + exec plan (new guc/ protocol thing): > Use: not quite sure > Problems: slow because it would replan every time > Solutions: use a prepared plan with the appropriate things not > parametrized...? > > [ aka we already have this, its called dont use a prepared statement ] The point is sometimes you'd like to replan every time, but not reparse every time. There's no way to do that ATM. ...Robert
On Fri, Feb 26, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think this is basically a planner problem and should be fixed in the > planner, not by expecting users to make significant changes in > application logic in order to create an indirect effect. I would agree if I thought that were possible, but I'm skeptical about your proposed solution. ...Robert
On Fri, Feb 26, 2010 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Also, I think there is a lot of confusion here over two different > issues: generic plan versus parameter-specific plan, and bad planner > estimates leading to a wrong plan choice. While the latter is certainly > an issue sometimes, there is no reason to believe that it affects > prepared statements worse than non-prepared ones. So I think that > designing a fix for prepared statements on the assumption that you can't > trust the planner's estimates is solving the wrong problem. The two issues seem intimately connected to me. Of course, estimates can be inaccurate for many reasons, but ONE of those reasons is that the planner is optimizing for an unknown parameter value which has very different distribution from the actually supplied value. ...Robert
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Feb 26, 2010 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Also, I think there is a lot of confusion here over two different >> issues: generic plan versus parameter-specific plan, and bad planner >> estimates leading to a wrong plan choice. �While the latter is certainly >> an issue sometimes, there is no reason to believe that it affects >> prepared statements worse than non-prepared ones. �So I think that >> designing a fix for prepared statements on the assumption that you can't >> trust the planner's estimates is solving the wrong problem. > The two issues seem intimately connected to me. Of course, estimates > can be inaccurate for many reasons, but ONE of those reasons is that > the planner is optimizing for an unknown parameter value which has > very different distribution from the actually supplied value. Right, but if the parameter is unknown then its distribution is also unknown. In any case that's just nitpicking, because the solution is to create a custom plan for the specific value supplied. Or are you suggesting that we should create a way for users to say "here is the expected distribution of this parameter", and then try to fold that into the planner estimates? I think the uptake on that would be about nil; not to mention that it fails to fix several of the most pressing problems, such as LIKE indexing and partition selection, where all the distribution information in the world doesn't help. You need the exact value. regards, tom lane
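The LIKE case is easy to see (names invented): with a literal pattern the planner can turn the fixed prefix into an index range condition (given a suitable operator class or C locale), but with a parameter the generic plan has no prefix to work with, no matter what the parameter's distribution looks like.

    EXPLAIN SELECT * FROM customers WHERE name LIKE 'Sm%';  -- prefix known: index range scan possible
    PREPARE by_name(text) AS SELECT * FROM customers WHERE name LIKE $1;
    EXPLAIN EXECUTE by_name('Sm%');                         -- generic plan: no prefix available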
Robert Haas <robertmhaas@gmail.com> writes: > On Fri, Feb 26, 2010 at 10:07 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I think this is basically a planner problem and should be fixed in the >> planner, not by expecting users to make significant changes in >> application logic in order to create an indirect effect. > I would agree if I thought that were possible, but I'm skeptical about > your proposed solution. Fair enough --- maybe it will work well enough, or maybe it won't. But the same can be said of every other proposal that's been made. I'm in favor of trying the simpler approaches first. regards, tom lane
On 02/26/2010 11:27 AM, Tom Lane wrote: > Also, I think there is a lot of confusion here over two different > issues: generic plan versus parameter-specific plan, and bad planner > estimates leading to a wrong plan choice. While the latter is certainly > an issue sometimes, there is no reason to believe that it affects > prepared statements worse than non-prepared ones. So I think that > designing a fix for prepared statements on the assumption that you can't > trust the planner's estimates is solving the wrong problem. > Just to point out that I agree, and as per my original post, I think the only time prepared statements should be re-planned for the statistics case, is after 'analyze' has run. That sounds like a quicker solution, and a much smaller gain. After 'analyze' of an object, invalidate all cached plans for prepared statements that rely on that object and require a re-plan. I doubt this will help me or many others very often. It's something that should be done some day, but I don't recall ever concluding that a performance problem I was experiencing was related to using prepared statements too long. Also, the client is able to figure this out. The client can choose to free prepared statements after 1 minute or 1000 calls. It's not really a problem. It also has nothing to do with trust of the planner's estimates. Given the same criteria, the planner should come up with the same best plan most or all of the time. Trial and error planning, with the exception of hugely complicated plans that cannot be produced in a finite time frame, does not appeal to me at all. I do trust the planner's estimates. The issue of specific parameters is the one I think most of us would benefit from, and I think the most effective benefit is to not create generic plans. I would prefer a prepare with specific plan and re-plan when the specific plan does not apply, over generic plan, every time. This has nothing to do with "time to prepare" or a ratio of "time to prepare" vs "time to execute", or plans that are expected to take some time to execute. The fact that I can run a PREPARE/EXECUTE, and SELECT, and with only one invocation see a difference of over 100X shows that generic plans are just not the right approach. It works according to spec, but it is not practical under the current model. Generic plans are the problem. My post was to bring attention to this, as I see most comments focusing on an assumption that generic plans provide value, and specific plans should only be used when generic plans are expected to take a while to execute. It's walking around the problem that the idea of a generic plan is just wrong. The only time a generic plan is right, is when the specific plan would result in the same plan. Cheers, mark
Mark Mielke <mark@mark.mielke.cc> writes: > Just to point out that I agree, and as per my original post, I think the > only time prepared statements should be re-planned for the statistics > case, is after 'analyze' has run. That sounds like a quicker solution, > and a much smaller gain. After 'analyze' of an object, invalidate all > cached plans for prepared statements that rely on that object and > require a re-plan. Please note that that has been happening since 8.3, which is probably why you haven't detected a problem. > ... It's walking around the problem > that the idea of a generic plan is just wrong. The only time a generic > plan is right, is when the specific plan would result in the same. I think that's a significant overstatement. There are a large number of cases where a custom plan isn't worth it, even if it doesn't generate exactly the same plan. regards, tom lane
On 02/26/2010 01:59 PM, Tom Lane wrote: >> ... It's walking around the problem >> that the idea of a generic plan is just wrong. The only time a generic >> plan is right, is when the specific plan would result in the same. >> > I think that's a significant overstatement. There are a large number > of cases where a custom plan isn't worth it, even if it doesn't generate > exactly the same plan. > There must be some way to lift the cost of planning out of the plan enumeration and selection phase, such that only plan enumeration and selection is run at execute time. In most cases, plan enumeration and selection, provided that all data required to make these decisions is all cached in data structures ready to go, should be very fast? Right? Wrong? If right, my original post suggested that prepare should do the parts of planning which are fixed, and not change based on the input parameters, while execute should do the dynamic parts that would change based on the input parameters. By "not worth it", do you mean development effort or run time? For development effort, it would definitely be worth it in the grand scheme of things, but perhaps not worth it to specific individuals. For run time, I've having trouble seeing the situation where it would not be worth it. In the case that the resulting plan is the same (custom vs generic) there should be no cost. In the case that the plan is different, I think the difference proves that it is worth it. The case where it wouldn't be worth it would be if a prepared statement was called many times with many different parameters, and each set of parameters required a re-plan - but my experience in this regard tells me that the current model is to choose a sub-optimal plan, and the entire query will run much slower than the planning time, on every execute. We wouldn't be having this discussion if generic plans were considered adequate. So, I feel that it is worth it in this case as well. It's the development effort that is the problem. I can't do it, and I can't make you do it. If you say "too hard", there isn't anything I can do about it. :-) Cheers, mark
On 02/26/2010 01:59 PM, Tom Lane wrote: > Mark Mielke<mark@mark.mielke.cc> writes: > >> Just to point out that I agree, and as per my original post, I think the >> only time prepared statements should be re-planned for the statistics >> case, is after 'analyze' has run. That sounds like a quicker solution, >> and a much smaller gain. After 'analyze' of an object, invalidate all >> cached plans for prepared statements that rely on that object and >> require a re-plan. >> > Please note that that has been happening since 8.3, which is probably > why you haven't detected a problem. > Excellent, and sorry for missing the release note on this. Thanks, mark
Mark Mielke <mark@mark.mielke.cc> writes: > There must be some way to lift the cost of planning out of the plan > enumeration and selection phase, such that only plan enumeration and > selection is run at execute time. In most cases, plan enumeration and > selection, provided that all data required to make these decisions is > all cached in data structures ready to go, should be very fast? Right? Huh? What exactly do you think the cost of planning is, if not enumeration and selection? There isn't very much that's cacheable, at least not in any sanely-sized cache. > By "not worth it", do you mean development effort or run time? Run time. The development cost of what you are proposing is negligible: just rip out the plan cache altogether. I don't believe it would be a performance win though. regards, tom lane
On Fri, Feb 26, 2010 at 09:50, Robert Haas <robertmhaas@gmail.com> wrote: > On Fri, Feb 26, 2010 at 1:29 AM, Alex Hunsaker <badalex@gmail.com> wrote: >> Prepared plans + exec plan (new guc/ protocol thing): >> Use: not quite sure >> Problems: slow because it would replan every time >> Solutions: use a prepared plan with the appropriate things not >> parametrized...? >> >> [ aka we already have this, its called dont use a prepared statement ] > > The point is sometimes you'd like to replan every time, but not > reparse every time. There's no way to do that ATM. So what you save is parse time? Maybe that's worth it. I've never run the numbers nor have I seen them in this thread. I probably missed em... My _hunch_ is planning will on average take significantly longer than parse time (read: in the noise of plan time). But that's unfounded :) I can certainly imagine cases where you have HUGE queries where the parse time is too slow -- won't the plan most of the time be an order of magnitude slower? Anyway I'll stop until I get a chance to do _some_ kind of benchmarking, I'm really quite clueless here.
Tom Lane wrote: > Right, but if the parameter is unknown then its distribution is also > unknown. In any case that's just nitpicking, because the solution is > to create a custom plan for the specific value supplied. Or are you > suggesting that we should create a way for users to say "here is the > expected distribution of this parameter", and then try to fold that into > the planner estimates? Or instead of letting users give the distribution, gather it automatically in some plan statistics catalog? I suspect in most applications queries stay the same for months and maybe years, so after some number of iterations it is possible to have decent call statistics / parameter distributions. Maybe the parameter value distribution could even be annotated with actual cached plans. regards, Yeb Havinga
On 02/26/2010 02:57 PM, Tom Lane wrote: > Mark Mielke<mark@mark.mielke.cc> writes: > >> There must be some way to lift the cost of planning out of the plan >> enumeration and selection phase, such that only plan enumeration and >> selection is run at execute time. In most cases, plan enumeration and >> selection, provided that all data required to make these decisions is >> all cached in data structures ready to go, should be very fast? Right? >> > Huh? What exactly do you think the cost of planning is, if not > enumeration and selection? There isn't very much that's cacheable, > at least not in any sanely-sized cache. > I think most operations, including this one, can be broken into a fixed portion and a dynamic portion. The PREPARE should concern itself only with the fixed portion, and should leave the dynamic portion to EXECUTE. At present, the "planning process" is one big blob. Here are parts that can be done "fixed": 1) Statement parsing and error checking. 2) Identification of tables and columns involved in the query. 3) Query the column statistics for involved columns, to be used in plan cost estimation now and later. 4) Determine plan constraints under which elements of the plan must be executed a certain way (something like constant folding for a compiler), or for which parameter substitution would not impact the outcome. 5) Identify the elements of the plan that still require plan enumeration and plan selection, to be used in a later part of the pipeline. At a minimum, I am suggesting that 1), 2), and 3) should take a chunk out of the planning process. I think 4) and 5) are more complex but still valuable in terms of extracting the fixed portion out of the planning process. I think an assumption is being made that the planning process is an atomic unit that cannot be turned into a pipeline or assembly line. I think this assumption was what originally tied PREPARE = PLAN, and EXECUTE = RUN. I think this assumption is leading to the conclusion that EXECUTE should re-plan. I also expect that this assumption is tightly woven into the current implementation and changing it would require some amount of re-architecture. :-) >> By "not worth it", do you mean development effort or run time? >> > Run time. The development cost of what you are proposing is negligible: > just rip out the plan cache altogether. I don't believe it would be a > performance win though. > That's not my proposal, though. I'm suspecting you didn't read it. :-) I'm fine with you saying "too hard and not worth my development effort" after you read it. I agree it would be a lot of work. But if the conclusion is that the current architecture is the best that can be had, and the decision is only about when to do a custom re-plan or when to use the generic plan, I am putting my opinion out there that the generic plan has always been a compromise, and it will always be a compromise, and that this discussion exists primarily because the compromise is not adequate in many real world scenarios. And that all said, I think I am challenging the status quo and ticking people off. So while my intent is to challenge the status quo, it is not to tick people off. So, please let me know if you would like me to continue, or if you have already written this off. :-) Cheers, mark
On 02/26/2010 03:11 PM, Yeb Havinga wrote: > Tom Lane wrote: >> Right, but if the parameter is unknown then its distribution is also >> unknown. In any case that's just nitpicking, because the solution is >> to create a custom plan for the specific value supplied. Or are you >> suggesting that we should create a way for users to say "here is the >> expected distribution of this parameter", and then try to fold that into >> the planner estimates? > Or instead of letting users give the distribution, gather it > automatically in some plan statistics catalog? I suspect in most > applications queries stay the same for months and maybe years, so > after some number of iterations it is possible to have decent call > statistics / parameter distributions. Maybe the the parameter value > distribution could even be annotated with actual cached plans. The problem with the last - actual cached plans - is that it implies the other aspect I have been suggesting: In order to have a custom cached plan, the primary model must be to use custom plans. If PREPARE/EXECUTE uses generic plans normally, then the only cached plans available will be generic plans. Cheers, mark
Mark Mielke wrote: > On 02/26/2010 03:11 PM, Yeb Havinga wrote: >> Or instead of letting users give the distribution, gather it >> automatically in some plan statistics catalog? I suspect in most >> applications queries stay the same for months and maybe years, so >> after some number of iterations it is possible to have decent call >> statistics / parameter distributions. Maybe the the parameter value >> distribution could even be annotated with actual cached plans. > > The problem with the last - actual cached plans - is that it implies > the other aspect I have been suggesting: In order to have a custom > cached plan, the primary model must be to use custom plans. If > PREPARE/EXECUTE uses generic plans normally, than the only cached > plans available will be generic plans. I should have been clearer, with 'actual cached plans' I meant 'cached plans planned with actual parameters' or 'cached custom plans'. It makes no sense to annotate points or intervals in a gathered value distribution with generic plans. regards, Yeb Havinga
Mark Mielke <mark@mark.mielke.cc> writes:
> Here are parts that can be done "fixed":
> 1) Statement parsing and error checking.
> 2) Identification of tables and columns involved in the query.

The above two are done in the parser, not the planner.

> 3) Query the column statistics for involved columns, to be used in plan
> cost estimation now and later.

What makes you think that these can be cached any more effectively than
they are now?

> 4) Determine plan constraints under which elements of the plan must be
> executed a certain way (something like constant folding for a compiler),
> or for which parameter substitution would not impact the outcome.

Some marginal savings achievable there, perhaps. But you can't just push
const-folding upstream, because one of the purposes of it is to const-fold
expressions involving parameter values, if the planner is allowed to
assume the parameter values are known. So while we could do a pass of
const-folding upstream (in the rewriter say), we would still need a pass
at plan time. Not very clear that this nets out as a win.

> 5) Identify the elements of the plan that still require plan enumeration
> and plan selection, to be used in a later part of the pipeline.

[ shrug... ] In practice that will typically be "all of them". The only
time it wouldn't be is when a parameter is only used in a join condition,
but I think the majority of practical cases have parameters that are
involved in relation scan conditions. Even if you could cache the path
results for some of the relations involved in a query, the required size
of the cache would be staggering.

> And that all said, I think I am challenging the status quo and ticking
> people off. So while my intent is to challenge the status quo, it is not
> to tick people off. So, please let me know if you would like me to
> continue, or if you have already written this off. :-)

It would be more productive if you spent awhile studying the code as it
actually exists now, and then started to propose rearchitecting. What I'm
reading here seems to be largely uninformed speculation.

                        regards, tom lane
On Fri, Feb 26, 2010 at 12:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Feb 26, 2010 at 11:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Also, I think there is a lot of confusion here over two different
>>> issues: generic plan versus parameter-specific plan, and bad planner
>>> estimates leading to a wrong plan choice. While the latter is certainly
>>> an issue sometimes, there is no reason to believe that it affects
>>> prepared statements worse than non-prepared ones. So I think that
>>> designing a fix for prepared statements on the assumption that you
>>> can't trust the planner's estimates is solving the wrong problem.
>
>> The two issues seem intimately connected to me. Of course, estimates
>> can be inaccurate for many reasons, but ONE of those reasons is that
>> the planner is optimizing for an unknown parameter value which has
>> very different distribution from the actually supplied value.
>
> Right, but if the parameter is unknown then its distribution is also
> unknown.

I must be losing my mind. The distribution of the parameter will be known
at execution time because a specific value will be supplied. Of course,
it's not known at prepare time.

> In any case that's just nitpicking, because the solution is
> to create a custom plan for the specific value supplied. Or are you
> suggesting that we should create a way for users to say "here is the
> expected distribution of this parameter", and then try to fold that into
> the planner estimates? I think the uptake on that would be about nil;
> not to mention that it fails to fix several of the most pressing
> problems, such as LIKE indexing and partition selection, where all the
> distribution information in the world doesn't help. You need the exact
> value.

No, I'm not suggesting that. Basically, what I really want here is some
kind of keyword or other syntax that I can stick into a PL/pgsql query
that requests a replan on every execution. It's easy to identify the cases
where this is needed: they are precisely the cases where my function runs
for a small eternity. I realize that you can use EXECUTE for this, but as
you pointed out upthread, that's not without its share of pain. Exposing
the same functionality via other interfaces (e.g. PQprepare, other PLs)
would be nice too, but for me personally, PL/pgsql is the one that keeps
biting me in the rear end.

One particularly ugly misadventure in this area had me write a function
which updated a bunch of data in user tables either for one particular
widget, if a widget id was passed in as an argument, or for all widgets,
if NULL was passed in. I did this by writing ...some big hairy query...
WHERE ... various conditions ... AND (id = $1 OR $1 IS NULL). The point
was that sometimes the caller knew that only a particular widget needed to
have its stuff recomputed, and sometimes it didn't know for sure what
might need updating so it just wanted to update everything. Of course it
turned out that this didn't really work: the same plan was used in both
cases, so the version with an id specified took just as long to run as the
generic version.

...Robert
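To make Robert's failure mode concrete, here is a minimal PL/pgSQL sketch
of the pattern he describes. The table, column, and function names
(widget_stats, widget_events, refresh_widget_stats) are hypothetical
stand-ins, not taken from his actual function; the essential part is the
(id = $1 OR $1 IS NULL) qual, which forces the single cached plan to stay
correct for the NULL ("update everything") case and therefore to scan
everything even when a specific id is supplied.

    -- Hypothetical reconstruction of the anti-pattern: one parameter that
    -- is either a specific key or NULL meaning "all rows".
    CREATE FUNCTION refresh_widget_stats(p_widget_id integer) RETURNS void AS $$
    BEGIN
        -- PL/pgSQL caches one plan for this statement.  The generic plan
        -- cannot use an index on widget_id, because it must also be
        -- correct when p_widget_id is NULL.
        UPDATE widget_stats ws
           SET total = sub.total
          FROM (SELECT widget_id, count(*) AS total
                  FROM widget_events
                 GROUP BY widget_id) sub
         WHERE ws.widget_id = sub.widget_id
           AND (ws.widget_id = p_widget_id OR p_widget_id IS NULL);
    END;
    $$ LANGUAGE plpgsql;

    -- Both calls end up running the same plan:
    SELECT refresh_widget_stats(42);    -- intended: recompute one widget
    SELECT refresh_widget_stats(NULL);  -- intended: recompute all widgets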
Robert Haas <robertmhaas@gmail.com> writes:
> Basically, what I really want here is some kind of keyword or other
> syntax that I can stick into a PL/pgsql query that requests a replan
> on every execution.

Wouldn't it be better if it just did the right thing automatically?

The sort of heuristic I'm envisioning would essentially do "replan every
time" for some number of executions, and give up only if it noticed that
it wasn't getting anything better than the generic plan. So you'd have a
fixed maximum overhead per session when the custom plan was useless, and
the Right Thing when it wasn't.

                        regards, tom lane
On 02/26/2010 07:03 PM, Tom Lane wrote:
> Robert Haas<robertmhaas@gmail.com> writes:
>> Basically, what I really want here is some kind of keyword or other
>> syntax that I can stick into a PL/pgsql query that requests a replan
>> on every execution.
>
> Wouldn't it be better if it just did the right thing automatically?

Yes please. :-)

Often, we are just users of the application, and we do not have the
freedom to change it.

> The sort of heuristic I'm envisioning would essentially do "replan every
> time" for some number of executions, and give up only if it noticed that
> it wasn't getting anything better than the generic plan. So you'd have
> a fixed maximum overhead per session when the custom plan was useless,
> and the Right Thing when it wasn't.

My other comments aside - I think generic plan + specific plan, where the
specific plan continues to beat the generic plan, will meet the cases that
really annoyed me, and would make a lot of us very happy...

Thanks.

Cheers,
mark
On Fri, Feb 26, 2010 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> Basically, what I really want here is some kind of keyword or other
>> syntax that I can stick into a PL/pgsql query that requests a replan
>> on every execution.
>
> Wouldn't it be better if it just did the right thing automatically?
>
> The sort of heuristic I'm envisioning would essentially do "replan every
> time" for some number of executions, and give up only if it noticed that
> it wasn't getting anything better than the generic plan. So you'd have
> a fixed maximum overhead per session when the custom plan was useless,
> and the Right Thing when it wasn't.

Which is likely useless for my use case.

...Robert
On 26/02/2010 11:40 AM, Tom Lane wrote:
> But putting support for a per-query level
> of control into the protocol (and then every client library) as well as
> every PL is going to be painful to implement, and even more painful to
> use.

You mean something like 'EXECUTE REPLAN' and protocol/PL-level
equivalents? That's what people on -GENERAL often seem to need: a way, for
a particular query, to say "replan this every time, because the stats are
very significant here".

> I still like the idea of automatically replanning with the known
> parameter values, and noting whether the result plan was estimated to be
> noticeably cheaper than the generic plan, and giving up on generating
> custom plans if we didn't observe any such win over N tries.

That risks making prepared statements less efficient when re-used between
2 and N times. People would end up asking for a 'no replan' knob, which
I'm not sure is any improvement over a 'force replan' knob. OTOH, a GUC to
turn that off would be (IMO) fairly harmless - if you know you don't have
any problematic prepared queries, turn it off to save some cycles.

Another thought: I wonder if this should be viewed from one step back.
Many of these issues come from people who don't actually want prepared
statements for performance; they're just using them to get convenient and
secure parameter placement and server-side caching of the query text.

Essentially, you have:

1) People preparing statements to save on parse+plan time; and
2) People preparing statements to get convenient param placement.

I suspect that most of (1) also want (2), but many of (2) don't care much
about (1) and are just preparing statements for sql-injection safety
(param placement), because they've been told to by someone, because their
library does it for them, etc.

So: would it be easier to handle control of replan vs no-replan at PREPARE
time? Or would that have very much the same protocol/pl change issues?

--
Craig Ringer
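Purely as a hypothetical illustration of Craig's "control it at PREPARE
time" idea - none of this syntax exists in PostgreSQL, and nobody in the
thread has proposed these exact keywords - a per-statement knob might look
something like:

    -- Hypothetical syntax only: PREPARE has no such option.
    -- Always re-plan at EXECUTE time, using the bound parameter values:
    PREPARE fetch_customer_orders (int) REPLAN AS
        SELECT * FROM orders WHERE customer_id = $1;

    -- Always reuse the generic plan, skipping any replan heuristic:
    PREPARE fetch_order_by_pk (bigint) GENERIC AS
        SELECT * FROM orders WHERE order_id = $1;

The rest of the thread is largely about whether such a knob is worth
exposing at all, or whether the planner should decide automatically.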
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Feb 26, 2010 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Wouldn't it be better if it just did the right thing automatically?
>>
>> The sort of heuristic I'm envisioning would essentially do "replan every
>> time" for some number of executions, and give up only if it noticed that
>> it wasn't getting anything better than the generic plan. So you'd have
>> a fixed maximum overhead per session when the custom plan was useless,
>> and the Right Thing when it wasn't.

> Which is likely useless for my use case.

[ shrug... ] You'd better explain exactly why, if you want me to take that
objection seriously.

                        regards, tom lane
On 02/27/2010 11:20 PM, Craig Ringer wrote:
> Essentially, you have:
>
> 1) People preparing statements to save on parse+plan time; and
> 2) People preparing statements to get convenient param placement.
>
> I suspect that most of (1) also want (2), but many of (2) don't care
> much about (1) and are just preparing statements for sql-injection
> safety (param placement), because they've been told to by someone,
> because their library does it for them, etc.
>
> So: Would it be easier to handle control of replan vs no-replan at
> PREPARE time? Or would that have very much the same protocol/pl change
> issues?

I think if SQL hints were sufficient, clients would only need to remove
the prepared statement and re-create it whenever required.

It should do the right thing automatically.

I'm convinced that means generic plans are always wrong, and that the
right approach is some combination of performing fixed operations in
PREPARE and variable operations in EXECUTE, combined with plan caching
against the prepared statement, with criteria to determine whether or not
the parameters match the assumptions made when creating one of the cached
plans. Tom says extracting the fixed part of the planning out to PREPARE
would be difficult or less valuable than I think. And the multi-plan
caching with criteria seems to have been brought up and not commented on
much by several people. So, it doesn't look like I will get this unless I
learn how to implement it myself - which is probably not feasible at this
time. :-)

Not getting this, I think I'd be happy if PREPARE/EXECUTE can *easily*
detect the worst cases (i.e. not slower in the general case), and generic
plan plus custom plan plus custom execution is still significantly faster
than generic plan plus generic execution.

Adding SQL to indicate whether it should be re-planned or not is
completely unappealing. If I could change the code, today, I'd just turn
off or choose not to use PREPARE/EXECUTE. Today, PREPARE/EXECUTE seems
like it should always be considered slower unless one can prove it is
actually faster in a specific case, which is the exact opposite of what
people expect.

Cheers,
mark
Craig Ringer <craig@postnewspapers.com.au> writes:
> 1) People preparing statements to save on parse+plan time; and
> 2) People preparing statements to get convenient param placement.
>
> I suspect that most of (1) also want (2), but many of (2) don't care much
> about (1) and are just preparing statements for sql-injection safety
> (param placement), because they've been told to by someone, because their
> library does it for them, etc.
>
> So: Would it be easier to handle control of replan vs no-replan at
> PREPARE time? Or would that have very much the same protocol/pl change
> issues?

http://www.postgresql.org/docs/8.4/static/libpq-exec.html#LIBPQ-EXEC-MAIN

  PQexecParams

  Submits a command to the server and waits for the result, with the
  ability to pass parameters separately from the SQL command text.

So I think what you're talking about is already in there.

--
dim
On Fri, Feb 26, 2010 at 4:01 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> It's not going to be easier to implement. Yeah, it would be easy to
>> provide a global switch via a GUC setting, but that's not going to be
>> helpful, because this is the sort of thing that really needs to be
>> managed per-query. Almost any nontrivial application is going to have
>> some queries that really need the custom plan and many that don't.
>> If people just turn the GUC on we might as well throw away the plan
>> caching mechanism altogether.
>
> I agree. A GUC is a really bad idea.

I'm not sure I see why. GUCs can be turned on and off per-query. Even if
you have protocol-level support, a GUC could be more convenient in some
cases. Specifically, it would be useful for testing explain plans on
queries that you expect to be planned without parameters. Currently you
have to rewrite the query using PREPARE, which is far from intuitive for
users, and even once you know how to do it, it prevents you from just
copying and pasting queries.

>> But putting support for a per-query level
>> of control into the protocol (and then every client library) as well as
>> every PL is going to be painful to implement, and even more painful to
>> use.
>
> I suppose I should have learned by now not to argue with you over
> technical points, but I don't see why this should be painful. I mean,
> it'll be a lot of work and it'll in the end touch a lot of different
> parts of the code, but work != pain, and I don't see any reason why
> the problem can't be attacked incrementally. I'm also deeply
> unconvinced that any other solution will be as satisfactory.

I'm not sure we should be so conservative about adding features to the
protocol. Sure, it'll take time to filter through to all the drivers, and
the better ones will support it before the less well maintained ones. But
that's just a question of time and nobody will be too surprised by that.

I think we should have a general purpose options field for all our
messages. We can define an enum of option keys and pass an integer value
for each option. In some cases they'll actually be boolean and in other
cases they'll be tunables, but it seems simpler to make it so the server
can parse all the options out of the message even if it doesn't understand
them all. And that means proxies can pass them on without understanding
what they all mean.

We discussed a while back marking all "latency-critical" queries so the
backend knows that any buffers touched by that query should be marked as
more important to stay in cache. If we got more complex priority handling
we would be able to pass an option for the time budget for a query or its
nice level.

--
greg
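The rewrite Greg is referring to is the PREPARE-then-EXPLAIN-EXECUTE
dance; a small example of what it looks like today (the statement, table,
and column names are illustrative):

    -- To see how a query behaves as a parameterized prepared statement,
    -- you cannot simply EXPLAIN the text with a literal value; you have
    -- to rewrite it around PREPARE and then explain the EXECUTE:
    PREPARE find_events (int) AS
        SELECT * FROM events WHERE user_id = $1;

    -- In 8.4 this shows the plan built with $1 unknown (the generic plan).
    EXPLAIN EXECUTE find_events(42);

    DEALLOCATE find_events;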
On Sat, Feb 27, 2010 at 11:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Fri, Feb 26, 2010 at 7:03 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> Wouldn't it be better if it just did the right thing automatically?
>>>
>>> The sort of heuristic I'm envisioning would essentially do "replan every
>>> time" for some number of executions, and give up only if it noticed that
>>> it wasn't getting anything better than the generic plan. So you'd have
>>> a fixed maximum overhead per session when the custom plan was useless,
>>> and the Right Thing when it wasn't.
>
>> Which is likely useless for my use case.
>
> [ shrug... ] You'd better explain exactly why, if you want me to take
> that objection seriously.

Hmm... on further thought, maybe it *would* work in that case. I'm still
not convinced this is going to be generally satisfactory. It seems like it
depends a great deal on how many times the function figures to be called
per session and in what percentage of those cases a non-generic plan
figures to be better. The appeal of a user-controllable knob is that I am
pretty sure from experience that I can set it correctly, but hey...

...Robert
On Sun, Feb 28, 2010 at 2:52 AM, Mark Mielke <mark@mark.mielke.cc> wrote:
> On 02/27/2010 11:20 PM, Craig Ringer wrote:
>>
>> Essentially, you have:
>>
>> 1) People preparing statements to save on parse+plan time; and
>> 2) People preparing statements to get convenient param placement.
>>
>> I suspect that most of (1) also want (2), but many of (2) don't care much
>> about (1) and are just preparing statements for sql-injection safety
>> (param placement), because they've been told to by someone, because their
>> library does it for them, etc.
>>
>> So: Would it be easier to handle control of replan vs no-replan at
>> PREPARE time? Or would that have very much the same protocol/pl change
>> issues?
>
> I think if SQL hints were sufficient, clients would only need to remove
> the prepared statement and re-create it whenever required.
>
> It should do the right thing automatically.
>
> I'm convinced that means generic plans are always wrong, and that the
> right approach is some combination of performing fixed operations in
> PREPARE and variable operations in EXECUTE, combined with plan caching
> against the prepared statement, with criteria to determine whether or not
> the parameters match the assumptions made when creating one of the cached
> plans. Tom says extracting the fixed part of the planning out to PREPARE
> would be difficult or less valuable than I think. And the multi-plan
> caching with criteria seems to have been brought up and not commented on
> much by several people. So, it doesn't look like I will get this unless I
> learn how to implement it myself - which is probably not feasible at this
> time. :-)
>
> Not getting this, I think I'd be happy if PREPARE/EXECUTE can *easily*
> detect the worst cases (i.e. not slower in the general case), and generic
> plan plus custom plan plus custom execution is still significantly faster
> than generic plan plus generic execution.
>
> Adding SQL to indicate whether it should be re-planned or not is
> completely unappealing. If I could change the code, today, I'd just turn
> off or choose not to use PREPARE/EXECUTE. Today, PREPARE/EXECUTE seems
> like it should always be considered slower unless one can prove it is
> actually faster in a specific case, which is the exact opposite of what
> people expect.

I don't really understand most of what you're saying here, but there's
definitely some truth to your last sentence. This has easily got to be one
of the top ten questions on -performance.

...Robert
How about a totally different approach? What if the plans of all queries,
simple and prepared, were pre-planned and always cached, persistently? For
prepared statements with >= 1 parameters, histogram and MCV information
could be used to search the plan space for interesting plans. Maybe with
some heuristics to cut down on the search space (e.g. when the operator is
'=' and there is a unique index, skip that clause / parameter from the
search space).

Since processors keep getting more and more cores, and most database
activity is IO bound, why not keep one core busy with query analysis?

good:
- with the several hooks available it could be implemented as an optional
  contrib module
- it offers plan stability
- nice info for a management user interface
- might be a solution for prepared queries
- for queries with large joins, plans might be considered with exhaustive
  search, so also here there could be an improvement
- it might even be possible to 'test' plans during low-usage hours

bad:
- unknown how big the space for cached plans should be
- if a big cached plan space doesn't fit in memory, actual planning is
  probably better than fetching from disk, ~= 5 to 10 ms

regards,
Yeb Havinga
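The histogram and MCV information Yeb refers to is already collected by
ANALYZE and exposed through the pg_stats view; a plan-space search could
start from something like the following (the table and column names here
are purely illustrative):

    -- Per-column statistics that a plan-space search could consult.
    SELECT attname,
           n_distinct,
           most_common_vals,   -- MCV list: candidate "interesting" values
           most_common_freqs,
           histogram_bounds    -- bucket boundaries for the non-MCV values
      FROM pg_stats
     WHERE schemaname = 'public'
       AND tablename  = 'orders'
       AND attname    = 'customer_id';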
Dimitri Fontaine wrote:
> Craig Ringer <craig@postnewspapers.com.au> writes:
>> 1) People preparing statements to save on parse+plan time; and
>> 2) People preparing statements to get convenient param placement.
>>
>> I suspect that most of (1) also want (2), but many of (2) don't care much
>> about (1) and are just preparing statements for sql-injection safety
>> (param placement), because they've been told to by someone, because their
>> library does it for them, etc.
>>
>> So: Would it be easier to handle control of replan vs no-replan at
>> PREPARE time? Or would that have very much the same protocol/pl change
>> issues?
>
> http://www.postgresql.org/docs/8.4/static/libpq-exec.html#LIBPQ-EXEC-MAIN
>
>   PQexecParams
>
>   Submits a command to the server and waits for the result, with the
>   ability to pass parameters separately from the SQL command text.
>
> So I think what you're talking about is already in there.

There are three levels: SQL, libpq, and the wire protocol. The wire
protocol offers this via unnamed statements. libpq offers it via
PQexecParams(). SQL does not, as far as I can tell.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
Robert Haas wrote:
>> Adding SQL to indicate whether it should be re-planned or not is
>> completely unappealing. If I could change the code, today, I'd just turn
>> off or choose not to use PREPARE/EXECUTE. Today, PREPARE/EXECUTE seems
>> like it should always be considered slower unless one can prove it is
>> actually faster in a specific case, which is the exact opposite of what
>> people expect.
>
> I don't really understand most of what you're saying here, but there's
> definitely some truth to your last sentence. This has easily got to
> be one of the top ten questions on -performance.

It seems it is the problem everyone knows about but no one fixes. :-(

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  PG East:  http://www.enterprisedb.com/community/nav-pg-east-2010.do
On Tue, Mar 2, 2010 at 6:54 PM, Bruce Momjian <bruce@momjian.us> wrote:
> Robert Haas wrote:
>>> Adding SQL to indicate whether it should be re-planned or not is
>>> completely unappealing. If I could change the code, today, I'd just turn
>>> off or choose not to use PREPARE/EXECUTE. Today, PREPARE/EXECUTE seems
>>> like it should always be considered slower unless one can prove it is
>>> actually faster in a specific case, which is the exact opposite of what
>>> people expect.
>>
>> I don't really understand most of what you're saying here, but there's
>> definitely some truth to your last sentence. This has easily got to
>> be one of the top ten questions on -performance.
>
> It seems it is the problem everyone knows about but no one fixes. :-(

I'd work on it, but Tom doesn't like my proposed fix. *shrug*

...Robert