Thread: Plan invalidation vs. unnamed prepared statements
[ cc'd to pgsql-jdbc, which seems the group most likely to be affected by any protocol change ]

So I've been working on a plan cache module per my earlier proposal, and I've run up against a problem with getting exec_parse_message to use it. The problem is that the current, rather hackish handling of unnamed prepared statements doesn't fit in.

Per the documentation, unnamed statements are supposed to be "optimized for the case of executing a query only once and then discarding it". In the current code this largely just means that we avoid copying the parse/plan trees into the normal PreparedStatement cache, preferring to let them sit in the context where they were generated. (That means any detritus generated by the parser/planner can't be recovered until we discard the unnamed statement, but that seems a good tradeoff in this situation.)

To use the plan cache for unnamed statements, there's going to have to be more overhead (more tree-copying) in this code path; moreover, having the unnamed statement's plan in the cache will result in distributed overhead for checking whether it's still valid. This overhead is largely going to be wasted if the statement is always discarded immediately after use.

I can think of several options for dealing with this:

A. Just accept the extra overhead, thereby preserving the current behavior of unnamed statements, and gaining the benefit that plan invalidation will work correctly in the few cases where an unnamed statement's plan lasts long enough to need replanning.

B. Don't store the unnamed statement in the plan cache. To make sure it's not used anymore when the plan might be stale, forcibly discard the unnamed statement after execution. This would get rid of a lot of overhead but would mean a significant change in the protocol-level behavior. It's hard to guess how many clients might be broken by it --- conceivably not any, but that seems too optimistic :-(

C. Don't store the unnamed statement in the plan cache. To make sure it's not used anymore when the plan might be stale, don't analyze or plan at Parse-message time, but postpone *all* that work until Bind; and always discard the plan after Execute. We could still do "raw parsing" at Parse time, since that's independent of database state, but all but the most trivial syntactic errors would now occur at Bind rather than Parse time, as would the majority of the time expenditure. This still amounts to a change in the protocol semantics, although it's a lot more subtle than plan B. Also, there's a problem if the client does Describe Statement before Bind: we still have to run parse analysis before we can answer, and if we then throw that away, we have no very good way to guarantee that the statement still has the same description when it's subsequently executed; plus we end up doing parse analysis twice.

D. Don't store the unnamed statement in the plan cache, and just ignore the possibility that its plan might become stale before use. That's exactly what happens now, but considering that the whole point of the plan inval work is to seal off such pitfalls, I can't say that I care for this alternative.

Comments? I'm leaning to plan A, but wanted to see if anyone would support plan B or sees a way to fix plan C.

			regards, tom lane
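[ Editorial sketch: the cost/safety tradeoff between options A and D can be illustrated with a toy model. All names below are made up for illustration and bear no relation to the backend's actual data structures; the point is just that option A pays a validity check on every use of the statement, while option D skips it and risks executing a stale plan. ]

```python
# Toy model of option A: the unnamed statement's plan lives in a cache
# that tracks invalidations, so a stale plan is detected and replanned
# before use. Hypothetical names throughout; not PostgreSQL internals.

class PlanCache:
    def __init__(self):
        self.epoch = 0              # bumped whenever relevant DDL occurs

    def invalidate(self):
        self.epoch += 1

class CachedPlan:
    def __init__(self, cache, query):
        self.cache = cache
        self.query = query
        self.plan = self._plan()

    def _plan(self):
        # Record the cache state the plan was built against.
        self.planned_epoch = self.cache.epoch
        return ("PLAN", self.query)         # stand-in for a plan tree

    def get_valid_plan(self):
        # Option A's distributed overhead: a validity check on every use.
        if self.planned_epoch != self.cache.epoch:
            self.plan = self._plan()        # replan against current schema
        return self.plan

cache = PlanCache()
stmt = CachedPlan(cache, "SELECT * FROM foo")
first = stmt.get_valid_plan()               # still valid, no replan
cache.invalidate()                          # e.g. ALTER TABLE elsewhere
second = stmt.get_valid_plan()              # detected stale, replanned
```

Under option D the `get_valid_plan` check simply wouldn't exist, and `second` would silently be the stale plan.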
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> B. Don't store the unnamed statement in the plan cache. To make sure
> it's not used anymore when the plan might be stale, forcibly discard
> the unnamed statement after execution. This would get rid of a lot
> of overhead but would mean a significant change in the protocol-level
> behavior. It's hard to guess how many clients might be broken by it
> --- conceivably not any, but that seems too optimistic :-(

Can we forcibly discard it if *any* messages are received that might invalidate a plan? So basically it would work fine unless anyone in the system does any DDL at all? I guess that has the downside of introducing random, unpredictable failures.

Or stash the query string and replan it (possibly in the query cache this time) if someone executes it a second time?

Can't say I like either of those options much, just trying to brainstorm.

-- 
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
Gregory Stark <stark@enterprisedb.com> writes:

> Can we forcibly discard it if *any* messages are received that might
> invalidate a plan? So basically it would work fine unless anyone in the
> system does any DDL at all? I guess that has the downside of introducing
> random unpredictable failures.

Ugh :-(

> Or stash the query string and replan it (possibly in the query cache this
> time) if someone executes it a second time?

I think that's either my plan A or C.

The main problem with uncontrolled replanning is that there's no way to detect a change in the query properties. For example, suppose the query is "SELECT * FROM foo" and we've already told the client (via Describe Statement) that it returns two integer columns. If an inval now arrives because of "ALTER TABLE foo ADD COLUMN" (or perhaps worse, ALTER COLUMN TYPE), we've got a problem. If we just blindly replan, then we'll return tuples that do not match the previously given row description, which will certainly break most clients.

The plan caching module has enough infrastructure to detect and complain about these sorts of situations, and it also knows how to manage lock acquisition so that once we've decided a plan is still good, the tables won't change underneath us while we use the plan. I don't see any way to make comparable guarantees without the overhead that goes with the cache manager.

			regards, tom lane
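[ Editorial sketch: the hazard Tom describes is that the row description already sent to the client via Describe Statement becomes a contract, and a blind replan after DDL could silently break it. A safe replan must compare descriptions and fail loudly rather than return mismatched tuples. Illustrative model only, not backend code; the error string here is chosen to echo Tom's description and isn't asserted to be the backend's actual message. ]

```python
# Model: Describe Statement fixes the row description; later DDL changes
# the table; a careful replan detects the mismatch instead of returning
# tuples the client can't parse.

class StaleResultError(Exception):
    pass

def row_description(columns):
    # Stand-in for the RowDescription message: an immutable column list.
    return tuple(columns)

def replan_safely(described_rowdesc, columns_now):
    # Replan is only acceptable if the result shape is unchanged.
    new_desc = row_description(columns_now)
    if new_desc != described_rowdesc:
        raise StaleResultError("cached plan must not change result type")
    return new_desc

cols = [("a", "int4"), ("b", "int4")]          # SELECT * FROM foo
described = row_description(cols)              # sent to client at Describe
cols.append(("c", "text"))                     # ALTER TABLE foo ADD COLUMN c
try:
    replan_safely(described, cols)
    outcome = "ok"
except StaleResultError as e:
    outcome = str(e)                           # mismatch caught, not hidden
```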
I think C is how the JDBC driver is written. We name the statements if they have been used more than prepareThreshold times. So we have a mechanism by which to allow statements to be cached, or not.

Dave

On 6-Mar-07, at 1:14 PM, Tom Lane wrote:

> Gregory Stark <stark@enterprisedb.com> writes:
>> Can we forcibly discard it if *any* messages are received that might
>> invalidate a plan? So basically it would work fine unless anyone in
>> the system does any DDL at all? I guess that has the downside of
>> introducing random unpredictable failures.
>
> Ugh :-(
>
>> Or stash the query string and replan it (possibly in the query cache
>> this time) if someone executes it a second time?
>
> I think that's either my plan A or C.
>
> The main problem with uncontrolled replanning is that there's no way to
> detect a change in the query properties. For example suppose the query
> is "SELECT * FROM foo" and we've already told the client (via Describe
> Statement) that that returns two integer columns. If an inval now
> arrives because of "ALTER TABLE foo ADD COLUMN" (or perhaps worse,
> ALTER COLUMN TYPE), we've got a problem. If we just blindly replan then
> we'll return tuples that do not match the previously given row
> description, which will certainly break most clients.
>
> The plan caching module has enough infrastructure to detect and
> complain about these sorts of situations, and it also knows how to
> manage lock acquisition so that once we've decided a plan is still
> good, the tables won't change underneath us while we use the plan. I
> don't see any way to make comparable guarantees without the overhead
> that goes with the cache manager.
>
> 			regards, tom lane
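[ Editorial sketch: the pgjdbc heuristic Dave describes — use the throwaway unnamed statement until a statement has been executed `prepareThreshold` times, then give it a server-side name and reuse it — can be modeled like this. Toy model only; the class, field, and statement names are made up, and the real driver's logic lives in its query executor. ]

```python
# Toy model of the prepareThreshold heuristic: one-off queries stay
# unnamed (cheap, discardable); hot queries graduate to a named,
# server-prepared statement worth caching.

class ClientStatement:
    def __init__(self, sql, prepare_threshold=5):
        self.sql = sql
        self.prepare_threshold = prepare_threshold
        self.use_count = 0
        self.server_name = None          # None => unnamed statement

    def execute(self):
        self.use_count += 1
        if self.server_name is None and self.use_count >= self.prepare_threshold:
            # Graduate to a named server-side statement (name is made up).
            self.server_name = "S_1"
        return "named" if self.server_name else "unnamed"

stmt = ClientStatement("SELECT * FROM foo WHERE a = $1", prepare_threshold=3)
modes = [stmt.execute() for _ in range(4)]
# early uses go through the unnamed statement; later ones are named
```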
On Tue, 2007-03-06 at 12:22 -0500, Tom Lane wrote:

> A. Just accept the extra overhead, thereby preserving the current
> behavior of unnamed statements, and gaining the benefit that plan
> invalidation will work correctly in the few cases where an unnamed
> statement's plan lasts long enough to need replanning.

With connection pooling, multiple sessions will execute each statement. If we check the cache each time, this does seem more expensive for each individual session, but we should gain synergy from other similar sessions. Taken across multiple sessions, A will be a win because it will reduce planning overhead by ~99%.

> C. Don't store the unnamed statement in the plan cache. To make sure
> it's not used anymore when the plan might be stale, don't analyze or
> plan at Parse-message time, but postpone *all* that work until Bind;
> and always discard the plan after Execute. We could still do "raw
> parsing" at Parse time, since that's independent of database state,
> but all but the most trivial syntactic errors would now occur at Bind
> not Parse time, as well as the majority of the time expenditure.

ISTM there will be some cases where the current behaviour will not be maintained if we implement A exactly. One thing I've not seen mentioned is the effect of constants on various plans. The current system plans at Bind time, so it can make longer-term decisions based upon the values of the initial parameters.

So I'd say we need to check the cache at Parse time, but if we do need to plan, continue to do this at Bind time (and so not write to the plan cache until that point). That might mean we end up giving some of our benefit away if multiple sessions all concurrently plan a previously unplanned query. That does seem less likely, and in any case much better than taking a step backwards in the query planning of parameterised queries.

Also, some of those plans are only currently possible with actual constants, specifically predicate proving for partial indexes and constraint exclusion. Parameter-to-constant folding may change the plan completely and make it non-reusable anyhow. How would we cope with that type of prepared query with plan inval?

-- 
  Simon Riggs
  EnterpriseDB          http://www.enterprisedb.com
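[ Editorial sketch: Simon's point about constants can be made concrete. A partial index is only usable when the planner can prove the bound value satisfies the index predicate, which requires seeing the actual constant at Bind time; a parameter-unaware generic plan must fall back to a full scan. Hypothetical planner, table, and index below — nothing here is PostgreSQL's actual planner logic. ]

```python
# Model: a partial index as in
#   CREATE INDEX ON orders (id) WHERE status = 'open'
# is chosen only when the bound constant provably satisfies the
# index predicate.

PARTIAL_INDEX_PREDICATE = ("status", "open")     # column, required value

def choose_plan(param_value=None):
    col, required = PARTIAL_INDEX_PREDICATE
    if param_value is None:
        # Generic plan: the parameter is unknown at plan time, so the
        # predicate can't be proven and the partial index is unusable.
        return "seqscan"
    if param_value == required:
        return "indexscan(partial)"              # predicate proven
    return "seqscan"                             # constant doesn't match

generic = choose_plan()          # planned without the constant
custom = choose_plan("open")     # planned at Bind time, constant known
```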
"Simon Riggs" <simon@2ndquadrant.com> writes:

> On Tue, 2007-03-06 at 12:22 -0500, Tom Lane wrote:
>> A. Just accept the extra overhead, thereby preserving the current
>> behavior of unnamed statements, and gaining the benefit that plan
>> invalidation will work correctly in the few cases where an unnamed
>> statement's plan lasts long enough to need replanning.

> With connection pooling, multiple sessions will execute each statement.
> If we check the cache each time this does seem more expensive for each
> individual session, but we should gain synergy from other similar
> sessions.

It seems fairly unlikely to me that client code would try to share an unnamed statement across multiple application threads; the entire point is that it's for one-off queries. Or did you miss the point that the plan cache is local per-backend?

> ISTM there will be some cases where the current behaviour will not be
> maintained if we implement A exactly. One thing I've not seen mentioned
> is the effect of constants on various plans.

There is none.

			regards, tom lane
I am from pgsql-jdbc, so I may not be fully "in the thread" --- please forgive any places where my misunderstanding shows. The two main questions, IMHO, are:

1) What is the key to the plan cache? The current option is some statement key (id). Another option would be the statement text (which you still need to store if you want to replan at some point). In that case you can use the same plan for multiple statements coming from different sessions --- the point Simon was talking about. This should significantly reduce planning, especially with many similar clients. As I understand it, right now every connection prepares the same statements and plans them independently. Such a change would make application servers open new connections much faster (given that they prepare the same set of statements for each connection, which is the case for my engine). This should work for both named and unnamed statements. Note that adding unnamed statements to the cache (and not removing them on statement disposal) may require a much larger cache. BTW: this is what IBM DB2 UDB does.

2) Specific plans when parameters are known. This is the point about using a partial index (and sometimes even a full index, e.g. when the parameter is a frequent value of some indexed column, or selects one of the two tables in a join). I'd say the best approach would be to keep a generic plan and try to replan starting from the generic plan's results (discarding any alternative that costs more than the generic plan). Such a replan should be much faster than the original planning, because you start from a rather good baseline. Another option is to enumerate the possibilities at original planning time and select the correct plan once the parameters are known: you check all the possible cases --- "this will be a frequent value", "this will match this partial index", and so on. The question is the number of such plans; but since each of them must be better than the generic one (and it is possible to build a tree, e.g. "A and B are not frequent" -> "A is frequent" -> "A is frequent and B meets the partial index", where children must be better than their parents), I'd guess there won't be many. And you can always cap their number and keep only the best ones when the limit is exceeded, or even track usage and keep only the plans that are actually used.
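[ Editorial sketch: proposal (1) above --- keying the plan cache by query text so identical statements from different sessions share one cached plan --- can be sketched as follows. Note that PostgreSQL's plan cache under discussion is per-backend; a shared, text-keyed cache like this is the poster's proposal, not current behavior, and every name here is made up. ]

```python
# Model of a text-keyed shared plan cache: two "sessions" preparing
# the same SQL text plan it only once.

class SharedPlanCache:
    def __init__(self):
        self.plans = {}
        self.plan_calls = 0          # how many times the planner ran

    def get_plan(self, sql):
        if sql not in self.plans:
            self.plan_calls += 1                 # actual planner invocation
            self.plans[sql] = ("PLAN", sql)      # stand-in plan tree
        return self.plans[sql]

cache = SharedPlanCache()
p1 = cache.get_plan("SELECT * FROM foo WHERE a = $1")   # session 1 prepares
p2 = cache.get_plan("SELECT * FROM foo WHERE a = $1")   # session 2 prepares
```

The upside is that the second connection's prepare is a cache hit; the downside (per the discussion above) is that a shared entry must still be invalidation-checked, and the description given to each client must be re-verified.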