Thread: Plan invalidation vs. unnamed prepared statements

Plan invalidation vs. unnamed prepared statements

From
Tom Lane
Date:
[ cc'd to pgsql-jdbc which seems the group most likely to be affected
  by any protocol change ]

So I've been working on a plan cache module per my earlier proposal,
and I've run up against a problem with getting exec_parse_message
to use it.  The problem is that the current rather hackish handling
of unnamed prepared statements doesn't fit in.  Per the documentation,
unnamed statements are supposed to be "optimized for the case of
executing a query only once and then discarding it".  In the current
code this largely just means that we avoid copying the parse/plan trees
into the normal PreparedStatement cache, preferring to let them sit
in the context where they were generated (which means that any detritus
generated by the parser/planner can't be recovered until we discard the
unnamed statement, but that seems a good tradeoff in this situation).

To use the plan cache for unnamed statements, there's going to have to
be more overhead (more tree-copying) in this code path; moreover having
the unnamed statement's plan in the cache will result in distributed
overhead for checking it to see if it's still valid.  This overhead is
largely going to be wasted if the statement is always discarded
immediately after use.

I can think of several options for dealing with this:

A. Just accept the extra overhead, thereby preserving the current
behavior of unnamed statements, and gaining the benefit that plan
invalidation will work correctly in the few cases where an unnamed
statement's plan lasts long enough to need replanning.

B. Don't store the unnamed statement in the plan cache.  To make sure
it's not used anymore when the plan might be stale, forcibly discard
the unnamed statement after execution.  This would get rid of a lot
of overhead but would mean a significant change in the protocol-level
behavior.  It's hard to guess how many clients might be broken by it
--- conceivably not any, but that seems too optimistic :-(

C. Don't store the unnamed statement in the plan cache.  To make sure
it's not used anymore when the plan might be stale, don't analyze or
plan at Parse-message time, but postpone *all* that work until Bind;
and always discard the plan after Execute.  We could still do "raw
parsing" at Parse time, since that's independent of database state,
but all but the most trivial syntactic errors would now occur at Bind
not Parse time, as well as the majority of the time expenditure.  This
still amounts to a change in the protocol semantics, although it's a
lot more subtle than plan B.  Also there's a problem if the client
does Describe Statement before Bind: we still have to run parse analysis
before we can answer, and if we then throw that away, we have no very
good way to guarantee that the statement still has the same description
when it's subsequently executed; plus we end up doing parse analysis
twice.

D. Don't store the unnamed statement in the plan cache, and just ignore
the possibility that its plan might become stale before use.  That's
exactly what happens now, but considering that the whole point of the
plan inval work is to seal off such pitfalls, I can't say that I care
for this alternative.

Comments?  I'm leaning to plan A but wanted to see if anyone would
support plan B or sees a way to fix plan C.

            regards, tom lane

Re: [HACKERS] Plan invalidation vs. unnamed prepared statements

From
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> B. Don't store the unnamed statement in the plan cache.  To make sure
> it's not used anymore when the plan might be stale, forcibly discard
> the unnamed statement after execution.  This would get rid of a lot
> of overhead but would mean a significant change in the protocol-level
> behavior.  It's hard to guess how many clients might be broken by it
> --- conceivably not any, but that seems too optimistic :-(

Can we forcibly discard it if *any* messages are received that might
invalidate a plan? So basically it would work fine unless anyone in the system
does any DDL at all? I guess that has the downside of introducing random
unpredictable failures.

Or stash the query string and replan it (possibly in the query cache this
time) if someone executes it a second time?

Can't say I like either of those options much, just trying to brainstorm.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: [HACKERS] Plan invalidation vs. unnamed prepared statements

From
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Can we forcibly discard it if *any* messages are received that might
> invalidate a plan? So basically it would work fine unless anyone in the system
> does any DDL at all? I guess that has the downside of introducing random
> unpredictable failures.

Ugh :-(

> Or stash the query string and replan it (possibly in the query cache this
> time) if someone executes it a second time?

I think that's either my plan A or C.

The main problem with uncontrolled replanning is that there's no way to
detect a change in the query properties.  For example suppose the query
is "SELECT * FROM foo" and we've already told the client (via Describe
Statement) that that returns two integer columns.  If an inval now
arrives because of "ALTER TABLE foo ADD COLUMN" (or perhaps worse, ALTER
COLUMN TYPE), we've got a problem.  If we just blindly replan then we'll
return tuples that do not match the previously given row description,
which will certainly break most clients.

The plan caching module has enough infrastructure to detect and complain
about these sorts of situations, and it also knows how to manage lock
acquisition so that once we've decided a plan is still good, the tables
won't change underneath us while we use the plan.  I don't see any way
to make comparable guarantees without the overhead that goes with the
cache manager.

            regards, tom lane

Re: [HACKERS] Plan invalidation vs. unnamed prepared statements

From
Dave Cramer
Date:
I think C is how the JDBC driver is written. We name the statements
if they have been used more than prepareThreshold times.

So we have a mechanism by which to allow statements to be cached, or
not.

Dave

On 6-Mar-07, at 1:14 PM, Tom Lane wrote:

> Gregory Stark <stark@enterprisedb.com> writes:
>> Can we forcibly discard it if *any* messages are received that might
>> invalidate a plan? So basically it would work fine unless anyone
>> in the system
>> does any DDL at all? I guess that has the downside of introducing
>> random
>> unpredictable failures.
>
> Ugh :-(
>
>> Or stash the query string and replan it (possibly in the query
>> cache this
>> time) if someone executes it a second time?
>
> I think that's either my plan A or C.
>
> The main problem with uncontrolled replanning is that there's no
> way to
> detect a change in the query properties.  For example suppose the
> query
> is "SELECT * FROM foo" and we've already told the client (via Describe
> Statement) that that returns two integer columns.  If an inval now
> arrives because of "ALTER TABLE foo ADD COLUMN" (or perhaps worse,
> ALTER
> COLUMN TYPE), we've got a problem.  If we just blindly replan then
> we'll
> return tuples that do not match the previously given row description,
> which will certainly break most clients.
>
> The plan caching module has enough infrastructure to detect and
> complain
> about these sorts of situations, and it also knows how to manage lock
> acquisition so that once we've decided a plan is still good, the
> tables
> won't change underneath us while we use the plan.  I don't see any way
> to make comparable guarantees without the overhead that goes with the
> cache manager.
>
>             regards, tom lane
>
> ---------------------------(end of
> broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>


Re: [HACKERS] Plan invalidation vs. unnamed prepared statements

From
"Simon Riggs"
Date:
On Tue, 2007-03-06 at 12:22 -0500, Tom Lane wrote:

> A. Just accept the extra overhead, thereby preserving the current
> behavior of unnamed statements, and gaining the benefit that plan
> invalidation will work correctly in the few cases where an unnamed
> statement's plan lasts long enough to need replanning.

With connection pooling, multiple sessions will execute each statement.
If we check the cache each time this does seem more expensive for each
individual session, but we should gain synergy from other similar
sessions. Taken across multiple sessions, A will be a win because it
will reduce planning overhead by ~99%.

> C. Don't store the unnamed statement in the plan cache.  To make sure
> it's not used anymore when the plan might be stale, don't analyze or
> plan at Parse-message time, but postpone *all* that work until Bind;
> and always discard the plan after Execute.  We could still do "raw
> parsing" at Parse time, since that's independent of database state,
> but all but the most trivial syntactic errors would now occur at Bind
> not Parse time, as well as the majority of the time expenditure.

ISTM there will be some cases where the current behaviour will not be
maintained if we implement A exactly. One thing I've not seen mentioned
is the effect of constants on various plans.

The current system plans at Bind time so it can make longer term
decisions based upon the values of initial parameters. So I'd say we
need to check the cache at Parse time, but if we do need to plan,
continue to do this at Bind time (and so don't write to plan cache until
that point). That might mean we end up giving some of our benefit away
if multiple sessions all concurrently plan a previously unplanned query.
That does seem less likely and in any case much better than taking a
step backwards in query planning of parameterised queries.

Also, some of those plans are only currently possible with actual
constants, specifically predicate proving for partial indexes and
constraint exclusion. Parameter to constant folding may change the plan
completely and make it non-reusable anyhow. How would we cope with that
type of prepared query with plan inval?

--
  Simon Riggs
  EnterpriseDB   http://www.enterprisedb.com



Re: [HACKERS] Plan invalidation vs. unnamed prepared statements

From
Tom Lane
Date:
"Simon Riggs" <simon@2ndquadrant.com> writes:
> On Tue, 2007-03-06 at 12:22 -0500, Tom Lane wrote:
>> A. Just accept the extra overhead, thereby preserving the current
>> behavior of unnamed statements, and gaining the benefit that plan
>> invalidation will work correctly in the few cases where an unnamed
>> statement's plan lasts long enough to need replanning.

> With connection pooling, multiple sessions will execute each statement.
> If we check the cache each time this does seem more expensive for each
> individual session, but we should gain synergy from other similar
> sessions.

It seems fairly unlikely to me that client code would try to share an
unnamed statement across multiple application threads; the entire point
is that it's for one-off queries.

Or did you miss the point that the plan cache is local per-backend?

> ISTM there will be some cases where the current behaviour will not be
> maintained if we implement A exactly. One thing I've not seen mentioned
> is the effect of constants on various plans.

There is none.

            regards, tom lane

Re: [HACKERS] Plan invalidation vs. unnamed prepared statements

From
Тимчишин Виталий
Date:
I am from pgsql-jdbc, so I may not be "in the thread", so please ignore
places where my misunderstanding goes out.
The main two questions, IMHO, is:
1) What is the key to plan cache. Current option is some statement key
(id). Another option would be statement text (you still need to store it
if you want to replan at some point). In this case you can use same plan
for multiple statements going from different sessions. That's the point
Simon was talking about. This should significantly reduce planning,
especially on multiple similar clients. Now, as I understand, every
connection prepare same statements and plan then independent. Such
change would make Application servers prepare new connections much
faster (given they prepare a number of same statements for each
connection, which is the case for my engine). This should work for both
named and unnamed. Note that adding unnamed statements to cache (and not
removing on statement disposal) may require much larger cache.
BTW: This is used by IBM DB2 UDB.
2) Specific plans when parameters are known. This is the point about
using partial index(and sometimes even using full  index- i.e.
specifying frequent value of some index or one of two tables in a join).
I'd say the best would be to have generic plan and try to replan,
starting from generic plan results (dispose any possibility that gives
values worse then generic plan). Such a replan should be much faster
then original planning because you have rather high starting point.
Another option is to catch possibilities at original planning and select
correct plan when parameters are known - you check all possible uses
with "this will be frequent value, this will match this partial index,
..." the question is the number of such plans. But since all of them
must be better then generic (and it is possible to make a three, i.e. "A
and B are not frequent" -> "A is frequent" -> "A is frequent and B meets
partial index" and children must be better then parent), I'd say there
won't be many (and you can always limit it's number and leave only the
best if one goes out of number or even collect usages and leave the
plans that are used).