Thread: Optimizing prepared statements

Optimizing prepared statements

From
"Jeroen T. Vermeulen"
Date:
I've rigged up a simple simulator for the scheme I described for detecting
pseudo-constant parameters to prepared statements.  It withstands simple
tests, and it neatly picks up cases where some parameters are
pseudo-constants and others aren't--even if some of them are more "pseudo"
while others are more "constant."

What I need now is some realistic test data.  Does anyone have realistic
SQL logs that use prepared statements?

For now, I'll summarize some results I got from randomized input data.  I
used very simple traces, with 11 prepared statements, each taking a
different number of parameters (0 through 10, inclusive).  All calls were
uniformly randomized.  I used LRU replacement of cached plans, with up to
4 retained plans per statement.  Confidence counters ran from 0 to 3
inclusive, with the confidence threshold lying between 1 and 2.

Per simulation run, 20,000 statement invocations were processed.  Runs of
20,000 took about 3.5 seconds of wall-clock time each, or 0.175
milliseconds per statement, on a lightly-loaded 1.8 GHz 32-bit Athlon XP. 
That's for simulation in Python 2.4, with no effort to optimize and no
precompilation, and several lines of information composed and printed to
/dev/null for every invocation.  So I think the overhead of the matching
and comparing that the algorithm does is not a performance worry.

In my first test, parameters were uniformly-distributed integers in the
range [0, 999].  Over this test, 104 plans were generated for the 11
plans, for an average 192 calls per generated plan.  Only 133 calls out of
20,000 used optimized plans, in this case optimizing out only one
pseudo-constant each.

When parameters were made to follow the normal distribution with mean 500
and standard deviation 100 (rounded to integers), the number of generated
plans went up to 305 as more patterns were recognized, and of course the
number of calls per generated plan dropped to 65.  Of the 20,000
invocations here, 770 used plans with one parameter optimized away, and 2
used plans with two.

These results don't look very good, but bear in mind this is for
randomized data.  Can't expect to exploit many patterns in random inputs! 
Real-life use is probably going to be much more favourable.  If we want to
guard against fluke patterns in highly-variable parameters, we can always
increase the range of the confidence counters.  That would make the
predictor more skeptical when it comes to accepting reptitions as
patterns.  Just how we tune the counters would be a tradeoff between the
cost of additional planning and the benefits of optimizing out more
parameters.

So once again, does anyone know of any realistic logs that I can use for
more useful simulations?


Jeroen




Re: Optimizing prepared statements

From
Gregory Stark
Date:
"Jeroen T. Vermeulen" <jtv@xs4all.nl> writes:

> For now, I'll summarize some results I got from randomized input data.  I
> used very simple traces, with 11 prepared statements, each taking a
> different number of parameters (0 through 10, inclusive).  All calls were
> uniformly randomized.  I used LRU replacement of cached plans, with up to
> 4 retained plans per statement.  Confidence counters ran from 0 to 3
> inclusive, with the confidence threshold lying between 1 and 2.

I'm confused, what exactly are you trying to predict? Whether each parameter
will be some cached value? Or whether the cached plan was correct?

> So once again, does anyone know of any realistic logs that I can use for
> more useful simulations?

You might look at the DBT test suite, iirc the TPCC spec it implements
intentionally mixes random queries with predictable queries.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: Optimizing prepared statements

From
"Jeroen T. Vermeulen"
Date:
On Sun, September 3, 2006 18:41, Gregory Stark wrote:

> I'm confused, what exactly are you trying to predict? Whether each
> parameter
> will be some cached value? Or whether the cached plan was correct?

That's described in more detail in a separate thread ("prepared statements
considered harmful").  In a nutshell, the algorithm detects pseudoconstant
parameters to prepared statements, and keeps a small set of different
plans optimized for recurring combinations of constants.


>> So once again, does anyone know of any realistic logs that I can use for
>> more useful simulations?
>
> You might look at the DBT test suite, iirc the TPCC spec it implements
> intentionally mixes random queries with predictable queries.

I considered the TPC benchmarks, but they're still benchmarks.  When seen
from one angle they try to look like real applications, but I'm worried
that in testing this algorithm, I may be looking at them from a very
different angle.  I'd still need actual application logs to validate them!


Jeroen




Re: Optimizing prepared statements

From
Gregory Stark
Date:
"Jeroen T. Vermeulen" <jtv@xs4all.nl> writes:

> On Sun, September 3, 2006 18:41, Gregory Stark wrote:
>
>> I'm confused, what exactly are you trying to predict? Whether each
>> parameter
>> will be some cached value? Or whether the cached plan was correct?
>
> That's described in more detail in a separate thread ("prepared statements
> considered harmful").  In a nutshell, the algorithm detects pseudoconstant
> parameters to prepared statements, and keeps a small set of different
> plans optimized for recurring combinations of constants.

I read that but apparently I misunderstood it since it would not have been
doable the way I understood it. I thought you wanted the predictor bits to
correspond to particular plans. If a plan was "wrong" then you marked it as a
bad guess. I don't think that can be made to work though for the reasons I
stated then.

But if you have something working clearly that's not what you're doing. So
what are you doing? Storing up a list of arguments seen for each parameter
when executed and use the predictor bits to determine if any of those
arguments are constants? Storing up a list of selectivity estimates?

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: Optimizing prepared statements

From
"Jeroen T. Vermeulen"
Date:
On Sun, September 3, 2006 21:52, Gregory Stark wrote:

> I read that but apparently I misunderstood it since it would not have been
> doable the way I understood it. I thought you wanted the predictor bits to
> correspond to particular plans. If a plan was "wrong" then you marked it
> as a
> bad guess. I don't think that can be made to work though for the reasons I
> stated then.

Oh, sorry--I guess I haven't been too systematic about it.  In the
algorithm's current incarnation, the confidence counters don't mark
anything as bad ideas, as such.  Instead, whenever a new invocation has a
set of parameters that are (1) correctly and (2) confidently predicted by
the counters, the predictor decides that it's probably a good idea to plan
for calls with that particular set of parameters built in as constants.

Assuming we didn't already have a plan for the particular combination of
values at hand, the algorithm generates a new one.  The new plan is
optimized on the assumption that those predicted parameters are constants.

We keep a small cache of recently-used plans, possibly including the
original plan where all parameters are truly variable.  Every plan also
remembers a list of the predicted parameter values, so on any next call,
we can check whether a particular cached plan actually applies to the
call.  If it doesn't (because of a mismatch between the incoming
parameters and the plan's assumed pseudo-constants), we just pick another
plan.

If multiple cached plans can be applied to a given call, we prefer the one
that optimizes away the most parameters.  Age is used as a tiebreaker, on
the assumption that more recent planning information is likely to be more
accurate.

The tricky part is deciding when to generate a new, more specialized plan
when we already have a matching one that may not be optimal.  Without this
step we'd never get beyond that first, completely generic plan--it applies
to every call.  The way I've approached it is this: when the predictor's
current state correctly and confidently predicts more of the invocation's
parameter values than any of the cached plans did, then it's time to
generate a new plan.

So let's say your statement has a parameter x that's always the same
value, say x==0, and another parameter y that's a randomly alternating
Boolean value, and another one z that varies randomly between lots of
values.  What order they're in doesn't matter, and they needn't be the
only parameters.  You're probably going to see four plans generated for
this example:

1. Either on the first call or during definition, you get the generic
plan.  This is the same plan you'd get in the existing backend, with
placeholders for variable x, y, and z.

2. Pretty soon, the algorithm is going to detect that x is always zero. 
It will generate a new plan, substituting the constant value 0 for x, and
hopefully getting better optimization because of it.

3. Sooner or later (probably fairly soon) you'll see a run of consecutive
calls where y happens to be "true."  A new plan is generated with the
assumption that y==true.  The new plan will also still assume that x==0.

4. The same is going to happen for y==false.  Yet another specialized plan
is generated.  If we keep up to 3 plans per statement, say, then this new
plan overflows the cache.  The least recently used plan is flushed to make
room for the new one--in this case, the generic one because we haven't
seen any cases where x!=0 recently.

More complex scenarios will also happen, of course, such as "if y==true
then x will usually be 0, but otherwise x will be highly variable" or "if
y==true then x is pseudo-constant and z is highly variable, but if
y==false then it's the other way around" or "if y==false then z is usually
the empty string," and so on.  The predictor as I've simulated it is "too
dumb to be intimidated" by the complexity.  It should work reasonably well
for all those scenarios, assuming of course that its cache is large enough
to remember the most frequent patterns.

Right now I'm using the plans' time since last use as the only eviction
criterion when the cache overflows.  There may be a better policy; the one
Achilles heel of LRU is the "big loop" where every cache entry is used
once, then evicted shortly before it is needed again.  (If the loop is so
big that entries are flushed long before they're needed again, well, then
it's just a big job and you stop blaming LRU :-)


> But if you have something working clearly that's not what you're doing. So
> what are you doing? Storing up a list of arguments seen for each parameter
> when executed and use the predictor bits to determine if any of those
> arguments are constants? Storing up a list of selectivity estimates?

The former.  I'm keeping a single predictor with a single "more or less
last-seen value" per parameter; plus a checklist of pseudoconstants for
every cached plan.  It's pretty simple, really, with no cost functions or
spanning trees or other intelligent logic--and certainly nothing original.Which is what makes me cautiously optimistic:
it'snot so hard to come up
 
with good or original ideas, but ones that are good *and* original are
exceedingly rare.  :)

This particular algorithm is based entirely on standard processor
architecture tricks.  One (to me) surprising lesson I learned in that
field was that simple, dynamic schemes with obvious flaws are often more
effective than what the combination of a smart programmer, an expert user,
and an aggressive compiler can come up with beforehand.  Another one was
that performance breakthroughs often come from applying these standard
tricks to problems you'd think they'd already been applied to.


Jeroen




Re: Optimizing prepared statements

From
"Jeroen T. Vermeulen"
Date:
On Sun, September 3, 2006 23:28, Jeroen T. Vermeulen wrote:
> On Sun, September 3, 2006 21:52, Gregory Stark wrote:

>> I read that but apparently I misunderstood it since it would not have
>> been
>> doable the way I understood it. I thought you wanted the predictor bits
>> to
>> correspond to particular plans. If a plan was "wrong" then you marked it
>> as a
>> bad guess. I don't think that can be made to work though for the reasons
>> I
>> stated then.

Come to think of it, I still haven't answered your question very clearly...

I keep one predictor for any given statement.  The predictor keeps two
pieces of state *per statement parameter*:

1. The predictor value (usually the last-seen value for that parameter,
though it does take a few mismatches to make it drop a value that was
repeated a lot).

2. A saturating counter measuring confidence in the current predictor
value.  As long as this counter is below the confidence threshold, the
predictor value has no other meaning than "let's see if this value comes
back."

No information flows between these pieces of state.  All logic in the
predictor itself is entirely local to individual parameters.  But any new
plan is generated based on a "snapshot" of all predictor values that (i)
match their respective parameter values in the ongoing call, and (ii) have
their counters above the confidence threshold.  It is that combination of
parameters, for that combination of values, that is taken as a set of
pseudo-constants.

The cache can hold any number of plans optimized for the same combinations
of parameters but with different values; or for different combinations of
parameters but the same values; subsets and supersets of parameters with
either the same or different values; completely disjoint sets; anything. 
The cache neither knows nor cares.  The only thing that *won't* happen is
two plans in the same cache covering the same set of parameters with the
same set of values--because there is never any need to generate that
second plan while the first is still in cache.


Jeroen




Re: Optimizing prepared statements

From
Tom Lane
Date:
"Jeroen T. Vermeulen" <jtv@xs4all.nl> writes:
> If multiple cached plans can be applied to a given call, we prefer the one
> that optimizes away the most parameters.

What exactly do you mean by "optimize away a parameter"?  The way you
described the mechanism, there are no parameters that are "optimized
away", you've merely adjusted selectivity predictions using some assumed
values.  Actually converting a parameter to a constant is a whole
'nother matter --- it allows constant-folding for example.  But then you
*cannot* use the plan unless there's an exact match to the assumed
value.  These two approaches provide very different tradeoffs of plan
quality vs plan specificity, so it makes me uncomfortable that you're
failing to clarify what you mean.
        regards, tom lane


Re: Optimizing prepared statements

From
Gregory Stark
Date:
"Jeroen T. Vermeulen" <jtv@xs4all.nl> writes:

> Oh, sorry--I guess I haven't been too systematic about it.  In the
> algorithm's current incarnation, ...

Thanks, that cleared things up enormously. I'm trying to figure how it would
react to some of the queries I've written in the past. In particular I'm
thinking of queries like

WHERE (? OR category = ?) AND (? OR cost < ?) AND (? OR description like ?||'%')

Where I then pass flags in from the application to indicate which search
constraints to apply. If it notices that most searches are for a particular
set of constraints it would be able to cache plans with the unused constraints
removed.

It would not however be able to notice that the last parameter never contains
a % and therefore can use an index scan.

I'm also wondering how this interacts with plan stability. Obviously the
direct effect is to throw out any chance of it. But in the long run they may
be two complementary sides of the same thing.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: Optimizing prepared statements

From
"Jeroen T. Vermeulen"
Date:
On Sun, September 3, 2006 23:52, Tom Lane wrote:

> What exactly do you mean by "optimize away a parameter"?  The way you
> described the mechanism, there are no parameters that are "optimized
> away", you've merely adjusted selectivity predictions using some assumed
> values.

I'm using "optimized away" as shorthand for "replaced with a literal
constant in the statement definition used to generate the plan."  So if a
parameter $n is found to be, say, always equal to the string 'foo', then
we might want to generate a specialized plan as if the statement's
definition contained the literal string 'foo' wherever it really says $n. 
I've been calling that "optimized for $n=='foo'" or "$n is optimized away
in this plan."


>  Actually converting a parameter to a constant is a whole
> 'nother matter --- it allows constant-folding for example.  But then you
> *cannot* use the plan unless there's an exact match to the assumed
> value.  These two approaches provide very different tradeoffs of plan
> quality vs plan specificity, so it makes me uncomfortable that you're
> failing to clarify what you mean.

Right.  When I said "optimized for" a certain parameter value, I meant
actual substitution the whole time.  I'm sorry if I didn't make that
clear; it seemed so basic that I must have forgotten to mention it.  I
guess the principle would also work otherwise, but it's intended to allow
constant folding.

So for any given statement, there would be a cache of frequently-needed
plans for different sets of constant substitutions.  As you point out, a
call could only use a plan if the plan's substitutions were consistent
with the call's parameter values (but within that constraint, the more
substitutions the merrier).  That's why I talked so much about comparing
and matching: that part is for correctness, not optimization.

As I've said before, all this falls down if there is a significant cost to
keeping one or two extra plans per prepared statement.  You mentioned
something about "tracking" plans.  I don't know what that means, but it
sounded like it might impose a runtime cost on keeping plans around. 
Could you elaborate?


Jeroen




Re: Optimizing prepared statements

From
"Jeroen T. Vermeulen"
Date:
On Mon, September 4, 2006 03:56, Gregory Stark wrote:

> Thanks, that cleared things up enormously. I'm trying to figure how it
> would
> react to some of the queries I've written in the past. In particular I'm
> thinking of queries like
>
> WHERE (? OR category = ?)
>   AND (? OR cost < ?)
>   AND (? OR description like ?||'%')
>
> Where I then pass flags in from the application to indicate which search
> constraints to apply. If it notices that most searches are for a
> particular
> set of constraints it would be able to cache plans with the unused
> constraints
> removed.

Right.  That's pretty much the problem as Peter originally described it, I
think.


> It would not however be able to notice that the last parameter never
> contains a % and therefore can use an index scan.

If I understand you correctly, then no.  If the algorithm sees highly
variable values for that last parameter, it will never decide to assume
that that parameter will never contain '%'--and I'm not sure how that
could be done safely.

I do see two glimmers of hope, however:

1. If that last parameter is usually some specific value, then you'll
probably end up using specialized plans with that specific value in the
parameter's place.  If that value is a string without wildcards, you can
use your index on description (assuming you have one).  If it's '%' or
null, the optimizer can decide to ignore the "like" clause.  It's only
when the scheme finds that it cannot predict what the parameter's value is
going to be that you get the generic, poorly-performing code.

2. Once we have a predictor, and assuming it works, it could be tied in
with the planner a bit more.  As I believe Tom said, the planner can't
afford to chase down lots of scenarios just in case they ever happen.  But
when a parameter is used only for simple matches or inserts on non-indexed
columns, for example, the planner might find in the course of its normal
activities that there's nothing useful it can do with that parameter and
deliver this information with its plan, so that the predictor can ignore
the parameter.


> I'm also wondering how this interacts with plan stability. Obviously the
> direct effect is to throw out any chance of it. But in the long run they
> may be two complementary sides of the same thing.

Well, it'll cause some plans to be re-generated, surely.  But the
impression I've gotten from the discussion so far is some that plans were
getting too old anyway.


Jeroen




Re: Optimizing prepared statements

From
Martijn van Oosterhout
Date:
On Mon, Sep 04, 2006 at 11:12:13AM +0700, Jeroen T. Vermeulen wrote:
> As I've said before, all this falls down if there is a significant cost to
> keeping one or two extra plans per prepared statement.  You mentioned
> something about "tracking" plans.  I don't know what that means, but it
> sounded like it might impose a runtime cost on keeping plans around.

I think what he meant is tracking plans during the planning process.
Currently at the end of each step you weed out all the plans that arn't
the best for each path-key. To track multiple results at that stage
would be expensive.

However, just running the planner over the same query multiple times
with different estimates shouldn't be too expensive to store.

However, you're discussing the process of replanning based on changes
in variables. At the moment we really need to work on replanning
generally, it isn't done at all currently...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.

Re: Optimizing prepared statements

From
Tom Lane
Date:
"Jeroen T. Vermeulen" <jtv@xs4all.nl> writes:
> On Sun, September 3, 2006 23:52, Tom Lane wrote:
>> What exactly do you mean by "optimize away a parameter"?  The way you
>> described the mechanism, there are no parameters that are "optimized
>> away", you've merely adjusted selectivity predictions using some assumed
>> values.

> I'm using "optimized away" as shorthand for "replaced with a literal
> constant in the statement definition used to generate the plan."

Ah.  I think you're confusing the spectators by using "predict" when you
should say "match".  You're looking for previously generated plans that
have assumed parameter values matching the current query --- saying that
the plan "predicts" a parameter value is just a really poor choice of
wording.
        regards, tom lane


Re: Optimizing prepared statements

From
"Jeroen T. Vermeulen"
Date:
On Mon, September 4, 2006 23:03, Tom Lane wrote:

> Ah.  I think you're confusing the spectators by using "predict" when you
> should say "match".  You're looking for previously generated plans that
> have assumed parameter values matching the current query --- saying that
> the plan "predicts" a parameter value is just a really poor choice of
> wording.

I guess so...  It's been over a decade since I last busied myself with
database optimization, so I don't know any of the jargon.  Had to use
processor architecture jargon instead.

So with that out of the way, can anyone think of some good real-life
examples of prepared statement usage that I can test against?  Suggestions
I've had include TPC, DBT2 (based on TPC-C), and pgbench, but what I'm
really looking for is traces of invocations by real applications.

If we don't have a body of application traces available, can anyone think
of a convenient, nonintrusive way I could extract them out of applications
I'm running?  If there is, I could write a filter to eliminate irrelevant
information.  The filter would ignore everything other than prepared
statement invocations.  It could randomize statement names, strip out
their definitions, hide the ordering between different statements, and
replace parameter values with meaningless numbers; so it would be
relatively safe for others to volunteer traces of their own applications. 
None of those transformations would affect my simulation results.


Jeroen




Re: Optimizing prepared statements

From
Josh Berkus
Date:
Jeroen,

> So with that out of the way, can anyone think of some good real-life
> examples of prepared statement usage that I can test against?  Suggestions
> I've had include TPC, DBT2 (based on TPC-C), and pgbench, but what I'm
> really looking for is traces of invocations by real applications.

I've requested that the Sun SpecJApp benchmark engineer harvest a log for us.

-- 
Josh Berkus
PostgreSQL @ Sun
San Francisco