Thread: Re: [PERFORM] Query much slower when run from postgres function

Re: [PERFORM] Query much slower when run from postgres function

From: Guillaume Cottenceau
Tom Lane <tgl 'at' sss.pgh.pa.us> writes:

> Mario Splivalo <mario.splivalo@megafon.hr> writes:
>> Now I'm confused, why is 'sql' function much slower than 'direct' SELECT?
>
> Usually the reason for this is that the planner chooses a different plan
> when it has knowledge of the particular value you are searching for than
> when it does not.

Yes, and since Mario is coming from JDBC, I'll share my part on
this: I also noticed some very wrong plans in JDBC because of the
"optimization" in prepared statements consisting of planning once
for all runs, i.e. without any parameter values to help planning.

My understanding is that, in practice, it's difficult for the
planner to decide whether to opt for an index, because the
selectivity of a parameter may be very different depending on
the actual value used.

Normally, the planner "thinks" that planning is so costly that
it's better to plan once for all runs, but in practice, for our
use, this is very wrong (it may be very good for some uses,
though it would be interesting to know how common each case is).

Until it's possible to specifically tell the JDBC driver (and/or
PG?) to not plan once for all runs (or is there something better
to think of?), or the whole thing would be more clever (off the
top of my head, PG could try to replan with the first actual
values - or first xx actual values - and if the plan is
different, then flag that prepared statement for replanning each
time if the overall time estimate is different enough), I've
opted to tell the JDBC driver to use the protocol version 2, as
prepared statements were not so much prepared back then (IIRC
parameter interpolation is performed in driver and the whole SQL
query is passed each time, parsed, and planned) using
protocolVersion=2 in the JDBC URL. So far it worked very well for
us.
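
For illustration, a minimal sketch of such a connection setup with
the drivers of that era (host, database name and credentials here
are placeholders, not from this thread):

    // Sketch: requesting the V2 wire protocol from the 8.x-era
    // PostgreSQL JDBC driver, so parameters are interpolated by the
    // driver and every query is parsed and planned with its actual
    // values.
    import java.sql.Connection;
    import java.sql.DriverManager;

    public class V2ProtocolSketch {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:postgresql://localhost:5432/mydb?protocolVersion=2";
            Connection conn = DriverManager.getConnection(url, "user", "password");
            // ... run queries as usual; no server-side prepared statements ...
            conn.close();
        }
    }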

--
Guillaume Cottenceau

Re: [PERFORM] Query much slower when run from postgres function

From: Guillaume Smet
On Mon, Mar 9, 2009 at 5:51 PM, Guillaume Cottenceau <gc@mnc.ch> wrote:
> Until it's possible to specifically tell the JDBC driver (and/or
> PG?) to not plan once for all runs (or is there something better
> to think of?), or the whole thing would be more clever (off the
> top of my head, PG could try to replan with the first actual
> values - or first xx actual values - and if the plan is
> different, then flag that prepared statement for replanning each
> time if the overall time estimate is different enough), I've
> opted to tell the JDBC driver to use the protocol version 2, as
> prepared statements were not so much prepared back then (IIRC
> parameter interpolation is performed in driver and the whole SQL
> query is passed each time, parsed, and planned) using
> protocolVersion=2 in the JDBC URL. So far it worked very well for
> us.

Unnamed prepared statements are planned after binding the values,
starting with 8.3, or more precisely starting with 8.3.2 as early 8.3
versions were partially broken in this respect.

It's not always possible to use protocol version 2 as it's quite
limited (especially considering the exceptions returned).

--
Guillaume

Re: [PERFORM] Query much slower when run from postgres function

From: Tom Lane
Guillaume Smet <guillaume.smet@gmail.com> writes:
> Unnamed prepared statements are planned after binding the values,
> starting with 8.3, or more precisely starting with 8.3.2 as early 8.3
> versions were partially broken in this respect.

No, 8.2 did it too (otherwise we wouldn't have considered 8.3.0 to be
broken...).  The thing I'm not too clear about is what "use of an
unnamed statement" translates to for a JDBC user.

            regards, tom lane

Re: [PERFORM] Query much slower when run from postgres function

From: Andreas Wenk
Tom Lane wrote:
> Guillaume Smet <guillaume.smet@gmail.com> writes:
>> Unnamed prepared statements are planned after binding the values,
>> starting with 8.3, or more precisely starting with 8.3.2 as early 8.3
>> versions were partially broken in this respect.
>
> No, 8.2 did it too (otherwise we wouldn't have considered 8.3.0 to be
> broken...).  The thing I'm not too clear about is what "use of an
> unnamed statement" translates to for a JDBC user.
>
>             regards, tom lane
>
I followed another post on the PHP list. Andrew McMillan was talking
about his experiences with UDFs in Oracle and PG (--> look for subject:
Re: [PHP] pl/php for windows). He wrote that, when using UDFs, the
planner sometimes chooses strange and poorly performing plans. So far
I had understood that using UDFs is generally a good idea - compared
with the work I would have to do coding the same thing in e.g. PHP,
and also because of the better resulting performance with UDFs. So
what is your experience with using UDFs (plpgsql)? Is there something
like "use it in this case but not in that case"?

Your answers are very welcome ...

Cheers

Andy



Re: [PERFORM] Query much slower when run from postgres function

From: Mario Splivalo
Guillaume Cottenceau wrote:
>>> Now I'm confused, why is 'sql' function much slower than 'direct' SELECT?
>> Usually the reason for this is that the planner chooses a different plan
>> when it has knowledge of the particular value you are searching for than
>> when it does not.
>
> Yes, and since Mario is coming from JDBC, I'll share my part on
> this: I also noticed some very wrong plans in JDBC because of the
> "optimization" in prepared statements consisting of planning once
> for all runs, i.e. without any parameter values to help planning.
>

For what it's worth:

When I call the postgres function via JDBC, I get almost the same
execution time as when calling the function from psql.

When I call the SELECT COUNT(*)... WHERE... query from JDBC, I again get
almost the same execution time as when executing the query from psql.

The postgres function takes around 200ms, and the SELECT query takes
around 2-4ms.

    Mike

Re: [PERFORM] Query much slower when run from postgres function

From: Dave Cramer


On Mon, Mar 9, 2009 at 1:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Guillaume Smet <guillaume.smet@gmail.com> writes:
> Unnamed prepared statements are planned after binding the values,
> starting with 8.3, or more precisely starting with 8.3.2 as early 8.3
> versions were partially broken in this respect.

No, 8.2 did it too (otherwise we wouldn't have considered 8.3.0 to be
broken...).  The thing I'm not too clear about is what "use of an
unnamed statement" translates to for a JDBC user.

Tom,

The driver will use unnamed statements for all statements until it sees the same statement N times (where N is 5, I believe); after that it uses a named statement.

Dave

Re: [PERFORM] Query much slower when run from postgres function

From: James Mansion
>
> The driver will use unnamed statements for all statements until it
> sees the same statement N times where N is 5 I believe, after that it
> uses a named statement.
>
>
Shame there's no syntax for it to pass a table of the parameters to
the server as planner hints when it creates the named statement.

James



Re: [PERFORM] Query much slower when run from postgres function

From: Scott Carey

  1. And how do you do that from JDBC?  There is no standard concept of ‘unnamed’ prepared statements in most database APIs, and if there were, the behavior would be db specific.  Telling PG to plan after binding should be more flexible than unnamed prepared statements — or at least more transparent to standard APIs.  E.g. SET plan_prepared_postbind=’true’.
  2. How do you use those on a granularity other than global from JDBC?  ( — I tried setting max_prepared_transactions to 0 but this didn’t seem to work either, and it would be global if it did.)  Prepared statements are still great for things like selecting off a primary key, and especially for inserts.  Controls at the connection or user level would be significantly more valuable than global ones.
  3. Is it possible to test them from psql?  (Documentation is weak, PREPARE requires a name, functions require names, etc.; the C API has docs but that’s not of use for most.)

I’d love to know if there were answers to the above that were workable.

In the end, we had to write our own client-side code to deal with SQL injection safely and avoid JDBC prepared statements to get acceptable performance in many cases (all cases involving partitioned tables, plus a few others).  At least dollar-quotes are powerful and useful for dealing with this.  Since the most important benefits of prepared statements are code clarity and SQL injection protection, it's sad to see weaknesses in control/configuration over prepared statement behavior at the parse/plan level get in the way of using them for those benefits.



On 3/9/09 9:04 AM, "Guillaume Smet" <guillaume.smet@gmail.com> wrote:

On Mon, Mar 9, 2009 at 5:51 PM, Guillaume Cottenceau <gc@mnc.ch> wrote:
> Until it's possible to specifically tell the JDBC driver (and/or
> PG?) to not plan once for all runs (or is there something better
> to think of?), or the whole thing would be more clever (off the
> top of my head, PG could try to replan with the first actual
> values - or first xx actual values - and if the plan is
> different, then flag that prepared statement for replanning each
> time if the overall time estimate is different enough), I've
> opted to tell the JDBC driver to use the protocol version 2, as
> prepared statements were not so much prepared back then (IIRC
> parameter interpolation is performed in driver and the whole SQL
> query is passed each time, parsed, and planned) using
> protocolVersion=2 in the JDBC URL. So far it worked very well for
> us.

Unnamed prepared statements are planned after binding the values,
starting with 8.3, or more precisely starting with 8.3.2 as early 8.3
versions were partially broken in this respect.

It's not always possible to use protocol version 2 as it's quite
limited (especially considering the exceptions returned).

--
Guillaume


Re: [PERFORM] Query much slower when run from postgres function

From: Oliver Jowett
Dave Cramer wrote:
>
>
> On Mon, Mar 9, 2009 at 1:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
>     Guillaume Smet <guillaume.smet@gmail.com> writes:
>     > Unnamed prepared statements are planned after binding the values,
>     > starting with 8.3, or more precisely starting with 8.3.2 as early 8.3
>     > versions were partially broken in this respect.
>
>     No, 8.2 did it too (otherwise we wouldn't have considered 8.3.0 to be
>     broken...).  The thing I'm not too clear about is what "use of an
>     unnamed statement" translates to for a JDBC user.
>
>
> Tom,
>
> The driver will use unnamed statements for all statements until it sees
> the same statement N times where N is 5 I believe, after that it uses a
> named statement.

Right, with the caveat that "the same statement" means "exactly the same
PreparedStatement object". If you happen to run the same (textual) query
via two different PreparedStatement objects, they're still considered
different queries for the purposes of this threshold.

You can also tune the threshold via the prepareThreshold parameter in
the driver URL, or use org.postgresql.PGStatement.setPrepareThreshold
(an extension interface implemented by the driver on its Statement
objects) on a per-statement basis.

prepareThreshold=0 is a special value that means "never use a named
statement".

The idea behind the threshold is that if a PreparedStatement object is
reused, that's a fairly good indication that the application wants to
run the same query many times with different parameters (since it's
going to the trouble of preserving the statement object for reuse). But
it's all tunable if needed.
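
For illustration, a minimal sketch of both knobs (the table and
column names are placeholders, and the cast assumes the stock
driver, i.e. no pooling wrapper around the Statement object):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import org.postgresql.PGStatement;

    public class PrepareThresholdSketch {
        public static void main(String[] args) throws Exception {
            // Connection-wide default via the URL: never switch to a
            // named statement.
            String url = "jdbc:postgresql://localhost:5432/mydb?prepareThreshold=0";
            Connection conn = DriverManager.getConnection(url, "user", "password");

            PreparedStatement ps = conn.prepareStatement(
                "SELECT count(*) FROM items WHERE owner_id = ?");
            // Per-statement override: use a named (server-prepared)
            // statement from the first execution onwards.
            ((PGStatement) ps).setPrepareThreshold(1);

            ps.setInt(1, 42);
            ps.executeQuery().close();
            conn.close();
        }
    }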

Also see http://jdbc.postgresql.org/documentation/head/server-prepare.html

-O

Re: [PERFORM] Query much slower when run from postgres function

From: Oliver Jowett
Scott Carey wrote:
>
>    1. And how do you do that from JDBC?  There is no standard concept of
>       ‘unnamed’ prepared statements in most database APIs, and if there
>       were the behavior would be db specific.  Telling PG to plan after
>       binding should be more flexible than unnamed prepared statements —
>       or at least more transparent to standard APIs.  E.g. SET
>       plan_prepared_postbind=’true’.

I've suggested that as a protocol-level addition in the past, but it
would mean a new protocol version. The named vs. unnamed statement
behaviour was an attempt to crowbar it into the protocol without
requiring a version change. If it's really a planner behaviour thing,
maybe it does belong at the SET level, but I believe that there's
usually an aversion to having to SET anything per query to get
reasonable plans.

>    2. How do you use those on a granularity other than global from jdbc?

prepareThreshold=N (as part of a connection URL),
org.postgresql.PGConnection.setPrepareThreshold() (connection-level
granularity), org.postgresql.PGStatement.setPrepareThreshold()
(statement-level granularity). See the driver docs.
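
A short sketch of the connection-level form, assuming the stock
driver so the cast succeeds (database and credentials are
placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import org.postgresql.PGConnection;

    public class ConnectionThresholdSketch {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
            // Default threshold for every statement later created on
            // this connection; 0 means "never switch to a named
            // (server-prepared) statement".
            ((PGConnection) conn).setPrepareThreshold(0);
            conn.close();
        }
    }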

>          ( — I tried setting max_prepared_transactions to 0 but this
>       didn’t seem to work either, and it would be global if it did).

max_prepared_transactions is to do with two-phase commit, not prepared
statements.

> In the end, we had to write our own client side code to deal with sql
> injection safely and avoid jdbc prepared statements to get acceptable
> performance in many cases (all cases involving partitioned tables, a few
> others).  At least dollar-quotes are powerful and useful for dealing
> with this.  Since the most important benefit of prepared statements is
> code clarity and sql injection protection, its sad to see weakness in
> control/configuration over prepared statement behavior at the parse/plan
> level get in the way of using them for those benefits.

It's unfortunate that you ended up doing this, because it >is< all
configurable on the JDBC side. Did you ask on pgsql-jdbc?

-O

Re: [PERFORM] Query much slower when run from postgres function

From: Scott Carey
On 3/9/09 1:40 PM, "Oliver Jowett" <oliver@opencloud.com> wrote:

Scott Carey wrote:
>
>    1. And how do you do that from JDBC?  There is no standard concept of

I've suggested that as a protocol-level addition in the past, but it
would mean a new protocol version. The named vs. unnamed statement
behaviour was an attempt to crowbar it into the protocol without
requiring a version change. If it's really a planner behaviour thing,
maybe it does belong at the SET level, but I believe that there's
usually an aversion to having to SET anything per query to get
reasonable plans.

There’s a strong aversion, but I find myself re-writing queries to get good plans - a de-facto hint, really.  It’s mandatory in the land of partitioned tables and large aggregates, much rarer elsewhere.  I have a higher aversion to rewriting queries than to telling the planner to use more information or providing it with more information.

>    2. How do you use those on a granularity other than global from jdbc?

prepareThreshold=N (as part of a connection URL),
org.postgresql.PGConnection.setPrepareThreshold() (connection-level
granularity), org.postgresql.PGStatement.setPrepareThreshold()
(statement-level granularity). See the driver docs.

I know I’ve tried the connection URL thing once, and it did not fix the performance problem; I did not know if it was user error.  Without knowing how to trace what the query really was, or whether the setting was working properly, or having any other easy avenue to see whether an unnamed prepared statement even fixed my problem, I had to resort to what would clearly fix it (there was only one day to fix it, and there was one proven way to do so).  I would love to be able to try out an unnamed prepared statement in psql, to prove whether it even solves the query planning issue.  In the end, it was simpler to change the code, and probably less time consuming than all the options other than the connection URL setting.


>          ( — I tried setting max_prepared_transactions to 0 but this
>       didn’t seem to work either, and it would be global if it did).

max_prepared_transactions is to do with two-phase commit, not prepared
statements.

Thanks! Good to know - the configuration documentation could be clearer... I got the two prepares confused.

> In the end, we had to write our own client side code to deal with sql
> injection safely and avoid jdbc prepared statements to get acceptable

It's unfortunate that you ended up doing this, because it >is< all
configurable on the JDBC side. Did you ask on pgsql-jdbc?

-O

I searched the archives and did find a reference to the connection URL setting; I recall trying that but not seeing the expected result.  Rather than debugging, a decision was made to go with the solution that worked and be done with it.  This was also when we were in production on 8.3.1 or 8.3.2 or so, so the bugs there might have caused some confusion in the rush to solve the issue.

I’m still not sure that unnamed prepared statements will help my case.  If the driver is using unnamed prepared statements for the first 5 uses of a query and then naming it, I should see the first 5 uses significantly faster than those after.  I’ll keep an eye out for that in the places where we are still using prepared statements that can cause problems, and in the old log files.  Until another issue comes up, there isn’t sufficient motivation to fix what is no longer broken for us.

Thanks for the good info on dealing with configuring unnamed prepared statements with the jdbc driver.  That may come in very handy later.

Re: [PERFORM] Query much slower when run from postgres function

From: Guillaume Cottenceau
Oliver Jowett <oliver 'at' opencloud.com> writes:

> The idea behind the threshold is that if a PreparedStatement object is
> reused, that's a fairly good indication that the application wants to
> run the same query many times with different parameters (since it's
> going to the trouble of preserving the statement object for reuse). But

Or it may just need the safeness of driver/database parameter
"interpolation", to get a "free" efficient safeguard against SQL
injection. As for myself, I have found no other way to obtain
driver/database parameter interpolation, so I sometimes use
prepared statements even for running a query only once. I am
unsure whether it is a widely used pattern, but SQL injection
being quite important to fight against, I think I may not be the
only one.

--
Guillaume Cottenceau

Re: [PERFORM] Query much slower when run from postgres function

From: Oliver Jowett
Guillaume Cottenceau wrote:
> Oliver Jowett <oliver 'at' opencloud.com> writes:
>
>> The idea behind the threshold is that if a PreparedStatement object is
>> reused, that's a fairly good indication that the application wants to
>> run the same query many times with different parameters (since it's
>> going to the trouble of preserving the statement object for reuse). But
>
> Or it may just need the safeness of driver/database parameter
> "interpolation", to get a "free" efficient safeguard against SQL
> injection.

In which case, the application usually throws the PreparedStatement
object away after executing it once, and the threshold is never reached.
As I said, the application has to do extra work to preserve exactly the
same PreparedStatement object for reuse before the threshold applies, at
which point it's reasonable to assume that it could be a
performance-sensitive query that would benefit from preserving the query
plan and avoiding parse/plan costs on every execution.

It's just a heuristic because there *is* a tradeoff and many/most
applications are not going to be customized specifically to know about
that tradeoff. And it's configurable because the tradeoff is not the
same in every case.

Do you have a suggestion for a better way to decide when to use a named
statement?

-O

Re: [PERFORM] Query much slower when run from postgres function

From: Guillaume Cottenceau
Oliver Jowett <oliver 'at' opencloud.com> writes:

> Guillaume Cottenceau wrote:
>> Oliver Jowett <oliver 'at' opencloud.com> writes:
>>
>>> The idea behind the threshold is that if a PreparedStatement object is
>>> reused, that's a fairly good indication that the application wants to
>>> run the same query many times with different parameters (since it's
>>> going to the trouble of preserving the statement object for reuse). But
>>
>> Or it may just need the safeness of driver/database parameter
>> "interpolation", to get a "free" efficient safeguard against SQL
>> injection.
>
> In which case, the application usually throws the PreparedStatement
> object away after executing it once, and the threshold is never reached.
> As I said, the application has to do extra work to preserve exactly the
> same PreparedStatement object for reuse before the threshold applies, at
> which point it's reasonable to assume that it could be a
> performance-sensitive query that would benefit from preserving the query
> plan and avoiding parse/plan costs on every execution.

Thanks for the clarification!

That may just be me, but I see two issues here. First, parsing
and planning are tied together, yet parsing should only be done
the first time, as I see no point in reparsing on subsequent uses
of the PreparedStatement. Second, it's still questionable whether
a "performance-sensitive" query necessarily benefits from
preserving the query plan: I have seen dramatic cases where the
preserved plan opted for a seqscan and the query was then orders
of magnitude slower than it should have been, because the values
actually used would have qualified for an indexscan.
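
As an aside, a rough way to see that effect is to compare the plan
for the query with a literal value against the plan of a server-side
PREPAREd version, which on 8.3 is also planned without knowing the
parameter value (this uses a named statement, so it only
approximates what the driver does; the table and column names are
placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class GenericPlanSketch {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
            Statement st = conn.createStatement();

            // Plan chosen with knowledge of the actual value.
            dump(st.executeQuery(
                "EXPLAIN SELECT count(*) FROM items WHERE owner_id = 42"));

            // Plan chosen without the value: the prepared query is
            // planned generically, so it may settle for a seqscan where
            // the literal query above used an index.
            st.execute("PREPARE q(int) AS "
                + "SELECT count(*) FROM items WHERE owner_id = $1");
            dump(st.executeQuery("EXPLAIN EXECUTE q(42)"));

            conn.close();
        }

        private static void dump(ResultSet rs) throws Exception {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
            rs.close();
        }
    }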

> It's just a heuristic because there *is* a tradeoff and many/most
> applications are not going to be customized specifically to know about
> that tradeoff. And it's configurable because the tradeoff is not the
> same in every case.

Yes, and it's well documented, actually. I obviously didn't read
it carefully enough last time :/ I guess my approach of using
protocol version 2 should be replaced by unsetting the prepare
threshold. I think I came up with that workaround after that
post from Kris:

http://archives.postgresql.org/pgsql-jdbc/2008-03/msg00070.php

because, strangely, you and I both took part in that thread, but
the prepare threshold issue was not raised, so I followed the
protocolVersion=2 path. Did I miss something - e.g. is the topic
today different from the topic back then, for some reason? Am I
wrong in assuming that your "please replan this statement every
time you get new parameters" suggestion is nearly achievable by
unsetting the prepare threshold ("nearly" being the difference
between replanning always, and replanning only when parameters
are new)?

Anyway, documentation-wise, I've tried to think of how the
documentation could be a little more aggressive with the warning:

http://zarb.org/~gc/t/jdbc-more-cautious-preparedstatements.diff

That said, there's something more: when the documentation says:

     There are a number of ways to enable server side prepared
     statements depending on your application's needs. The general
     method is to set a threshold for a PreparedStatement.

I assumed from this that server side prepared statements are
*not* enabled by default, yet they do seem to be, with a threshold
of 5, as a simple test (sketched below) shows when using driver
8.3-604.jdbc3 (on PG 8.3.6).
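
A minimal sketch of such a test (the table and column names are
placeholders; with log_statement = 'all' on the server, the first
five runs are logged as "execute <unnamed>" and later ones under a
statement name such as "S_1"):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class DefaultThresholdProbe {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
            // Reuse the very same PreparedStatement object: with the
            // default threshold of 5, the driver switches to a named
            // (server-prepared) statement on the sixth execution.
            PreparedStatement ps = conn.prepareStatement(
                "SELECT count(*) FROM items WHERE owner_id = ?");
            for (int i = 1; i <= 7; i++) {
                ps.setInt(1, i);
                long start = System.nanoTime();
                ps.executeQuery().close();
                System.out.println("run " + i + ": "
                    + (System.nanoTime() - start) / 1000 + " us");
            }
            conn.close();
        }
    }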

I think that either they should not be enabled by default
(really, things can be better with them, but they can also be so
much worse that one wonders whether a "drop-in" use of the driver
should really use them), or the documentation should clearly
state that they are enabled, and add even more warnings about the
potential drawbacks. WDYT?

http://zarb.org/~gc/t/jdbc-more-cautious-preparedstatements2.diff

Btw, how can the doc be built? "ant doc" failed on missing
docbook.stylesheet but I was unable to find how to set that
value.

> Do you have a suggestion for a better way to decide when to use a named
> statement?

Oh, I feel I don't have the qualifications to answer that
question, sorry! The only thing I could think of was what I
talked about in a previous mail, i.e. save the plans of the first
xx queries before reaching the threshold, then, when the
threshold is reached, compare the overall cost estimates of the
saved plans and do not activate server side prepare if they are
too different, since caching the plan for that query would
probably sometimes yield results that are too slow. Ideally, I
guess a new PG-specific method should be added to activate that
feature (and to set the value for "are the plans too
different?"). But bear in mind that it may be a stupid idea :)

--
Guillaume Cottenceau

Re: [PERFORM] Query much slower when run from postgres function

From: Kris Jurka

On Tue, 10 Mar 2009, Guillaume Cottenceau wrote:

> Btw, how can the doc be built? "ant doc" failed on missing
> docbook.stylesheet but I was unable to find how to set that
> value.
>

Create a file named build.local.properties and put something like the
following in it:

docbook.stylesheet=/usr/share/sgml/docbook/stylesheet/xsl/nwalsh/xhtml/chunk.xsl
docbook.dtd=/usr/share/sgml/docbook/dtd/xml/4.2

Kris Jurka