Thread: Cached Query Plans (was: global prepared statements)
Well, I realized the idea of global prepared statements actually sucked, so I set out on another approach thanks to ideas from this list: caching query plans.

First, let's see if there is low hanging fruit with the typical small, often-executed queries that are so frequent on websites. Tables test, test2 and test3 contain id (integer primary key) and another integer field. There are 100K rows in each.

First, the simplest query:

SELECT * FROM test WHERE id = $1

110 us : Send query as text (PHP:pg_query -> PQexec)
125 us : Parse+Bind (PHP:pg_query_params -> PQexecParams)
 67 us : Execute a previously prepared statement (PHP:pg_execute -> PQexecPrepared)

A slightly more complex one, but still pretty classic:

SELECT * FROM (SELECT * FROM test WHERE id > $1 ORDER BY id LIMIT 5) AS a
NATURAL LEFT JOIN test2 NATURAL LEFT JOIN test3 ORDER BY id

523 us : Send query as text (PHP:pg_query -> PQexec)
580 us : Parse+Bind (PHP:pg_query_params -> PQexecParams)
148 us : Execute a previously prepared statement (PHP:pg_execute -> PQexecPrepared)

OK, so there is low hanging fruit, since the parsing+planning time of these queries is longer than executing them. Since the Parse message includes a $-parameterized query that is to be prepared, it seems logical to put the caching logic there: the query string (without parameters) makes a nice cache key.

So I made a really quick and really dirty experiment without changing the wire protocol between client and server. This is only a proof of concept.

Try #1 : in exec_parse_message(), if the statement is named, look it up in the prepared statements cache; if it is found, return at once and do nothing else. To exploit this, I issue a pg_prepare() followed by pg_execute() on every query, whether or not the statement exists. If it already exists, pg_prepare() now does nothing (except losing a little time).

Results:

 88 us : simple query
173 us : complex query

So the timings fall between a plain execute and a plan+execute.
It provides a nice performance gain versus replanning every time, but it's not perfect.

Try #2 : again in exec_parse_message(), if the statement is unnamed, I use the query string as the statement name and search for the plan in the prepared statements hash table. If it is not found, it is prepared. Then I make the unnamed statement plan point to this. Of course this is dangerous, since it probably introduces a dozen crash bugs, but for this proof of concept it's OK. Client code is unchanged; PQexecParams will benefit from the plan caching, since it always sends a Parse+Bind message using the unnamed statement.

Results are identical to executing an execute on a prepared statement, modulo a few microseconds. This means the overhead of sending the Parse message, and of the server ignoring it when the statement is cached, is negligible.

So, where to go from that ? I don't see a way to implement this without a (backwards-compatible) change to the wire protocol, because clients will want to specify when a plan should be cached or not. Since the user should not have to name each and every one of the statements they want plan caching for, I see the following choices :

- Add a new Parse+Bind command, which gets the $-parameterized SQL and the parameters. If the plan is cached, grab it and execute; else prepare and execute. Add a flag to let the client specify whether he wants caching or not.
  Pros : only one message, faster.
  Cons : the SQL is transmitted in full, uselessly most of the time, but this overhead is rather small.

- Send the SQL with Bind as the statement name, add a flag to Bind telling it to report a cache miss instead of raising an error, then have the client send a Parse and Bind again.

- Should there be one common hashtable for named prepared statements and cached plans, or two hashtables ? Using the SQL string as the statement name is not clean.
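The Try #2 logic above can be sketched in a few lines. This is a minimal Python simulation, not PostgreSQL source: a per-backend cache keyed on the query text, where a Parse on a cached query skips planning entirely. The class and function names are invented for illustration.

```python
# Sketch of the Try #2 idea: a plan cache keyed on query text, so an
# unnamed-statement Parse becomes a near-no-op on a cache hit.
class PlanCache:
    def __init__(self):
        self._plans = {}   # query text -> "plan"
        self.hits = 0
        self.misses = 0

    def parse(self, query_text, planner):
        """Return a cached plan, planning (and caching) only on a miss."""
        plan = self._plans.get(query_text)
        if plan is not None:
            self.hits += 1          # the Parse message is effectively ignored
            return plan
        self.misses += 1
        plan = planner(query_text)  # the expensive parse+plan step
        self._plans[query_text] = plan
        return plan

cache = PlanCache()
planner = lambda sql: ("PLAN", sql)   # stand-in for the real planner

for _ in range(3):
    cache.parse("SELECT * FROM test WHERE id = $1", planner)

print(cache.hits, cache.misses)
```

The point of the experiment is visible in the counters: three Parses, one planning pass.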
PFC wrote: > So, where to go from that ? I don't see a way to implement this without > a (backwards-compatible) change to the wire protocol, because the clients > will want to specify when a plan should be cached or not. Since the user > should not have to name each and every one of the statements they want to > use plan caching, I see the following choices : I don't understand the point here. We already have cached plans: you send a Parse. You can then Bind/Execute many times. Maybe what we need is support for this in libpq, so that PHP can use it? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > PFC wrote: >> So, where to go from that ? I don't see a way to implement this without >> a (backwards-compatible) change to the wire protocol, because the clients >> will want to specify when a plan should be cached or not. Since the user >> should not have to name each and every one of the statements they want to >> use plan caching, I see the following choices : > I don't understand the point here. We already have cached plans: you > send a Parse. You can then Bind/Execute many times. > Maybe what we need is support for this in libpq, so that PHP can use it? We already have that, too ... regards, tom lane
"Alvaro Herrera" <alvherre@commandprompt.com> writes:
> PFC wrote:
>
>> So, where to go from that ? I don't see a way to implement this without
>> a (backwards-compatible) change to the wire protocol, because the clients
>> will want to specify when a plan should be cached or not. Since the user
>> should not have to name each and every one of the statements they want to
>> use plan caching, I see the following choices :
>
> I don't understand the point here. We already have cached plans: you
> send a Parse. You can then Bind/Execute many times.

I think what he's referring to is persistently caching plans so that new connections can use them. That makes a lot more sense if you have lots of short-lived connections, like a stock PHP server without persistent connections turned on, or a connection pooler. You can prepare queries, but they only live for a single web page, so you don't get any benefit.

Personally I would like to see this, not primarily for the performance gains, but for the possibility of managing when plans change -- ie, plan stability. But there is resistance from other quarters about the reliability hit of having the plan data structures in shared memory. A bug in one backend could cause other backends to crash or corrupt their memory. The possibility exists with other shared data structures but, arguably, plans are much more complex data structures than PGPROC entries and buffers.

I still don't see why you would need a wire protocol change. You would just have clients prepare plans normally and stash them in shared memory for other backends in a hash table keyed by, well, something, perhaps the original query text. Then whenever you're asked to prepare a query you go check if someone else has already done it for you and find an already generated plan in the shared memory hash table. The contention on the shared cache is likely to negate much of the planning savings, but I think it would still be a win.
But what's really interesting to me is then providing an interface to see and manipulate that cache. Then you could see what plans other backends are using for queries, mark plans as being acceptable or not, and even revoke users' permissions to execute queries which aren't already present and marked as being acceptable. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
In article <20080411170609.GB4392@alvh.no-ip.org>, PFC wrote:
> So, where to go from that ? I don't see a way to implement this without
> a (backwards-compatible) change to the wire protocol, because the clients
> will want to specify when a plan should be cached or not. Since the user
> should not have to name each and every one of the statements they want to
> use plan caching, I see the following choices :

Doesn't Oracle do this now transparently to clients? That is, I believe Oracle keeps a statement/plan cache in its shared memory segment (SGA) that greatly improves its performance at running queries that don't change very often.

From that point of view, Oracle at least sees benefits in doing this. From my POV it's a transparent performance enhancer for all those PHP and Rails apps out there.

With plan invalidation in 8.3 this becomes feasible for pgSQL to do as well.

-arturo
On Sat, Apr 12, 2008 at 2:44 PM, Perez <arturo@ethicist.net> wrote:
> In article <20080411170609.GB4392@alvh.no-ip.org>,
>
> > PFC wrote:
> >
> > > So, where to go from that ? I don't see a way to implement this without
> > > a (backwards-compatible) change to the wire protocol, because the clients
> > > will want to specify when a plan should be cached or not. Since the user
> > > should not have to name each and every one of the statements they want to
> > > use plan caching, I see the following choices :
>
> Doesn't Oracle do this now transparently to clients? That, I believe
> Oracle keeps a statement/plan cache in its shared memory segment (SGA)
> that greatly improves its performance at running queries that don't
> change very often.
>
> From that point of view, Oracle at least sees benefits in doing this.
> From my POV a transparent performance enhancer for all those PHP and
> Rails apps out there.

There are other benefits as well. Oracle lets you see the statistics associated with given plans. So you can see how many times a given (cached) query was executed, how many resources it consumed, and so on. Right now the only way of getting such information from PostgreSQL is by logging all queries and analyzing the logs. The current_query column of pg_stat_activity is useless, as the (prepared) queries are usually so short-lived that you will see one execution out of the thousands happening.

Nooow, suppose we do have cached plans. Then we can have a view pg_stat_queries + a stats collector which will track number of executions, number of blocks hit, blocks read, etc. Would be great! :)

Regards,
Dawid
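The pg_stat_queries idea above is essentially a per-query accumulator keyed on the cached statement. A hedged Python sketch, with field names invented for illustration (nothing here is an actual PostgreSQL view):

```python
# Sketch of per-cached-query statistics a stats collector could expose:
# one counter record per query text, updated on every execution.
from collections import defaultdict

stats = defaultdict(lambda: {"calls": 0, "total_time": 0.0, "rows": 0})

def record_execution(query_text, elapsed_seconds, rows_returned):
    """Accumulate execution counters for one cached query."""
    s = stats[query_text]
    s["calls"] += 1
    s["total_time"] += elapsed_seconds
    s["rows"] += rows_returned

# Two executions of the simple query from the first mail (~67 us each):
record_execution("SELECT * FROM test WHERE id = $1", 67e-6, 1)
record_execution("SELECT * FROM test WHERE id = $1", 70e-6, 1)

s = stats["SELECT * FROM test WHERE id = $1"]
print(s["calls"], s["rows"])
```

Answering "what query is killing my server ?" then becomes a sort of this table by total_time.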
On Fri, Apr 11, 2008 at 12:34 PM, PFC <lists@peufeu.com> wrote: > Well, I realized the idea of global prepared statements actually > sucked, so I set on another approach thanks to ideas from this list, this is > caching query plans. Well, that's a blatantly bad realization. Perhaps you should do more research. -- Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com Edison, NJ 08837 | http://www.enterprisedb.com/
On Sat, Apr 12, 2008 at 8:44 AM, Perez <arturo@ethicist.net> wrote:
> Doesn't Oracle do this now transparently to clients?

Of course it does, and it has since the late 80's I believe.

> Oracle keeps a statement/plan cache in its shared memory segment (SGA)
> that greatly improves its performance at running queries that don't
> change very often.

Yep.

> From that point of view, Oracle at least sees benefits in doing this.

Yes, it is also a bit more advanced than we're discussing here, so I'll just leave it at that.

> From my POV a transparent performance enhancer for all those PHP and
> Rails apps out there.

Yes.

> With plan invalidation in 8.3 this becomes feasible for pgSQL to do as
> well.
>
> -arturo

-- 
Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com Edison, NJ 08837 | http://www.enterprisedb.com/
On Sat, Apr 12, 2008 at 2:19 PM, Dawid Kuroczko <qnex42@gmail.com> wrote: > There are other benefits as well. Oracle lets you see the statistics associated > with given plans. So you can see how many times given (cached) query was > executed, how much resources did it consume and do on. Yes, and it also uses that data at both the statement and column level to determine what needs more analysis to help build better plans in the future. > Right now the only way of getting such information from PostgreSQL is by > logging all queries and analyzing logs. The current_query column of > pg_stat_activity is useless as the (prepared) queries are usually so short > lived that you will see one execution out of thousands happening. Yes, this is worthless on large active databases. The logging overhead alone starts to affect performance. > Nooow, suppose we do have cached plans. Then we can have a view > pg_stat_queries + a stats collector which will track number of executions, > number of blocks hit, blocks read, etc. Would be great! :) With the exception that the stats collector itself needs a bit of work to handle larger volumes of statistics, I agree. -- Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com Edison, NJ 08837 | http://www.enterprisedb.com/
"Jonah H. Harris" <jonah.harris@gmail.com> writes: > On Sat, Apr 12, 2008 at 2:19 PM, Dawid Kuroczko <qnex42@gmail.com> wrote: >> There are other benefits as well. Oracle lets you see the statistics associated >> with given plans. So you can see how many times given (cached) query was >> executed, how much resources did it consume and do on. > Yes, and it also uses that data at both the statement and column level > to determine what needs more analysis to help build better plans in > the future. >> Right now the only way of getting such information from PostgreSQL is by >> logging all queries and analyzing logs. The current_query column of >> pg_stat_activity is useless as the (prepared) queries are usually so short >> lived that you will see one execution out of thousands happening. > Yes, this is worthless on large active databases. The logging > overhead alone starts to affect performance. But somehow, all that stuff with cached plans is free? regards, tom lane
On Sat, Apr 12, 2008 at 10:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Yes, this is worthless on large active databases. The logging > > overhead alone starts to affect performance. > > But somehow, all that stuff with cached plans is free? Of course not. The first time you execute a query, it is cached... so you pay the same penalty you do in PG, but in many cases, only once. In regards to plan re-use, sure there's going to be some contention on the hash buckets... but that can be mitigated in a lot of ways. In addition to that, Oracle collects over two thousand other statistics in real-time... yet somehow Oracle is quite fast. So, I would say that the usual complaint about collecting stats should be more an issue of proper implementation than a complaint about the act of collection itself. -- Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324 EnterpriseDB Corporation | fax: 732.331.1301 499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com Edison, NJ 08837 | http://www.enterprisedb.com/
Another issue with plan caches, besides contention, in Oracle at least, is shared memory fragmentation (as plans aren't all the same size in memory...). But this cache is very helpful for deployments where every query is done via prepare/execute/deallocate. I've seen it a lot in Java apps, the purpose being to benefit from the advantages of prepared statements, but without having to deal with storing those prepared statements somewhere. And of course, as said before, the statistics associated with those plans can be very helpful, mostly for all those very small queries that are run very frequently (a badly developed app most of the time, but that happens).

On Sunday 13 April 2008 06:21:41, Jonah H. Harris wrote:
> On Sat, Apr 12, 2008 at 10:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > > Yes, this is worthless on large active databases. The logging
> > > overhead alone starts to affect performance.
> >
> > But somehow, all that stuff with cached plans is free?
>
> Of course not. The first time you execute a query, it is cached... so
> you pay the same penalty you do in PG, but in many cases, only once.
> In regards to plan re-use, sure there's going to be some contention on
> the hash buckets... but that can be mitigated in a lot of ways.
>
> In addition to that, Oracle collects over two thousand other
> statistics in real-time... yet somehow Oracle is quite fast. So, I
> would say that the usual complaint about collecting stats should be
> more an issue of proper implementation than a complaint about the act
> of collection itself.
>
> -- 
> Jonah H. Harris, Sr. Software Architect | phone: 732.331.1324
> EnterpriseDB Corporation | fax: 732.331.1301
> 499 Thornall Street, 2nd Floor | jonah.harris@enterprisedb.com
> Edison, NJ 08837 | http://www.enterprisedb.com/
Would it be possible to store plans with an indication of the search path that was used to find tables, and for temp tables some snapshot of the statistics for the table, if any?

My suspicions are that:
* where you have a lot of short-lived connections, they will often use the default search path - or a similar one
* if a temp table is in use, then normally these will be small or contain 'similar' data

There is a danger that these heuristics will be poor if long-running connections are in play - but they have no excuse not to do their own preparation. Perhaps you could have named cache segments, and connections could 'opt in' to a cache segment if they want such sharing?

James
> On Fri, Apr 11, 2008 at 12:34 PM, PFC <lists@peufeu.com> wrote:
>> Well, I realized the idea of global prepared statements actually
>> sucked, so I set on another approach thanks to ideas from this list,
>> this is caching query plans.
>
> Well, that's a blatantly bad realization. Perhaps you should do more
> research.

No, what I meant is that the "global prepared statements" as I tried to implement them before weren't that good... I think simple caching based on the query text itself is preferable to having to name each of your queries, extract them from your programs and replace them with executes, issue a "create statement" command for each of them, etc. Few people would actually use that feature because it would mean lots of modifications to the application, so all the applications that have to be compatible with other databases would not use the feature (*). It could be useful for permissions and fine access control, though, but views and stored procs already provide that functionality...

(*) = Note that caching the plans based on the query text (with $ params) from a parse message will not provide caching for oldskool queries with params inside in the form of escaped strings. This is good, because it means the safer solution (using $-quoted params) will also be the faster solution. And in the application, only a very small part of the code needs to be changed: the DB abstraction layer.

>> Doesn't Oracle do this now transparently to clients?
>
> Of course it does, and it has since the late 80's I believe.
>
>> Oracle keeps a statement/plan cache in its shared memory segment (SGA)
>> that greatly improves its performance at running queries that don't
>> change very often.

Can we have more details on how Oracle does it ? For "inspiration"...
Here is what I'm thinking about. Don't flame me too much about implementation issues, this is just throwing ideas in the air to see where they'll fall ;)

* global plan cache in shared memory, implemented as a hashtable, the hash key being (search_path, query_string)

Doubt : can a plan be stored in shared memory ? Will it have to be copied to local memory before being executed ?

This stores :
- the plans (not for all keys, see below)
- the stats :
  - number of times this query has been executed,
  - total, min and max wallclock time and CPU time spent planning this query,
  - total, min and max wallclock time, CPU time and RAM spent executing this query,
  - total, min and max number of rows returned,
  - timestamp of the last execution of this query.

There should be separate GUCs to control this :
- should the whole thing be activated ?
- should the cache be active ? or just the stats ? and what stats ?

There should also be a way to query this to display the statistics (ie "what query is killing my server ?"), and a way to purge old plans.

* every time a Parse message comes up :
- look if the (search_path, query_string) is in the cache
- if it is in the cache :
  - if there is a cached plan, make the unnamed statement point to it, and we're done.
  - if there is no cached plan, prepare the query and put it in the unnamed statement.

Now the query has been parsed, so we can decide if it is cacheable. Should this be done in Parse, in Bind, or somewhere else ? I have no idea.

For instance, queries which contain VALUES() or IN( list of consts ) should not be cached, since the IN() is likely to change all the time; it would just trash the cache. Using =ANY( $1 ) instead will work with cached plans.

Also, will a plan to be cached have to be prepared with or without the parameters ? That's also an interesting question...

Perhaps the user should also be able to specify whether to cache a plan or not, or whether to use the params or not, with hint flags in the query string ?
(like mysql, /* flags */ SELECT blah )

Now, if the query is cacheable, store it in the cache and update the stats. If we decided to store the plan, do that too. For instance, we might decide to store the plan only if this query has been executed a certain number of times, etc.

* In the Execute message, if a cached plan was used, execute it and update the stats (time spent, etc).

Now, about contention : since this is one shared hashtable for everyone, it will be fought for... However, the lock on it is likely to be held for a very short time (much less than a microsecond), so would it be that bad ? Also, GUCs can be used to mitigate the contention; for instance, if the user is not interested in the stats, the thing becomes mostly read-only.
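The proposal above (cache keyed on (search_path, query_string), stats always tracked, plan stored only once a query proves hot) can be sketched as follows. This is a hedged Python illustration of the control flow only; the threshold and all names are invented, and the real thing would live in shared memory with locking.

```python
# Sketch of the proposed cache: always count executions, but store the
# plan only after the query has been seen a few times.
CACHE_AFTER_N_EXECUTIONS = 2   # arbitrary threshold for the sketch

cache = {}   # (search_path, query_string) -> {"count": int, "plan": ...}

def on_parse(search_path, query, planner):
    key = (search_path, query)
    entry = cache.setdefault(key, {"count": 0, "plan": None})
    entry["count"] += 1
    if entry["plan"] is not None:
        return entry["plan"]              # cache hit: skip planning entirely
    plan = planner(query)                 # plan for this execution anyway
    if entry["count"] >= CACHE_AFTER_N_EXECUTIONS:
        entry["plan"] = plan              # hot enough: keep the plan around
    return plan

planner = lambda q: ("PLAN", q)
for _ in range(3):
    on_parse("public", "SELECT * FROM test WHERE id = $1", planner)

print(cache[("public", "SELECT * FROM test WHERE id = $1")]["count"])
```

Keying on the search_path as well as the query text avoids handing one session a plan that resolves table names differently in another.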
On Sun, Apr 13, 2008 at 02:26:04PM +0200, PFC wrote:
> * global plan cache in shared memory, implemented as hashtable, hash key
> being the (search_path, query_string)
> Doubt : Can a plan be stored in shared memory ? Will it have to be copied
> to local memory before being executed ?

Frankly, I think you're better off storing them in a table. Shared memory is a limited resource, and you cannot change how much you've allocated after the server has started. It does mean you'll have to serialise/deserialise them, but this will be cheaper than replanning, right?

I liked your global prepared statements idea much better. Naming the statements is no problem: DB frontends do that for you anyway sometimes.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
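The store-in-a-table alternative boils down to serialising a plan on store and deserialising it on lookup. A rough Python sketch of that shape, with pickle standing in for whatever plan (de)serialisation would really look like (the store and its API are invented for illustration):

```python
# Sketch of keeping plans in ordinary storage rather than shared memory:
# plans are serialised on store and deserialised on lookup.
import pickle

plan_table = {}   # query text -> serialised plan bytes (a real table in the proposal)

def store_plan(query, plan):
    plan_table[query] = pickle.dumps(plan)

def load_plan(query):
    """Deserialise a stored plan, or None on a cache miss."""
    blob = plan_table.get(query)
    return pickle.loads(blob) if blob is not None else None

store_plan("SELECT 1", {"node": "Result", "cost": 0.01})
print(load_plan("SELECT 1")["node"])
```

The bet in the mail above is that the deserialisation step is still much cheaper than a full replan.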
On Sun, Apr 13, 2008 at 2:26 PM, PFC <lists@peufeu.com> wrote:
> > > Oracle keeps a statement/plan cache in its shared memory segment (SGA)
> > > that greatly improves its performance at running queries that don't
> > > change very often.
> Can we have more details on how Oracle does it ? For
> "inspiration"...

Why limit ourselves to Oracle? Let's look at how all the major proprietary RDBMSs do it. Here is a nice presentation I've found on DB2; they call it the "Dynamic Statement Cache":

http://www.tbrug.com/TB%20UG%20Dynamic%20Statement%20Cache.ppt

> Here is what I'm thinking about :
> Don't flame me too much about implementation issues, this is just
> throwing ideas in the air to see where they'll fall ;)
>
> * global plan cache in shared memory, implemented as hashtable, hash key
> being the (search_path, query_string)
> Doubt : Can a plan be stored in shared memory ? Will it have to be copied
> to local memory before being executed ?

Well, Oracle uses the terms "hard parse" and "soft parse", the former being preparing the whole query, the latter reusing a query plan prepared by some other session. More or less. See this link for a more detailed description:

http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:2588723819082

(this is quite an interesting read)

> This stores :
> - the plans (not for all keys, see below)
> - the stats : [...]

I am not too sure that plans and statistical counters should be stored together... Probably plans should go in one place, and statistics should go to the stats collector (I know he's not quite ready for this ;)).

> There should be also a way to query this to display the statistics (ie
> "what query is killing my server ?"), and a way to purge old plans.

Hm, a limit on how much memory can be used for plans (a query_plan_cache_size GUC?), and LRU/LFU expiration of old plans?
> * every time a Parse message comes up :
> - look if the (search_path, query_string) is in the cache
> - if it is in the cache :
> - if there is a cached plan, make the unnamed statement point to it,
> and we're done.
> - if there is no cached plan, prepare the query, and put it in the
> unnamed statement.
>
> Now, the query has been parsed, so we can decide if it is cacheable. Should
> this be done in Parse, in Bind, or somewhere else ? I have no idea.
>
> For instance, queries which contain VALUES() or IN( list of consts ) should
> not be cached, since the IN() is likely to change all the time, it would
> just trash the cache. Using =ANY( $1 ) instead will work with cached plans.

Perhaps a GUC for controlling the query cache should have three values:
none -- don't cache any statement
smart -- use heuristics for deciding whether to cache it
all -- force caching all queries -- for uncommon/statistical/testing purposes.

> Also, will a plan to be cached have to be prepared with or without the
> parameters ? That's also an interesting question...
> Perhaps the user should also be able to specify whether to cache a plan or
> not, or whether to use the params or not, with hint flags in the query
> string ?
> (like mysql, /* flags */ SELECT blah )

I don't like the hint flags. They tend to haunt later on (when the database gets smarter, but the application forces it to be dumb). I would say a GUC. A GUC gives freedom of change to the application, and can also be set per user with ALTER USER.

> Now, if the query is cacheable, store it in the cache, and update
> the stats. If we decided to store the plan, do that too. For instance we
> might decide to store the plan only if this query has been executed a
> certain number of times, etc.

Interesting idea. I think I like it.

> * In the Execute message, if a cached plan was used, execute it and update
> the stats (time spent, etc).
>
> Now, about contention, since this is one shared hashtable for
> everyone, it will be fought for...
> However, the lock on it is likely to be held during a very small
> time (much less than a microsecond), so would it be that bad ?
> Also, GUC can be used to mitigate the contention, for instance if
> the user is not interested in the stats, the thing becomes mostly read-only

I would say: keep the stats separate. For every plan cached, generate some unique id (perhaps an OID? I am not convinced), and use this ID as the key for the statistics. I tend to think of it as a "temporary table, and temporary table stats". :)

Regards,
Dawid
> Why limit ourselves with Oracle? How all major proprietary RDBMSs do it.

Thanks for the links. Very interesting. The DB2 document especially mentions an important point: in order to make their planner/optimizer smarter, they had to make it slower, hence it became crucial to cache the plans. Contrast this with MySQL, where using prepared statements gains nothing: the "optimizer" does so little work that it actually doesn't matter.

So, basically, Oracle :
- Parses the query every time (identifies tables, permissions etc) (soft parse)
- From that parsed query, it looks up a cached plan (the lookup key could then be different depending on the schema etc)
- If none is found, it must plan the query (hard parse).

Also, the Oracle doc mentions that soft parsing should be avoided by using prepared statements in the application (ie Parse once and Bind lots of times). So, Oracle will redo the parsing + permissions check each time, unless prepared statements are used, in which case it's direct execution.

And DB2 : er, the document is not very clear about what it actually does, but the stats look nice ;)

> I liked your global prepared statements idea much better. Naming the
> statements is no problem: DB frontends do that for you anyway
> sometimes.

Hm. The "global statements" and the cache would complement each other, actually. Why not. When the user wants to name the statements, he can do so (and perhaps control who can execute what, etc, like with stored procs).

Permission checking overhead will be there at each execution. Should the plan be cached locally ? (RAM consumption times N backends...) Cached per user once permissions have been checked ? (avoids the overhead of rechecking permissions) What about the search path ? (I'd force the global statements to use the default search path no matter what; being explicit is better than "why does it stop working ?")

Can the application or the database library name the statements ? I'm not so sure.
This could work for compiled languages. (What about when you run several applications ? Or several versions of the same application ? Do we need uniqueness of statement names from all developers all over the world ?) Solution : make each application use a different user name, and make global prepared statements only visible to the user that created them, perhaps. This conflicts with some desirable features, though. It needs more thinking.

What about non-compiled languages ? It will not be possible to generate a list of statements beforehand... And queries are also constructed dynamically by frameworks such as Rails, which makes naming them impossible, but caching the plans would work well. So, some situations would benefit from a plan cache...

> Frankly, I think you're better off storing them in a table. Shared
> memory is a limited resource and you cannot change how much you've

I'd say that unless you have a perverse application that will try all the permutations of column names just to make sure the query is different every time, how many different queries would you want to cache ?... Probably less than 1000... so it wouldn't take more than a couple megabytes...

> allocated after the server has started. It does mean you'll have to
> serialise/deserialise them, but this will be cheaper than replanning,
> right?

What would be the overhead of a catalog lookup to get a cached plan for a statement that returns 1 row ? Would the catalog cache make it fast enough ? And what about deserialization ?...

> I am not too sure that plans and statistical counters should be stored
> together...

Not sure either.

> Probably plans should go in one place, and statistics should go to the
> stats collector (I know he's not quite ready for this ;)).

That's the problem...

> Hm, a limit on how much memory can be used for plans
> (query_plan_cache_size GUC?), and a LRU/LFU expiration
> of old plans?

Now it gets hairy ;) Yes, memory size should be limited.
But how to make an LRU cleaner which doesn't create lots of contention ?... Luckily, with a hash having a fixed number of buckets, it is easier (clean a bucket every N seconds, for instance).

> Perhaps a GUC for controlling query cache should have three values:
> none -- don't cache any statement
> smart -- use heuristics for deciding whether to cache it
> all -- force caching all queries -- for uncommon/statistical/testing
> purposes.

I would not volunteer to write that heuristic ;) Although there would be a very simple solution : if time to parse > some percentage of time to execute, then cache. The hairiness is in the plan's dependence (or independence) on parameter values; ideally we only want to cache plans that would be good for all parameter values, and only the user knows that precisely. Although it could be possible to examine the column histograms...

>> (like mysql, /* flags */ SELECT blah )
>
> I don't like the hint flags. They tend to haunt later on (when the
> database gets smarter, but application forces it to be dumb). I would
> say a GUC.

I don't like them either... This needs a better solution, like a flag in PQexecParams, but that would cause lots of trouble, so it's not really possible...
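The "very simple solution" above is just a ratio test. A minimal sketch, with the 10% threshold being an arbitrary assumption (the mail leaves the percentage open):

```python
# Cache a plan only when planning cost is a significant fraction of
# execution cost; the threshold here is an invented placeholder.
def should_cache(plan_time, execute_time, threshold=0.10):
    """Heuristic: caching pays off when planning isn't noise."""
    return plan_time > threshold * execute_time

# The simple SELECT from the first mail: planning costs almost as much
# as executing (~125 us parse+bind vs ~67 us execute), so cache it.
print(should_cache(58e-6, 67e-6))
# A long report query where a millisecond of planning is noise: don't.
print(should_cache(1e-3, 10.0))
```

As the mail notes, this says nothing about whether the cached plan is good for all parameter values; that part is the hairy bit.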
> The hairiness is in the plan dependence (or independence) on parameter
> values, ideally we only want to cache plans that would be good for all
> parameter values, only the user knows that precisely. Although it could be
> possible to examine the column histograms...

If cached plans were implemented, the dependence on parameter values could be solved too: use special "fork" nodes in the plan which execute different sub-plans depending on special parameter values/ranges, possibly looking up the stats at runtime, so that the plan is in a compiled state with the "decision points" wired in.

This of course would mean a lot heavier planning and possibly a lot bigger plans, but you could afford that if you cache the plan. You could even have a special command to plan a query this way.

Cheers,
Csaba.
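The "fork node" idea above could be sketched like this (a toy Python model, not PostgreSQL internals; node and plan names are made up): each branch is planned ahead of time, and at execution the node only does a cheap range check on the parameter value.

```python
class ForkNode:
    """Hypothetical 'fork' plan node: at execution time it inspects one
    parameter value and dispatches to the sub-plan that was compiled for
    that value range -- no replanning happens at execution."""

    def __init__(self, param_index, branches, default_plan):
        # branches: list of ((low, high), sub_plan) pairs, bounds inclusive
        self.param_index = param_index
        self.branches = branches
        self.default_plan = default_plan

    def choose(self, params):
        value = params[self.param_index]
        for (low, high), sub_plan in self.branches:
            if low <= value <= high:
                return sub_plan
        return self.default_plan

# Toy example for "WHERE id > $1 ... LIMIT 5" on a 100K-row table:
# selective values of $1 get an index scan; values that match most of
# the table fall through to the default plan.
fork = ForkNode(
    param_index=0,
    branches=[((0, 90_000), "IndexScan(test_pkey)")],
    default_plan="SeqScan(test)+Sort",
)
```

The expensive part, as noted, is choosing the decision points and planning every branch up front; the dispatch itself is trivial.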
> If cached plans would be implemented, the dependence on parameter values
> could be solved too: use special "fork" nodes in the plan which execute
> different sub-plans depending on special parameter values/ranges,
> possibly looking up the stats at runtime, so that the plan is in a
> compiled state with the "decision points" wired in.
>
> This of course would mean a lot heavier planning and possibly a lot
> bigger plans, but you could afford that if you cache the plan. You could
> even have a special command to plan a query this way.

And the "fork node" could mutter to itself: "Strange, I'm getting 10000 rows instead of the 2 for which I was planned, perhaps I should switch to a different plan..."

I have made another very simple hack to test another option.

Bind message behaviour was modified:
- If the user asks for execution of a named prepared statement, and the named statement does not exist in PG's prepared statements cache, instead of issuing an error and borking the transaction, it Binds to an empty statement that takes no parameters and returns no result. Parameters sent by the user are consumed but not used.

The application was modified thusly:
- Calls to pg_query_params were changed to calls to the following function:

    function pg_query_cached( $sql, $params )
    {
        // Try to execute it, using the query string as the statement name.
        $q = pg_execute( $sql, $params );
        if( !$q ) die( pg_last_error() );

        // If it worked, return the result to the caller.
        if( pg_result_status( $q, PGSQL_STATUS_STRING ) != "" )
            return $q;

        // If we got an empty query result (not a result with 0 rows,
        // which is valid), then prepare the query...
        $q = pg_prepare( $sql, $sql );
        if( !$q ) die( pg_last_error() );

        // ...and execute it again.
        $q = pg_execute( $sql, $params );
        if( !$q ) die( pg_last_error() );
        return $q;
    }

Pros:
- It works
- It is very, very simple
- The user can choose between caching plans or not by calling pg_query_params() (no cached plans) or pg_query_cached() (cached plans)
- It works with persistent connections

Cons:
- It is too simple
- Plans are cached locally, so memory use is proportional to the number of connections
- It is still vulnerable to search_path problems
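The same execute-first, prepare-on-miss dance can be modelled without the server-side empty-statement hack by treating the miss as a catchable error instead. A Python sketch against a stand-in connection object (the connection class and error type are made up for illustration; a real client would catch PostgreSQL error 26000, "prepared statement does not exist"):

```python
class StatementMissing(Exception):
    """Stands in for 'prepared statement does not exist'."""

class FakeConnection:
    """Minimal stand-in for a connection with a statement cache."""
    def __init__(self):
        self.prepared = {}      # statement name -> SQL text
        self.executions = []    # log of (name, params)

    def prepare(self, name, sql):
        self.prepared[name] = sql

    def execute_prepared(self, name, params):
        if name not in self.prepared:
            raise StatementMissing(name)
        self.executions.append((name, params))
        return ("result-of", self.prepared[name], tuple(params))

def query_cached(conn, sql, params):
    """Execute using the SQL text itself as the statement name;
    prepare on first miss, then retry -- mirroring pg_query_cached()."""
    try:
        return conn.execute_prepared(sql, params)
    except StatementMissing:
        conn.prepare(sql, sql)  # the SQL string doubles as the cache key
        return conn.execute_prepared(sql, params)
```

Note that on a stock server the error would abort the current transaction, which is exactly why the hack replaces the error with an empty result; the sketch only shows the client-side control flow.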
PFC wrote: > Bind message behaviour was modified : > - If the user asks for execution of a named prepared statement, and the > named statement does not exist in PG's prepared statements cache, instead > of issuing an error and borking the transaction, it Binds to an empty > statement, that takes no parameters, and returns no result. Parameters > sent by the user are consumed but not used. You mentioned the need for a wire protocol change to allow this. Why can't this be controlled with a server variable, like SET auto_prepare = 'true'? -- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com + If your life is a hard drive, Christ can be your backup. +
Csaba Nagy wrote:
> If cached plans would be implemented, the dependence on parameter values
> could be solved too: use special "fork" nodes in the plan which execute
> different sub-plans depending on special parameter values/ranges,
> possibly looking up the stats at runtime, so that the plan is in a
> compiled state with the "decision points" wired in.

That's an idea I've been thinking about for a long time, but never got around to implementing. I see that as a completely orthogonal feature to the server-side shared plan cache, though. There are plenty of scenarios, like a client-side prepared statement cache, where it would be useful.

Figuring out the optimal "decision points" is hard, and potentially very expensive. There is one pretty simple scenario, though: enabling the use of partial indexes, preparing one plan where a partial index can be used, and another one where it can't. Another such case is "col LIKE ?" queries, where ? is actually a prefix query, "foo%".

As an optimization, we could decide the decision points on the prepare message, and delay actually planning the queries until they're needed. That way we wouldn't waste time planning queries for combinations of parameters that are never used.

-- 
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
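Combining the two points above -- fixing the decision points at prepare time but planning each branch lazily on first use -- might look like this sketch (Python as pseudocode; the classifier and plan names are invented, and the LIKE-prefix test is deliberately simplistic, ignoring escapes):

```python
class LazyForkPlan:
    """Decision points are fixed at Prepare time, but each branch is
    planned only on first execution -- parameter combinations that
    never occur are never planned."""

    def __init__(self, classify, plan_branch):
        self.classify = classify        # params -> branch key (cheap)
        self.plan_branch = plan_branch  # branch key -> plan (expensive)
        self.plans = {}
        self.plan_calls = 0             # counts expensive plannings

    def get_plan(self, params):
        key = self.classify(params)
        if key not in self.plans:
            self.plan_calls += 1        # expensive planning happens here
            self.plans[key] = self.plan_branch(key)
        return self.plans[key]

# Branch on whether a LIKE pattern is a plain prefix ("foo%") that an
# index range scan can serve, versus a general pattern needing a filter.
def classify_like(params):
    pat = params[0]
    body = pat[:-1] if pat.endswith("%") else pat
    is_prefix = pat.endswith("%") and "%" not in body and "_" not in body
    return "prefix" if is_prefix else "generic"

plan = LazyForkPlan(classify_like,
                    lambda key: {"prefix": "IndexRangeScan",
                                 "generic": "SeqScan+Filter"}[key])
```

The point of the sketch is the bookkeeping: two executions with prefix patterns trigger only one planning, and the generic branch isn't planned until a non-prefix pattern actually shows up.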
On Mon, 2008-04-14 at 16:54 +0300, Heikki Linnakangas wrote:
> Figuring out the optimal "decision points" is hard, and potentially very
> expensive. There is one pretty simple scenario though: enabling the use
> of partial indexes, preparing one plan where a partial index can be
> used, and another one where it can't. Another such case is "col LIKE ?"
> queries, where ? is actually a prefix query, "foo%".

Another point is when the cardinality distribution of some key's values is very skewed, with some values very frequent and the majority of values being unique. There you could check the stats at execution time just to decide between the low-cardinality plan and the high-cardinality one...

> As an optimization, we could decide the decision points on the prepare
> message, and delay actually planning the queries until they're needed.
> That way we wouldn't waste time planning queries for combinations of
> parameters that are never used.

... or plan the query with the actual parameter value you get, and also record the range of parameter values you expect the plan to be valid for. If at execution time the parameter happens to be out of that range, replan, and possibly add a new subplan covering the extra range. This could still work with prepared queries (where you don't get any parameter values to start with) by estimating the most probable parameter range (whatever that could mean), and planning for that.

Cheers,
Csaba.
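The plan-with-a-validity-range idea can be sketched as follows (a toy Python model; the planner function and the ranges it returns are invented purely to show the replan-on-out-of-range mechanics):

```python
class RangeValidPlan:
    """Cache a plan together with the parameter range it is presumed
    valid for; replan when an actual value falls outside that range."""

    def __init__(self, planner):
        # planner: value -> (plan, (low, high)) where the plan is
        # presumed valid for parameter values in [low, high].
        self.planner = planner
        self.plan = None
        self.valid = None       # (low, high) or None before first use
        self.replans = 0

    def get_plan(self, value):
        if self.valid is None or not (self.valid[0] <= value <= self.valid[1]):
            self.replans += 1   # out of presumed range: replan
            self.plan, self.valid = self.planner(value)
        return self.plan

# Hypothetical planner: selective values get an index scan presumed
# valid up to 1000; anything larger gets a seqscan plan.
def toy_planner(value):
    if value <= 1000:
        return "IndexScan", (0, 1000)
    return "SeqScan", (1001, float("inf"))

cached = RangeValidPlan(toy_planner)
```

A fuller version would keep the old sub-plan around and add the new one as another branch, rather than replacing it, which is the "add a new subplan covering the extra range" part of the proposal.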
On Mon, 2008-04-14 at 16:10 +0200, Csaba Nagy wrote:
> ... or plan the query with the actual parameter value you get, and also
> record the range of the parameter values you expect the plan to be valid
> for. If at execution time the parameter happens to be out of that range,
> replan, and possibly add a new subplan covering the extra range. This
> could still work with prepared queries (where you don't get any
> parameter values to start with) by estimating the most probable
> parameter range (whatever that could mean), and planning for that.

More on that: recording the presumptions under which the (cached!) plan is thought to be valid would also facilitate setting up dependencies on statistics, to be checked when you analyze tables... and if the key value your query depends on has changed, the analyze process could possibly replan it in the background.

Cheers,
Csaba.
> ... or plan the query with the actual parameter value you get, and also
> record the range of the parameter values you expect the plan to be valid
> for. If at execution time the parameter happens to be out of that range,
> replan, and possibly add a new subplan covering the extra range. This
> could still work with prepared queries (where you don't get any
> parameter values to start with) by estimating the most probable
> parameter range (whatever that could mean), and planning for that.

Another thought: if the cached plans get their own table (as was suggested), then you could also start gathering parameter-range statistics meaningfully... and on the next replan you would know what to optimize your planning efforts for.

Cheers,
Csaba.
>> Bind message behaviour was modified :
>> - If the user asks for execution of a named prepared statement, and the
>> named statement does not exist in PG's prepared statements cache, instead
>> of issuing an error and borking the transaction, it Binds to an empty
>> statement, that takes no parameters, and returns no result. Parameters
>> sent by the user are consumed but not used.
>
> You mentioned the need for a wire protocol change to allow this. Why
> can't this be controlled with a server variable, like SET auto_prepare =
> 'true'?

Actually, thanks to the hack, the wire protocol doesn't change. Explanation:

- Send Parse(SQL) for the unnamed statement + Bind the unnamed statement => works as usual (no cache).
- Send only Bind (named statement) with a statement name that is not found in the cache => doesn't raise an error; instead it informs the application that the statement does not exist. The application can then prepare the statement (send a Parse message with the SQL and a name). I used the SQL itself as the name, but you can use anything else. The application can then send the Bind again, which will (hopefully) work.

So, here, the information ("cache" or "don't cache") is passed from the client to the server in a hidden way: it depends on which function you use to send the query (unnamed statements are not cached, named statements are cached). There is no protocol change, but new information is provided to the server nonetheless.

The downside is that the application needs to be modified (only a little, though), and applications that expect exceptions on "statement does not exist" will break, hence the necessity of a GUC to control it.

It was just a quick & dirty test to see if this way of doing it was an option to consider or not. Apparently it works, but whether it is The Right Way remains to be seen...
I like the cross-session query plan caching talk. I would prefer the feature to be optional (i.e. a per-session "use cross-session query plan cache" variable). I like the "automatically re-plan if the estimate did not match the actual" idea, with some softening technique involved, such as "if the last 3 times it ran it did the wrong thing, learn from the mistake and adapt". The other ideas, about automatically deciding between plans based on ranges and such, strike me as involving enough complexity and logic that, to do it properly, the query might as well be completely re-planned from the beginning to get the most benefit.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>
On Mon, 2008-04-14 at 10:55 -0400, Mark Mielke wrote: > The other ideas about automatically deciding between plans based on > ranges and such strike me as involving enough complexity and logic, that > to do properly, it might as well be completely re-planned from the > beginning to get the most benefit. ... except if you hard-wire the most common alternative plans, you still get the benefit of cached plan for a wider range of parameter values. Not to mention that if you know you'll cache the plan, you can try harder planning it right, getting possibly better plans for complex queries... you could argue that complex queries tend not to be repeated, but we do have here some which are in fact repeated a lot in batches, then discarded. So I guess a cached plan discard/timeout mechanism would also be nice. Cheers, Csaba.
On Mon, 14 Apr 2008 16:17:18 +0200, Csaba Nagy <nagy@ecircle-ag.com> wrote:
> On Mon, 2008-04-14 at 16:10 +0200, Csaba Nagy wrote:
>> ... or plan the query with the actual parameter value you get, and also
>> record the range of the parameter values you expect the plan to be valid
>> for. If at execution time the parameter happens to be out of that range,
>> replan, and possibly add a new subplan covering the extra range. This
>> could still work with prepared queries (where you don't get any
>> parameter values to start with) by estimating the most probable
>> parameter range (whatever that could mean), and planning for that.
>
> More on that: recording the presumptions under which the (cached!) plan
> is thought to be valid would also facilitate setting up dependencies
> against statistics, to be checked when you analyze tables... and if the
> key value which you depend on with your query changed, the analyze
> process could possibly replan it in the background.

LOL, it started with the idea of making small queries faster, and now the brain juice is pouring. Those "Decision" nodes could potentially lead to lots of decisions (ahem). What if you have 10 conditions in the WHERE, plus some joined ones? That would make lots of possibilities...

Consider several types of queries:

- The small, quick query which returns one or a few rows: in this case, planning overhead is large relative to execution time, but I would venture to guess that the plans always end up being the same.
- The query that takes a while: in this case, planning overhead is nil compared to execution time; better to replan every time with the params.
- The complex query that still executes fast because it doesn't process a lot of rows and postgres finds a good plan (for instance, a well-optimized search query). Those would benefit from reducing the planning overhead, but those also typically end up having many different plans depending on the search parameters.
Besides, those queries are likely to be dynamically generated. So, would it be worth it to add all those features just to optimize those ? I don't know...
On Mon, 2008-04-14 at 17:08 +0200, PFC wrote:
> Those "Decision" nodes could potentially lead to lots of decisions (ahem).
> What if you have 10 conditions in the Where, plus some joined ones ? That
> would make lots of possibilities...

Yes, that's true, but most of them are likely not relevant for the end result. In any real-life query there are a few parameters which are really important for which plan you should choose... the key here is that you should spend more time on finding the possibilities for a cached plan than you do for a one-shot query. In principle, one-shot planning should be the default, and caching should be something the user has to choose deliberately. I would really like a special command to plan and cache a query without actually executing it, possibly with a parameter for how hard to try... e.g. you could expend the extra cycles to eliminate all redundancies from boolean expressions and IN lists, to get the parse tree into a canonical format -- all things which make planning easier. All of these lose on one-shot queries, but once you cache, you can really do a lot of smarts which were no-nos before...

> Consider several types of queries :
>
> - The small, quick query which returns one or a few rows : in this case,
> planning overhead is large relative to execution time, but I would venture
> to guess that the plans always end up being the same.

Consider a 'select a where b like $1' -> the parameter $1 will considerably affect the query plan. A query can't get much simpler...

> - The query that takes a while : in this case, planning overhead is nil
> compared to execution time, better replan every time with the params.

I guess these queries are not the ones targeted by this feature. In fact, for these queries it really doesn't matter if you cache or not, except: if you know you're going to cache, you'll expend more effort planning right, and that could still matter for a query which runs long.
Note that if you don't cache, planning harder will lose in the long run, only once you cache you can afford to plan harder... > - The complex query that still executes fast because it doesn't process a > lot of rows and postgres finds a good plan (for instance, a well optimized > search query). Those would benefit from reducing the planning overhead, > but those also typically end up having many different plans depending on > the search parameters. Besides, those queries are likely to be dynamically > generated. So, would it be worth it to add all those features just to > optimize those ? I don't know... We have here dynamically generated queries which are specifically chunked to be executed in small increments so none of the queries runs too long (they would block vacuuming vital tables otherwise). Those chunks would greatly benefit from properly planned and cached plans... A real smart system would store canonical plan fragments as response to (also canonicalized) parse tree fragments, and then assemble the plan out of those fragments, but that would be indeed really complex (to design, the resulting code might be simpler than one thinks) ;-) Cheers, Csaba.
On Mon, Apr 14, 2008 at 5:01 PM, Csaba Nagy <nagy@ecircle-ag.com> wrote:
> On Mon, 2008-04-14 at 10:55 -0400, Mark Mielke wrote:
> > The other ideas about automatically deciding between plans based on
> > ranges and such strike me as involving enough complexity and logic, that
> > to do properly, it might as well be completely re-planned from the
> > beginning to get the most benefit.
>
> ... except if you hard-wire the most common alternative plans, you still
> get the benefit of cached plan for a wider range of parameter values.
> Not to mention that if you know you'll cache the plan, you can try
> harder planning it right, getting possibly better plans for complex
> queries... you could argue that complex queries tend not to be repeated,
> but we do have here some which are in fact repeated a lot in batches,
> then discarded. So I guess a cached plan discard/timeout mechanism would
> also be nice.

I think ANALYZE on the tables involved should _force_ replanning of cached queries. After all, if ANALYZE fired, then the contents changed substantially, and replanning feels like a good idea.

As for the planner getting smarter (and slower ;)) -- complex queries tend not to be repeated -- so it is worth the trouble to plan them carefully. These would benefit from a smarter planner with or without caching. The problem is with "simple queries", which one could argue are the majority of queries. That's where the caching comes in. If you cache the queries, you can let the planner be smarter (and slower). If you don't cache, you probably don't want to trade the speed of a "frequent simple query" for that of a "once in a while complex query".

That said, for me the most important feature is the possibility of good online query statistics. :)

Regards,
Dawid
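The ANALYZE-forces-replanning idea boils down to dependency tracking: each cached plan records which tables it depends on, and an ANALYZE invalidates every dependent plan. A toy Python sketch of that bookkeeping (class and method names are hypothetical):

```python
class PlanCacheWithDeps:
    """Cached plans record the tables they depend on; an ANALYZE on a
    table discards (or could queue for background replan) every plan
    that depends on it."""

    def __init__(self):
        self.plans = {}   # sql -> plan
        self.deps = {}    # table name -> set of sql keys depending on it

    def put(self, sql, plan, tables):
        self.plans[sql] = plan
        for t in tables:
            self.deps.setdefault(t, set()).add(sql)

    def get(self, sql):
        return self.plans.get(sql)

    def on_analyze(self, table):
        """Hook called after ANALYZE: drop dependent plans so the next
        execution replans with the fresh statistics."""
        invalidated = self.deps.pop(table, set())
        for sql in invalidated:
            self.plans.pop(sql, None)
        return len(invalidated)

cache = PlanCacheWithDeps()
```

A background-replan variant would re-enter `on_analyze`'s invalidated set into a work queue instead of just dropping the plans; the dependency map is the same either way.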
* Dawid Kuroczko: > Right now the only way of getting such information from PostgreSQL > is by logging all queries and analyzing logs. The current_query > column of pg_stat_activity is useless as the (prepared) queries are > usually so short lived that you will see one execution out of > thousands happening. If the cached plans are kept a bit longer than actually necessary, it might also be possible to see the query plan of a query that involves temporary tables, something that is somewhat convoluted to do otherwise (ptop uses the query string from pg_stat_activity and tacks an EXPLAIN in front of it, which breaks with temporary tables). -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99