Thread: Re: eWeek Poll: Which database is most critical to your
-----Original Message-----
From: Neil Conway [mailto:nconway@klamath.dyndns.org]
Sent: Tuesday, February 26, 2002 3:04 PM
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] eWeek Poll: Which database is most critical to your

On Tue, 2002-02-26 at 15:30, Zak Greant wrote:
> Good Day All,
>
> eWeek has posted a poll that asks which database server is most critical
> to your organization.

The article mentions a MySQL feature which apparently improved
performance considerably:

//
MySQL 4.0.1's new, extremely fast query cache is also quite notable, as
no other database we tested had this feature. If the text of an incoming
query has a byte-for-byte match with a cached query, MySQL can retrieve
the results directly from the cache without compiling the query, getting
locks or doing index accesses. This query caching will be effective only
for tables with few updates, because any table updates clear the cache to
guarantee that correct results are always returned.
//

My guess is that it would be relatively simple to implement. Any
comments on this?

If I implemented this, any chance this would make it into the tree? Of
course, it would be:

 - disabled by default
 - enabled on a table-by-table basis (maybe an ALTER TABLE command)

>>>>>>>>

I don't see how it will do any good. There is no "prepare" in
PostgreSQL, and therefore you will simply be re-executing the queries
every time anyway. Also, parameter markers only work in embedded SQL, and
that is a single-tasking system.

I think it would be a major piece of work to do anything useful along
those lines.

If you look at how DB/2 works, you will see that they store prepared
statements. Another alternative would be to keep some point in the
parser marked and somehow jump to that point, but you would have to
be able to save a parse tree somewhere and also recognize the query.

Here is where problems come in...

-- Someone wants blue and blue-green, etc. shirts that are backordered
SELECT shirt, color, backorder_qty FROM garments WHERE color like 'BLUE%'

Now, another query comes along:

-- Someone else wants reddish, etc. shirts that are backordered:
SELECT shirt, color, backorder_qty FROM garments WHERE color like 'RED%'

It's the same query with different data. Without parameter markers you
will never know it. And yet this is exactly the sort of caching that is
useful.
On Tue, 2002-02-26 at 18:20, Dann Corbit wrote:
> I don't see how it will do any good. There is no "prepare" in
> Postgresql and therefore you will simply be reexecuting the queries
> every time anyway. Also, parameter markers only work in embedded SQL
> and that is a single tasking system.

Perhaps I wasn't clear. The feature I'm proposing is this:

When processing SELECT queries, but before any real work has been done,
look up the query in a hash table. If it already exists, return the
cached result. If it doesn't exist, execute the query and cache the
result in the hash table. Optionally, we could not immediately cache the
query, but just increment a "frequency" counter stored in the hash table.
If the counter goes above a certain constant, we decide that the query is
worth caching, so we cache the full result in the hash table.

When processing INSERTs, UPDATEs and DELETEs, check whether the query
would affect any of the tables for which we are maintaining this cache.
If so, flush the cache. This ensures that we will never return invalid
results. We could perhaps be fancy and keep stats on which columns our
cached queries utilize and which columns the modifying query will
affect, but that is unlikely to be an overall win.

HOWEVER -- I don't see this feature as something that will appeal to,
say, 75% of PgSQL users. If the table in question is being modified on a
regular basis, or if a wide variety of queries are being issued, this
cache isn't a good idea. Nevertheless, I think there are certainly some
situations in which this cache is useful -- and furthermore, these kinds
of "mostly read-only" situations are often where MySQL is chosen over
PostgreSQL.

Anyway, just putting this on the table -- if the consensus is that this
isn't a very worthwhile feature, I won't bother with it.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
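In outline, the proposal amounts to something like the following (a
minimal single-process sketch in Python; all names and the threshold
constant are invented, and a real backend cache would have to live in
shared memory, as Tom Lane points out later in the thread):

    CACHE_THRESHOLD = 3     # hypothetical: cache after this many repeats

    result_cache = {}       # query text -> cached result rows
    seen_count = {}         # query text -> how often we've seen it
    tables_to_queries = {}  # table name -> set of cached query texts

    def run_select(query, tables, execute):
        """Serve a byte-for-byte identical query from cache, else execute."""
        if query in result_cache:
            return result_cache[query]
        seen_count[query] = seen_count.get(query, 0) + 1
        rows = execute(query)
        if seen_count[query] >= CACHE_THRESHOLD:
            result_cache[query] = rows
            for t in tables:
                tables_to_queries.setdefault(t, set()).add(query)
        return rows

    def run_modify(query, tables, execute):
        """On INSERT/UPDATE/DELETE, flush every cached query touching the tables."""
        for t in tables:
            for q in tables_to_queries.pop(t, ()):
                result_cache.pop(q, None)
                seen_count.pop(q, None)
        return execute(query)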
Dann Corbit wrote:
<snip>
> It's the same query with different data. Without parameter markers you
> will never know it. And yet this is exactly the sort of caching that is
> useful.

However, an exact match is still not a bad idea. You might have a
different execution plan depending on the statistics of the data in your
column.

If there were a way to store the execution plan and start executing from
there, it still might not be a bad idea.
Neil Conway wrote:
<snip>

Hi everyone,

This is getting me to think: we don't have a really established process
for PostgreSQL development which allows for "experimental features", aka
the Linux kernel "EXPERIMENTAL", and so forth. aka:

./configure --experimental-querycache --experimental-pertablecostings
--experimental-something?

This way, people can still use PostgreSQL as per normal, but it also
allows for development of code which might or might not actually see the
light of day, depending on whether it turns out to really be useful.

Things should only become "experimental features" after the correct
consideration, not by default. There are situations where it would be
beneficial (perhaps in Neil's example here).

If Neil were to split off a project and do it on Sourceforge, then only
some people would check it out when it's kind of ready, etc, etc, and it
wouldn't really be as available to a wider audience. But if it's part of
the present source tree, I think the experimental features would see a
wider audience and we'd get a better indication of what's good/bad/etc.

Don't know how to support those features, but pretty much developers and
hard-core users should be the only ones using them, and therefore they
should be able to figure most of the stuff out for themselves (i.e. we'd
only get the real questions).

Does this seem like a worthwhile viewpoint to consider? If so, perhaps we
can think about starting to implement it for 7.3? (Wish I could code
better.) :)

Regards and best wishes,

Justin Clift

--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi
Neil Conway <nconway@klamath.dyndns.org> writes:
> When processing INSERTs, UPDATEs and DELETEs, check whether the query
> would affect any of the tables for which we are maintaining this cache.
> If so, flush the cache. This ensures that we will never return invalid
> results.

Note that this would imply that the cache is *global* across all
backends; therefore it is a shared data structure and hence an access
bottleneck. (Not to mention a memory-management headache, since the
size of shared memory can't easily be varied on-the-fly.)

I cannot believe that caching results for literally-identical queries
is a win, except perhaps for the most specialized (read: brain-dead)
applications. Has anyone looked at the details of the test case that
MySQL uses to claim that this is a good idea? Has it got any similarity
to your own usage patterns?

We have talked about caching query plans for suitably-parameterized
queries, but even that seems unlikely to be a huge win; at least I'd
not think it useful to try to drive the cache totally automatically.
If an application could say "here's a query I expect to use a lot,
varying these specific parameters", then caching a plan for that would
make sense.

Now, there are notions of "prepared statements" in many access APIs
that fit this description, and in fact the underlying capability exists
in the backend --- we've just not gotten around to building the
interfaces to tie it all together. *That* would be worth working on.

			regards, tom lane
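The division of labor Tom describes -- the application declares the
parameters, the backend parses and plans once -- can be mocked up like
this (pure illustration; every name here is invented, since no such
interface existed at the time):

    class PreparedQuery:
        def __init__(self, sql, plan_query):
            self.sql = sql
            self.plan = plan_query(sql)    # parse/plan exactly once

        def execute(self, *params):
            # Only execution happens per call; parsing/planning is skipped.
            return self.plan(*params)

    def plan_query(sql):
        print("planning:", sql)            # happens once, not per call
        return lambda *params: ("rows for", sql, params)

    q = PreparedQuery(
        "SELECT shirt, color, backorder_qty FROM garments WHERE color LIKE $1",
        plan_query,
    )
    for pattern in ("BLUE%", "RED%"):
        q.execute(pattern)                 # no re-planning on these calls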
On Wed, 2002-02-27 at 00:39, Tom Lane wrote:
> I cannot believe that caching results for literally-identical queries
> is a win, except perhaps for the most specialized (read brain dead)
> applications.

According to MySQL: "The query cache is extremely useful in an
environment where (some) tables don't change very often and you have a
lot of identical queries. This is a typical situation for many web
servers that use a lot of dynamic content."

Would people agree with the MySQL guys on this? In particular, that this
is a "typical situation" for many webapps?

> Has anyone looked at the details of the test case that
> MySQL uses to claim that this is a good idea?

I emailed the author of the eWeek benchmarks asking for more
information. My guess is that the benchmark results benefit from the
query cache because it executes exactly the same query over and over
again (e.g. selecting the same product from the database every time, not
simulating typical user behavior by selecting random products). If that
is the case, the results are clearly irrelevant.

> Has it got any similarity to your own usage patterns?

I would be very interested to hear about this as well.

> Now, there are notions of "prepared statements" in many access APIs
> that fit this description, and in fact the underlying capability exists
> in the backend --- we've just not gotten around to building the
> interfaces to tie it all together. *That* would be worth working on.

Okay, I'll take a look at this...

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
> According to MySQL: "The query cache is extremely useful in an
> environment where (some) tables don't change very often and you have a
> lot of identical queries. This is a typical situation for many web
> servers that use a lot of dynamic content."
>
> Would people agree with the MySQL guys on this? In particular, that
> this is a "typical situation" for many webapps?

Hmmm. We have a lot of repeated _parameterised_ queries, but the
recurrence of identical queries is quite small. It'd be an interesting
thing to try and measure.

> > Now, there are notions of "prepared statements" in many access APIs
> > that fit this description, and in fact the underlying capability
> > exists in the backend --- we've just not gotten around to building
> > the interfaces to tie it all together. *That* would be worth working
> > on.
>
> Okay, I'll take a look at this...

This is the more general solution, compared to MySQL's query cache --
and it can speed up parameterised queries as well as identical
queries...

Chris
On Tue, 26 Feb 2002 15:20:17 PST, "Dann Corbit" wrote:
> -----Original Message-----
> From: Neil Conway [mailto:nconway@klamath.dyndns.org]
<snip>
> Here is where problems come in...
> -- Someone wants blue and blue-green, etc shirts that are backordered
> SELECT shirt, color, backorder_qty FROM garments WHERE color like
> 'BLUE%'
>
> Now, another query comes along:
>
> -- Someone else wants reddish, etc shirts that are backordered:
> SELECT shirt, color, backorder_qty FROM garments WHERE color like 'RED%'
>
> It's the same query with different data. Without parameter markers you
> will never know it. And yet this is exactly the sort of caching that is
> useful.

While the cache that MySQL implemented may be of limited value for
dynamic queries, it would be very useful for many, many queries. For
example, in many applications (especially the ones that I write), there
are literally hundreds of selects against static "lookup" tables that
maintain acceptable values for new data inputs (i.e., validated using
referential integrity). Each and every one of those selects would
receive an improvement.

Also, it is relatively easy to "parameterize" the select. After all, it
is a structured query. Just by parsing it, you can identify the
"parameters". As a matter of fact, I'm pretty certain that this is
already done by the optimizer when breaking the SQL down into the code
to perform the query. The trick is to cache the "parameterized" version
of the query. This is likely at or near the end of the optimizer
processing.

Also, you would want to capture the query plan in the cache. The query
plan is not going to be interested at all in the literal value of the
parameters and therefore will be the same for any query of the same
form. For example, from above:

SELECT shirt, color, backorder_qty FROM garments WHERE color like 'BLUE%'

should become something on the order of:

SELECT shirt, color, backorder_qty FROM garments WHERE color like '{param0}%'

The next query:

SELECT shirt, color, backorder_qty FROM garments WHERE color like 'RED%'

should also be "parameterized" on the order of:

SELECT shirt, color, backorder_qty FROM garments WHERE color like '{param0}%'

A lookup into the query hash would match, and therefore _at least_ the
same query plan can be used.

The commercial database "Cache" by Intersystems uses an approach similar
to this. The performance of that database is phenomenal. (Of course,
there is more going on in the internals of that system than just the
query cache.)
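Mechanically, the text-level parameterization described here could be
sketched as follows (illustration only; a real implementation would work
from the parse tree, would handle numeric literals and escaping, and --
note -- this naive version swallows the '%' into the parameter, which is
exactly the subtlety Tom Lane raises next):

    import re

    # Replace single-quoted string literals with numbered placeholders.
    LITERAL = re.compile(r"'((?:[^']|'')*)'")

    def parameterize(sql):
        """Return (template, literals) so queries of the same form share a key."""
        literals = []
        def repl(m):
            literals.append(m.group(1))
            return "'{param%d}'" % (len(literals) - 1)
        return LITERAL.sub(repl, sql), literals

    template, params = parameterize(
        "SELECT shirt, color, backorder_qty FROM garments WHERE color like 'BLUE%'"
    )
    # template: "... WHERE color like '{param0}'"
    # params:   ["BLUE%"]  -- the wildcard ends up inside the parameter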
F Harvell <fharvell@fts.net> writes:
> The query plan is not going to be interested at all in
> the literal value of the parameters and therefore will be the same for
> any query of the same form.

Unfortunately, this is completely false.

> For example, from above:
>
> SELECT shirt, color, backorder_qty FROM garments WHERE color like
> 'BLUE%'
>
> should become something on the order of:
>
> SELECT shirt, color, backorder_qty FROM garments WHERE color like
> '{param0}%'

You managed to pick an example that's perfectly suited to demolish your
assertion. The query with "color like 'BLUE%'" can be optimized into an
indexscan (using index quals of the form "color >= 'BLUE' and color <
'BLUF'"), at least in C locale. The parameterized query cannot be
optimized at all, because the planner cannot know whether the
substituted parameter string will provide a left-anchored pattern.
What if param0 contains '_FOO' at runtime? An indexscan will be
useless in that case.

In general, Postgres' query plans *do* depend on the values of
constants, and it's not always possible to produce an equally good plan
that doesn't assume anything about constants. This is why I think it's
a lousy idea for the system to try to automatically abstract a
parameterized query plan from the actual queries it sees. On the other
hand, an application programmer will have a very good idea of which
parts of a repeated query are really constant and which are parameters.
So what we really need is preparable parameterized queries, wherein the
application tells us what to parameterize, rather than having to guess
about it.

			regards, tom lane
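For reference, the LIKE-to-range rewrite Tom describes can be sketched
like this (simplified; the real planner logic is far more careful about
locales and multi-byte encodings):

    def like_prefix_bounds(pattern):
        """Turn a left-anchored LIKE pattern into index-range bounds.

        'BLUE%' -> ('BLUE', 'BLUF'): usable as  color >= lo AND color < hi.
        Returns None when the pattern is not left-anchored (e.g. '_FOO'),
        which is why the planner must see the constant to decide.
        """
        prefix = ""
        for ch in pattern:
            if ch in "%_":         # first wildcard ends the fixed prefix
                break
            prefix += ch
        if not prefix:             # no fixed prefix: an indexscan is useless
            return None
        # Increment the last character for an exclusive upper bound
        # (assumes single-byte, C-locale-style ordering).
        return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)

    print(like_prefix_bounds("BLUE%"))  # ('BLUE', 'BLUF')
    print(like_prefix_bounds("_FOO"))   # None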
> In general, Postgres' query plans *do* depend on the values of
> constants, and it's not always possible to produce an equally good plan
> that doesn't assume anything about constants. This is why I think it's
> a lousy idea for the system to try to automatically abstract a
> parameterized query plan from the actual queries it sees. On the other
> hand, an application programmer will have a very good idea of which
> parts of a repeated query are really constant and which are parameters.
> So what we really need is preparable parameterized queries, wherein the
> application tells us what to parameterize, rather than having to guess
> about it.

I think we could store the constants that went with the saved plan and
re-use the plan if the new constants were _similar_ to the old ones.
(Of course, figuring out _similar_ is the trick here.)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
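One speculative reading of "similar" (nothing like this exists in the
tree; the heuristic here is invented purely to make the idea concrete):
reuse a plan only when the new constant preserves the structural property
the old plan relied on, e.g. left-anchoredness of a LIKE pattern:

    def similar_like_constants(old, new):
        """Two LIKE patterns are 'similar' enough to share a plan if both
        are left-anchored, or both are not (invented heuristic)."""
        def anchored(p):
            return bool(p) and p[0] not in "%_"
        return anchored(old) == anchored(new)

    similar_like_constants("BLUE%", "RED%")   # True: both can use an indexscan
    similar_like_constants("BLUE%", "_FOO%")  # False: replan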
On Wed, Feb 27, 2002 at 12:39:16AM -0500, Tom Lane wrote:
> I cannot believe that caching results for literally-identical queries
> is a win, except perhaps for the most specialized (read brain dead)

I don't think they are brain dead. Well, that is, at first I thought so
too, but thinking about it some more, it made sense. After all, MySQL is
used mostly for web pages, and even your dynamic content doesn't change
that often. But in between there are thousands of concurrent accesses
that all execute the very same statement.

This feature makes no sense IMO for the "normal" use we both probably
had in mind when first reading, but for this web usage I see a benefit,
if it's implementable.

Michael
--
Michael Meskes
Michael@Fam-Meskes.De
Go SF 49ers! Go Rhein Fire!
Use Debian GNU/Linux! Use PostgreSQL!
On Wed, 27 Feb 2002 16:24:45 EST, Tom Lane wrote:
> F Harvell <fharvell@fts.net> writes:
> > The query plan is not going to be interested at all in
> > the literal value of the parameters and therefore will be the same
> > for any query of the same form.
>
> Unfortunately, this is completely false.
<snip>
> The parameterized query cannot be optimized at all, because the planner
> cannot know whether the substituted parameter string will provide a
> left-anchored pattern. What if param0 contains '_FOO' at runtime? An
> indexscan will be useless in that case.

Thanks for the feedback. In the example that was used, it is important
to note that {param0} was the string literal "BLUE" and not the %
"operator". This IMHO ties the query to a left-anchored pattern. I
certainly do not think that the "parameter" can be anything but a
literal. Functions and operators would very likely affect any query
plan.

Is it true that the optimizer manipulates the literal? It would seem
that that would require a huge amount of processing (due to character
sets, etc.). It would appear more viable to use a simpler optimization
that does not manipulate the literal, such as an index qual of the form
"color{0,4} == 'BLUE'", than to generate a range comparison. Of course,
this is a very simple query and I am likely missing a critical concept.

Thanks,
F Harvell
On Wed, 2002-02-27 at 18:39, Tom Lane wrote:
> I cannot believe that caching results for literally-identical queries
> is a win, except perhaps for the most specialized (read brain dead)
> applications.

We have an application which we developed, to which we specifically
added query caching of the type being discussed. This was added within
the application, rather than within the database, and provided huge
gains for the client.

One difficulty is in knowing when the results of the exact same query
will result in a different result set (i.e. invalidating your cached
result sets), but as far as I can see this is just as knowable at the
database level as it was for us in the application code.

If this were available I wouldn't want it applied to _all_ queries,
however -- I'd want to be able to enable it against particular tables
and/or queries (po: a 'WITH RESULTCACHE' option on the end of a SELECT)
where I knew there was going to be a dramatic win.

The reason there was such a large win for our application was that when
people go to a job website on Monday morning to look for a better place
to work, there is a strong tendency for them to be searching for much
the same sort of thing (as each other) -- especially if the site is
getting in the millions of page views.

The application I talk about here was written in Progress (WebSpeed),
and we have found it much harder to do this sort of thing more recently
in Oracle or PostgreSQL.

This would be nice.

Regards,
Andrew.
--
--------------------------------------------------------------------
Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington
WEB: http://catalyst.net.nz/        PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201  MOB: +64(21)635-694  OFFICE: +64(4)499-2267
Are you enrolled at http://schoolreunions.co.nz/ yet?
If you were to expire the cache when the table was updated it wouldn't
be so bad, and you certainly don't want to cache everything under the
assumption something may come along -- as it probably won't.

But if you're interested in speeding this up, I can't see any harm in
adding a stat into statistics that stores query frequency. If the
frequency goes above a certain notch (without the results changing --
static table data), then cache results.

However, I think 99% of these cases could be easily rectified with a PHP
interface which caches the results into memory at the user's request, as
I really don't see hundreds of machines making the same queries
frequently -- rather 1 or 2 poorly written ones :)

Then again, I regularly hit one-row database tables which act as markers
for data changes, to see if the program should reload the entire data
set. (Parallel machines for interface purposes normally cache table data
in RAM, but test for changes with every request -- the goal was that
they shouldn't have to communicate with each other.) But a 1-row table
is very quick to select from. Can this make it faster? If so, it'll be
useful.

Anyway, write it up. If you can speed up some queries without making a
hit against others, it'll be a large advantage.

--
Rod Taylor

This message represents the official view of the voices in my head

----- Original Message -----
From: "Neil Conway" <nconway@klamath.dyndns.org>
To: "Dann Corbit" <DCorbit@connx.com>
Cc: <pgsql-hackers@postgresql.org>
Sent: Tuesday, February 26, 2002 6:42 PM
Subject: Re: [HACKERS] eWeek Poll: Which database is most critical to your

> Perhaps I wasn't clear. The feature I'm proposing is this:
<snip>
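Rod's one-row marker-table trick might look roughly like this on the
client side (a sketch; the data_version table and the driver callables
are invented for illustration):

    # Client-side cache keyed on a one-row "marker" table: writers bump a
    # counter on every change, readers reload only when it moves.
    cached_rows = None
    cached_version = None

    def get_lookup_data(query_one, query_all):
        """query_one/query_all stand in for a real driver's fetch calls."""
        global cached_rows, cached_version
        version = query_one("SELECT version FROM data_version")  # cheap 1-row hit
        if cached_rows is None or version != cached_version:
            cached_rows = query_all("SELECT * FROM lookup_table")  # full reload
            cached_version = version
        return cached_rows

    # Writers bump the marker in the same transaction as the data change:
    #   UPDATE data_version SET version = version + 1;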
On Wed, 2002-02-27 at 10:39, Tom Lane wrote:
> Neil Conway <nconway@klamath.dyndns.org> writes:
> > When processing INSERTs, UPDATEs and DELETEs, check whether the query
> > would affect any of the tables for which we are maintaining this
> > cache. If so, flush the cache. This ensures that we will never
> > return invalid results.
>
> Note that this would imply that the cache is *global* across all
> backends; therefore it is a shared data structure and hence an access
> bottleneck. (Not to mention a memory-management headache, since the
> size of shared memory can't easily be varied on-the-fly.)

I think that it would be enough if the hashes and bookkeeping info
(tmin, tmax, filename) were in a global table. We could also purge all
data more than a few minutes old. We also need an inverse lookup from
changed table to cached query, for cache invalidation on
insert/update/delete.

The result could even be saved in temp files and still be mostly faster
than doing the full parse/plan/execute, both for complex queries
returning a few rows (saves planning time) and for queries returning
many rows (saves execute time). The format used for saving should be
the exact wire protocol, so that efficient system calls could be used
where available (Linux 2.4+ has a system call that will transfer a
whole file to a socket in one call, bypassing all copying and caching).

The lookup part will be pretty trivial -- look up using the hash, check
tmin/tmax, and if ok, push the cached result out to the client. This
will make us as fast as or faster than MySQL for the
trivial_benchmark/busy_website case.

The cache creation/maintenance part will be much trickier. When creating
the cache:

* the tables affected can be determined only from fully built plans,
  because of possible rule expansions.
* if there is a trigger on select for this query, it can't be cached.
* put some temporary insert/update/delete triggers on all real tables
  used in the query that will invalidate the cache -- as an alternative
  we could always run the invalidate-query-cache code for the affected
  table on insert/update/delete, if exact caching is enabled.
* invalidate the cache on schema changes.
* run a periodic check and invalidate old cache entries.

Some of the above could also be needed for caching query plans.

> I cannot believe that caching results for literally-identical queries
> is a win, except perhaps for the most specialized (read brain dead)
> applications.

Actually, a web app that looks up the contents of a 5-row combobox is
not really brain-dead. Doing all the caching (and cache invalidation) on
the client side is hard and fragile -- what happens when someone adds a
trigger in the backend?

> Has anyone looked at the details of the test case that
> MySQL uses to claim that this is a good idea? Has it got any similarity
> to your own usage patterns?

Yes -- for content management / web apps. No -- for bean-counting apps.

> We have talked about caching query plans for suitably-parameterized
> queries, but even that seems unlikely to be a huge win; at least I'd
> not think it useful to try to drive the cache totally automatically.
> If an application could say "here's a query I expect to use a lot,
> varying these specific parameters" then caching a plan for that would
> make sense.
>
> Now, there are notions of "prepared statements" in many access APIs
> that fit this description, and in fact the underlying capability exists
> in the backend --- we've just not gotten around to building the
> interfaces to tie it all together. *That* would be worth working on.

Sure. It would

a) make many queries faster
b) make client libs (ODBC/JDBC/ECPG) faster and simpler by not forcing
   them to fake it.

But there is also a big class of applications that would benefit much
more from caching exact queries. And it will make us as fast as MySQL
for 100000 consecutive calls of SELECT MAX(N) FROM T ;)

---------------
Hannu
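The Linux system call Hannu alludes to is sendfile(2). Pushing a cached,
wire-format result file straight to the client socket could look like
this (a sketch; the cache-file layout and names are assumed, not real):

    import os

    def send_cached_result(sock, cache_path):
        """Copy a cached wire-format result to the client socket with
        sendfile(2), bypassing userspace copies entirely."""
        fd = os.open(cache_path, os.O_RDONLY)
        try:
            size = os.fstat(fd).st_size
            offset = 0
            while offset < size:
                # os.sendfile returns the number of bytes actually sent
                offset += os.sendfile(sock.fileno(), fd, offset, size - offset)
        finally:
            os.close(fd)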