Re: Caching of Queries - Mailing list pgsql-performance
From:           Pierre-Frédéric Caillaud
Subject:        Re: Caching of Queries
Msg-id:         opsfap0zbrcq72hf@musicbox
In response to: Re: Caching of Queries ("Jim C. Nasby" <decibel@decibel.org>)
List:           pgsql-performance
> pgpool (which makes some rather questionable claims IMO); any decent web
> application language/environment will support connection pooling.

That's why it should not be tied to something as specific as pgpool.

If you want performance, which is the case here, you usually have a web server serving static files and an application server serving dynamic pages. This does not have to be a huge application server; it can be as simple as one Apache instance serving static files, with a special path mod_proxy'ed to another Apache instance acting as the application server.

IMHO this is a nice way to do it, because you have a lightweight static-file server which can spawn many processes without consuming precious resources like memory and postgres connections, and a specialized server with far fewer processes, each one larger and holding a db connection, etc. The connections are permanent, of course, so there is no connection overhead. The proxy has the extra advantage of buffering the data from the "app server" and sending it back slowly to the client, so the app server can very quickly move on to the next request instead of hogging a db connection while the html slowly trickles back to the client.

IMHO the standard PHP way of doing things (just one server) is wrong, because every server process, even one serving static files, hogs a connection and thus needs an extra pooling layer.

Thus, I see query result caching as a way of pushing further architectures which are already optimized for performance, not as a band-aid for poor designs like the single Apache server with pooling.
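The two-tier setup described above might look roughly like this in the front-end Apache's configuration (a sketch only; paths, the `/app` prefix, and port 8080 are my own illustrative assumptions):

```apache
# Front-end Apache: serves static files itself and forwards the
# dynamic path to a second Apache instance acting as the app server.
DocumentRoot "/var/www/static"

# Requests under /app go to the app server; everything else is
# served directly from DocumentRoot by lightweight processes.
ProxyPass        /app http://127.0.0.1:8080/app
ProxyPassReverse /app http://127.0.0.1:8080/app
```

The front end then needs no database connections at all; only the back-end instance's (few) processes hold them.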
Now, a proposition. Here is where we are now, with a typical slow query:

	PREPARE myquery(text, integer)
	EXECUTE myquery('john', 2)

My proposition:

	PREPARE myquery(text, integer)
	    PLANNED USING ('john', 2)
	    CACHED IF $1 IS NOT NULL AND $2 IS NOT NULL
	    DEPENDS ON $1, $2
	    MAXIMUM CACHE TIME '5 minute'::interval
	    MINIMUM CACHE TIME '1 minute'::interval
	    MAXIMUM CACHE SIZE 2000000
	    AS SELECT count(*) AS number FROM mytable
	       WHERE myname = $1 AND myfield >= $2;

	EXECUTE myquery('john', 2)

Explanations:

-----------
PLANNED USING ('john', 2)

Tells the planner to compute the stored query plan using the given parameters. This is independent from caching, but it could be a nice feature in itself, as it would avoid the possibility of storing a bad query plan.

-----------
CACHED IF $1 IS NOT NULL AND $2 IS NOT NULL

Specifies that the result is to be cached. There is an optional condition (here, IF ...) telling postgres when it should and should not cache. It could be useful to avoid wasting cache space.

-----------
DEPENDS ON $1, $2

Defines the cache key. I don't know if this is useful, as the query parameters make a pretty obvious cache key, so why repeat them. It could be used to add other data as a cache key, like:

	DEPENDS ON (SELECT somefunction($1))

Also, a syntax for specifying which tables should be watched for updates, and which should be ignored, could be interesting.

-----------
MAXIMUM CACHE TIME '5 minute'::interval

Pretty obvious.

-----------
MINIMUM CACHE TIME '1 minute'::interval

This query is a count and I want a fast but imprecise count. Thus, I specify a minimum cache time of 1 minute, meaning that the result will stay in the cache even if the tables change. This is dangerous, so I'd suggest the following:

	MINIMUM CACHE TIME CASE WHEN result.number > 10
	                        THEN '1 minute'::interval
	                        ELSE '5 second'::interval END

Thus the cache time is an expression; it is evaluated after performing the query.
There needs to be a way to access the 'count' result, which I called 'result.number' because of the SELECT count(*) AS number. The result could also be used in the CACHED IF.

The idea here is that the count will vary over time, but we accept some imprecision to gain speed. Who cares if a forum thread counter says 225 instead of 227 messages anyway? However, if there are only 2 messages, caching the query is less necessary (it's fast), and a variation in the count will be much easier to spot; thus we specify a shorter cache duration for small counts and a longer one for large counts.

For queries returning result sets this is not usable, of course, but a special feature for speeding up count() queries would be welcome!

-----------
MAXIMUM CACHE SIZE 2000000

Pretty obvious. Size in bytes.

For queries returning several rows, MIN/MAX limits on result rows could also be useful:

	MAXIMUM RESULT ROWS nnn

Or maybe:

	CACHED IF (SELECT count(*) FROM result) > nnn

Thinking about it, using prepared queries seems a bad idea; maybe the cache should act on the results of functions instead. This would force application programmers to put the queries they want to optimize in functions, but as function code is shared between connections while prepared statements are not, it might be easier to implement, and it would shield against some subtle bugs, like PREPARing different queries under the same name... In that case the cache manager would also know whether the function returns SETOF or not, which would be interesting.

What do you think of these optimizations? By the way, right now a count() query cache could already be implemented as a simple plpgsql function with a table as the cache.
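Taken together, the DEPENDS ON / MINIMUM / MAXIMUM CACHE TIME semantics proposed above can be modelled outside the database. Here is an illustrative Python sketch (the class, its method names, and the version-counter invalidation scheme are my own inventions, not proposed syntax):

```python
import time

class QueryCache:
    """Sketch of the proposed behaviour: results are keyed by the query
    parameters, survive table changes for at least min_ttl seconds, and
    are never served past max_ttl seconds."""

    def __init__(self, min_ttl, max_ttl):
        self.min_ttl, self.max_ttl = min_ttl, max_ttl
        self.version = 0          # bumped whenever a watched table changes
        self.store = {}           # params -> (result, cached_at, version)

    def table_changed(self):
        """Invalidate entries (subject to min_ttl) after a table update."""
        self.version += 1

    def get(self, params, run_query):
        now = time.monotonic()
        hit = self.store.get(params)
        if hit is not None:
            result, cached_at, version = hit
            age = now - cached_at
            fresh = version == self.version   # no table change since caching
            # Within min_ttl the entry is served even if tables changed;
            # after that, only while unchanged and younger than max_ttl.
            if age < self.min_ttl or (fresh and age < self.max_ttl):
                return result
        result = run_query(*params)           # cache miss: run the real query
        self.store[params] = (result, now, self.version)
        return result
```

A per-entry MINIMUM CACHE TIME computed from the result (the CASE expression above) would simply replace the fixed `min_ttl` with a callable evaluated after `run_query` returns.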
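As for the closing remark, the "function plus a cache table" approach to a count() cache can be prototyped in any language. Here is a hedged sketch in Python with SQLite standing in for Postgres (the `count_cache` table and `cached_count` function are made-up names, and a real plpgsql version would live inside the database):

```python
import sqlite3
import time

def cached_count(db, table, ttl=60.0):
    """Look the count up in a cache table first; only run the real
    count(*) when the cached value is missing or older than ttl."""
    db.execute("""CREATE TABLE IF NOT EXISTS count_cache
                  (tbl TEXT PRIMARY KEY, n INTEGER, cached_at REAL)""")
    now = time.time()
    row = db.execute("SELECT n, cached_at FROM count_cache WHERE tbl = ?",
                     (table,)).fetchone()
    if row is not None and now - row[1] < ttl:
        return row[0]                    # cache hit: fast, possibly imprecise
    # Cache miss or expired: run the real count and store it.
    # (Interpolating table names like this is unsafe outside a sketch.)
    n = db.execute(f"SELECT count(*) FROM {table}").fetchone()[0]
    db.execute("INSERT OR REPLACE INTO count_cache VALUES (?, ?, ?)",
               (table, n, now))
    return n
```

The ttl plays the role of MAXIMUM CACHE TIME; there is no invalidation on writes, which is exactly the imprecision-for-speed trade-off discussed above.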