Re: Caching of Queries - Mailing list pgsql-performance

From:           Aaron Werman
Subject:        Re: Caching of Queries
Msg-id:         BAY18-DAV14hd2stb2e0004dade@hotmail.com
In response to: Re: Caching of Queries ("Matt Clark" <matt@ymogen.net>)
Responses:      Re: Caching of Queries
List:           pgsql-performance
I'm not sure I understand your requirement fully. If the same request is repeatedly made with the same parameters, you could implement a proxy web server with a cron'd script to purge stale pages. If substantially the same data is being summarized, doing your own summary tables works; if they're accessed enough, they stay in memory. I interleaved some notes into your posting.

----- Original Message -----
From: "Josh Berkus" <josh@agliodbs.com>
To: "Postgresql Performance" <pgsql-performance@postgresql.org>
Sent: Friday, October 01, 2004 1:10 PM
Subject: Re: [PERFORM] Caching of Queries

> People:
>
> Transparent "query caching" is the "industry standard" for how these things
> are handled. However, Postgres' lack of this feature has made me consider
> other approaches, and I'm starting to wonder if the "standard" query caching
> -- where a materialized query result, or some reduction thereof, is cached in
> database memory -- isn't the best way to cache things. I'm going to
> abbreviate it "SQC" for the rest of this e-mail.
>
> Obviously, the draw of SQC is its transparency to developers. With it, the
> Java/Perl/PHP programmers and the DBA don't have to communicate at all -- you
> set it up, give it some RAM, and it "just works". As someone who frequently
> has to consult based on limited knowledge, I can understand the appeal.

My sense is that pg is currently unique among popular dbmses in having the majority of its applications be homegrown (a chicken / egg / advocacy issue - if I install a CMS, I'm not the DBA or the PHP programmer, and I don't want to change the code; we'll see more of this when native WinPg happens).

> However, one of the problems with SQC, aside from the ones already mentioned
> of stale data and/or cache-clearing, is that (at least in applications like
> MySQL's) it is indiscriminate and caches, at least briefly, unique queries as
> readily as common ones. Possibly Oracle's implementation is more
> sophisticated; I've not had an opportunity.

I'm not sure I agree here. Stale data and caching choices belong to the optimizer/buffer manager, and the implementation can decide whether to allow stale data. These are design choices involving development effort and decisions about where to spend server cycles and memory. All buffering strategies cache unique objects; I'm not sure why that is bad (though I sense you want control over the choices).

FWIW, this is my impression of other dbmses. In MySQL, a query cache can be given a size and enabled globally, per connection, or through statement hints that suggest caching a query's results. I don't believe the cached results can be used as common subexpressions (with the exception of MERGE table component results). The optimizer knows nothing about the cached results - SQL select statements are hashed, and a matching statement is simply answered from the cached statement/results.

In DB2 and Oracle, result sets are not cached. Both have rich sets of materialized view features (which match your requirements). They allow a materialized view to be synchronous with table updates or asynchronous. Synchronous is often an unrealistic option, so asynchronous materialized views are refreshed on a specified schedule. The optimizers allow "query rewrite" (in Oracle it is a session option), so one can connect to the database and specify that the optimizer may replace subexpressions with data from (possibly stale) materialized views. SQL Server 2K has more restrictive synchronous MVs, but I've never used them.
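To make the Oracle flavor concrete, here is a minimal sketch in Oracle SQL of an asynchronous MV refreshed every ½ hour with query rewrite allowed; the table and column names are hypothetical, not from Josh's application:

    -- Hypothetical asynchronous MV, fully refreshed every half hour
    -- (1/48 of a day). All object names are invented for illustration.
    CREATE MATERIALIZED VIEW order_summary_mv
        REFRESH COMPLETE
        START WITH SYSDATE NEXT SYSDATE + 1/48
        ENABLE QUERY REWRITE
    AS
    SELECT region,
           TRUNC(order_date) AS order_day,
           COUNT(*)          AS order_count,
           SUM(amount)       AS total_amount
    FROM   orders
    GROUP  BY region, TRUNC(order_date);

    -- The one application change: on connection, allow the optimizer to
    -- rewrite queries against possibly stale MV data.
    ALTER SESSION SET QUERY_REWRITE_ENABLED = TRUE;
    ALTER SESSION SET QUERY_REWRITE_INTEGRITY = STALE_TOLERATED;

With that in place, a query that aggregates orders by region and day can be answered from the MV without the application changing its SQL.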
So, for your example use in Oracle, you would need to define appropriate MVs with a ½ hour refresh frequency and hope that the planner would use them in your queries. The only change in the app is that on connection you would allow the use of asynchronous stale data.

You're suggesting an alternative involving identifying common, but expensive, subexpressions and generating MVs for them. This is a pretty sophisticated undertaking, and probably requires some theory research to determine if it's viable.

> The other half of that problem is that an entire query is cached, rather than
> just the relevant data to uniquely identify the request to the application.
> This is bad in two respects; one that the entire query needs to be parsed to
> see if a new query is materially equivalent, and that two materially
> different queries which could utilize overlapping ranges of the same
> underlying result set must instead cache their results separately, eating up
> yet more memory.

There are two separate issues: the cost of parsing/optimization, and the cost of retrieving results. Other dbmses hash the statement text. This is a good thing, and probably three orders of magnitude faster than parsing and optimization. (Oracle also has options to replace literals with parameters and match parse trees instead of text, expecting parse costs to be less than planning costs.) On a match, MySQL simply returns the result set; Oracle and DB2 attempt to rewrite queries to use the DBA-selected extracts.

The MySQL approach seems to be almost what you're describing: all it needs is the statement hash, the statement, and the result set. The rest of your wish list - identifying and caching data to satisfy multiple requests - is what query rewrite does, as long as you've created the appropriate MV.

> To explain what I'm talking about, let me give you a counter-example of
> another approach.
>
> I have a data-warehousing application with a web front-end. The data in the
> application is quite extensive and complex, and only a summary is presented
> to the public users -- but that summary is a query involving about 30 lines
> and 16 joins. This summary information is available in 3 slightly different
> forms. Further, the client has indicated that an up to 1/2 hour delay in
> data "freshness" is acceptable.

This sounds like a requirement for a summary table (if the data can be summarized appropriately) plus a regular refresh process.

> The first step is forcing that "materialized" view of the data into memory.
> Right now I'm working on a reliable way to do that without using Memcached,
> which won't install on our Solaris servers. Temporary tables have the
> annoying property of being per-connection, which doesn't work in a pool of 60
> connections.

I'm not clear on your desire to keep the data in memory. If it is because of the I/O cost of reading the summary table, the database buffers should be caching it. If you want to store calculated results, again - why not use a summary table? The cons of summary tables are the customization / denormalization of the data and the need to have programs use them instead of the source data - and you seem willing to do both.

> The second step, which I completed first due to the lack of technical
> obstacles, is to replace all queries against this data with calls to a
> Set-Returning Function (SRF). This allowed me to re-direct where the data
> was coming from -- presumably the same thing could be done through RULES, but
> it would have been considerably harder to implement.
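As a rough sketch of that SRF indirection in PostgreSQL (this is not Josh's actual code; the table, function, and column names are invented), the function simply hides which table the data comes from:

    -- Hypothetical summary table, rebuilt every half hour by a cron'd job:
    --   TRUNCATE order_summary;
    --   INSERT INTO order_summary SELECT ...;  -- the 30-line, 16-join query
    CREATE TABLE order_summary (
        region      text,
        order_count bigint,
        total       numeric
    );

    -- Callers SELECT from the function, not the table, so the source of
    -- the data can be redirected without touching the web tier.
    CREATE OR REPLACE FUNCTION get_order_summary(text)
    RETURNS SETOF order_summary AS '
        SELECT * FROM order_summary WHERE region = $1;
    ' LANGUAGE sql STABLE;

    -- Web-tier usage:
    SELECT * FROM get_order_summary('northeast');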
> The first thing the SRF does is check the criteria passed to it against a set
> of cached (in a table) criteria with that user's permission level which is <
> 1/2 hour old. If the same criteria are found, then the SRF is returned a
> set of row identifiers for the materialized view (MV), and looks up the rows
> in the MV and returns those to the web client.
>
> If no identical set of criteria are found, then the query is run to get a set
> of identifiers which are then cached, and the SRF returns the queried rows.
>
> Once I surmount the problem of storing all the caching information in
> protected memory, the advantages of this approach over SQC are several:

You are creating summary data on demand. I have had problems with this approach, mostly because it tends to cost more than doing the work in batch and it adds latency (unfortunately at peak load), so I tend to prefer periodic extract/summarize programs. In either approach, why don't you want pg to cache the data? The result also feels more like persisted object data than typical rdbms processing.

> 1) The materialized data is available in 3 different forms; a list, a detail
> view, and a spreadsheet. Each form has somewhat different columns and
> different rules about ordering, which would likely confuse an SQC planner.
> In this implementation, all 3 forms are able to share the same cache.

I'm not clear what the issue is here. Are you summarizing the data differently, or using some business rules to identify orthogonal queries?

> 2) The application is comparing only sets of unambiguous criteria rather than
> long queries which would need to be compared in planner form in order to
> determine query equivalence.
>
> 3) With the identifier sets, we are able to cache other information as well,
> such as a count of rows, further limiting the number of queries we must run.
>
> 4) This approach is ideally suited to the pagination and re-sorting common to
> a web result set. As only the identifiers are cached, the results can be
> re-sorted and broken into pages after the cache read, a fast, all-in-memory
> operation.
>
> In conclusion, what I'm saying is that while forms of transparent query
> caching (plan, materialized or whatever) may be desirable for other reasons,
> it's quite possible to achieve a superior level of "query caching" through
> tight integration with the front-end application.

This looks like you're building an object store to support a custom app that periodically, or on demand, pulls rdbms data mart data. The described use seems either static, suggesting summary tables, or dynamic, suggesting that you're mimicking some function of a periodically extracted OLAP cube.

> If people are interested in this, I'd love to see some suggestions on ways to
> force the materialized view into dedicated memory.

Can you identify your objections to summarizing the data and letting pg buffer it?

/Aaron
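For readers following the thread, a minimal sketch (hypothetical names and layout, not Josh's schema) of the criteria cache he describes above - a hashed criteria string plus the matching MV row identifiers, so later requests can re-sort and paginate in memory:

    -- One row per distinct (criteria, permission level) pair.
    CREATE TABLE criteria_cache (
        criteria_hash text,                      -- md5 of the normalized criteria
        perm_level    integer,                   -- requesting user's permission level
        cached_at     timestamptz DEFAULT now(),
        row_ids       integer[]                  -- identifiers of matching MV rows
    );

    -- Reuse a cached identifier set only if it is under half an hour old:
    SELECT row_ids
    FROM   criteria_cache
    WHERE  criteria_hash = md5('status=open&region=northeast')
      AND  perm_level = 2
      AND  cached_at > now() - interval '30 minutes';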