Re: Caching of Queries - Mailing list pgsql-performance

From Josh Berkus
Subject Re: Caching of Queries
Date
Msg-id 200410011010.40705.josh@agliodbs.com
Whole thread Raw
In response to Re: Caching of Queries  (Matt Clark <matt@ymogen.net>)
Responses Re: Caching of Queries  (Pierre-Frédéric Caillaud<lists@boutiquenumerique.com>)
Re: Caching of Queries  ("Jim C. Nasby" <decibel@decibel.org>)
List pgsql-performance
People:

Transparent "query caching" is the "industry standard" for how these things
are handled.   However, Postgres' lack of this feature has made me consider
other approaches, and I'm starting to wonder if the "standard" query caching
-- where a materialized query result, or some reduction thereof, is cached in
database memory -- isn't the best way to cache things.  I'm going to
abbreviate it "SQC" for the rest of this e-mail.

Obviously, the draw of SQC is its transparency to developers.   With it, the
Java/Perl/PHP programmers and the DBA don't have to communicate at all -- you
set it up, give it some RAM, and it "just works".   As someone who frequently
has to consult based on limited knowledge, I can understand the appeal.

However, one of the problems with SQC, aside from the ones already mentioned
of stale data and/or cache-clearing, is that (at least in applications like
MySQL's) it is indiscriminate and caches, at least breifly, unique queries as
readily as common ones.   Possibly Oracle's implementation is more
sophisticated; I've not had an opportunity.

The other half of that problem is that an entire query is cached, rather than
just the relevant data to uniquely identify the request to the application.
This is bad in two respects; one that the entire query needs to be parsed to
see if a new query is materially equivalent, and that two materially
different queries which could utilize overlapping ranges of the same
underlying result set must instead cache their results seperately, eating up
yet more memory.

To explain what I'm talking about, let me give you a counter-example of
another approach.

I have a data-warehousing application with a web front-end.    The data in the
application is quite extensive and complex, and only a summary is presented
to the public users -- but that summary is a query involving about 30 lines
and 16 joins.  This summary information is available in 3 slightly different
forms.  Further, the client has indicated that an up to 1/2 hour delay in
data "freshness" is acceptable.

The first step is forcing that "materialized" view of the data into memory.
Right now I'm working on a reliable way to do that without using Memcached,
which won't install on our Solaris servers.  Temporary tables have the
annoying property of being per-connection, which doesn't work in a pool of 60
connections.

The second step, which I completed first due to the lack of technical
obstacles, is to replace all queries against this data with calls to a
Set-Returning Function (SRF).   This allowed me to re-direct where the data
was coming from -- presumably the same thing could be done through RULES, but
it would have been considerably harder to implement.

The first thing the SRF does is check the criteria passed to it against a set
of cached (in a table) criteria with that user's permission level which is <
1/2 hour old.   If the same criteria are found, then the SRF is returned a
set of row identifiers for the materialized view (MV), and looks up the rows
in the MV and returns those to the web client.

If no identical set of criteria are found, then the query is run to get a set
of identifiers which are then cached, and the SRF returns the queried rows.

Once I surmount the problem of storing all the caching information in
protected memory, the advantages of this approach over SQC are several:

1) The materialized data is available in 3 different forms; a list, a detail
view, and a spreadsheet.  Each form as somewhat different columns and
different rules about ordering, which would likely confuse an SQC planner.
In this implementation, all 3 forms are able to share the same cache.

2) The application is comparing only sets of unambguous criteria rather than
long queries which would need to be compared in planner form in order to
determine query equivalence.

3) With the identifier sets, we are able to cache other information as well,
such as a count of rows, further limiting the number of queries we must run.

4) This approach is ideally suited to the pagination and re-sorting common to
a web result set.  As only the identifiers are cached, the results can be
re-sorted and broken in to pages after the cache read, a fast, all-in-memory
operation.

In conclusion, what I'm saying is that while forms of transparent query
caching (plan, materialized or whatever) may be desirable for other reasons,
it's quite possible to acheive a superior level of "query caching" through
tight integration with the front-end application.

If people are interested in this, I'd love to see some suggestions on ways to
force the materialized view into dedicated memory.

--
Josh Berkus
Aglio Database Solutions
San Francisco

pgsql-performance by date:

Previous
From: "Aaron Werman"
Date:
Subject: Re: Caching of Queries
Next
From: Patrick Hatcher
Date:
Subject: Slow update/insert process