Re: Caching of Queries - Mailing list pgsql-performance

From Aaron Werman
Subject Re: Caching of Queries
Msg-id BAY18-DAV14hd2stb2e0004dade@hotmail.com
In response to Re: Caching of Queries  ("Matt Clark" <matt@ymogen.net>)
I'm not sure I understand your requirement fully. If the same request is
repeatedly made with the same parameters, you could implement a proxy web
server with a cron'd script to purge stale pages. If it is substantially the
same data being summarized each time, maintaining your own summary tables
works; if they're accessed enough, they stay in memory. I've interleaved some
notes into your posting.

----- Original Message -----
From: "Josh Berkus" <josh@agliodbs.com>
To: "Postgresql Performance" <pgsql-performance@postgresql.org>
Sent: Friday, October 01, 2004 1:10 PM
Subject: Re: [PERFORM] Caching of Queries

> People:
>
> Transparent "query caching" is the "industry standard" for how these
> things are handled.   However, Postgres' lack of this feature has made me
> consider other approaches, and I'm starting to wonder if the "standard"
> query caching -- where a materialized query result, or some reduction
> thereof, is cached in database memory -- isn't the best way to cache
> things.  I'm going to abbreviate it "SQC" for the rest of this e-mail.
>
> Obviously, the draw of SQC is its transparency to developers.   With it,
> the Java/Perl/PHP programmers and the DBA don't have to communicate at
> all -- you set it up, give it some RAM, and it "just works".   As someone
> who frequently has to consult based on limited knowledge, I can
> understand the appeal.

My sense is that pg is currently unique among popular dbmses in that the
majority of its applications are homegrown (a chicken / egg / advocacy issue -
if I install a CMS, I'm not the DBA or the PHP programmer, and I don't want
to change the code; we'll see more of this when native WinPg happens).

>
> However, one of the problems with SQC, aside from the ones already
> mentioned of stale data and/or cache-clearing, is that (at least in
> applications like MySQL's) it is indiscriminate and caches, at least
> briefly, unique queries as readily as common ones.   Possibly Oracle's
> implementation is more sophisticated; I've not had an opportunity.

I'm not sure I agree here. Stale data and caching policy are
optimizer/buffer-manager choices, and an implementation can decide whether to
allow stale data. These are design decisions involving development effort and
trade-offs in where to spend server cycles and memory. All buffering schemes
cache unique objects; I'm not sure why this is bad (though I sense you want
control over the choices). FWIW, what follows is my impression of other
dbmses.

In MySQL, a global cache can be configured with a size, and result caching
can be enabled globally, per connection, or suggested through statement hints
in individual queries. I don't believe these cached results can be used as
common subexpressions (with the exception of MERGE table component results).
The optimizer knows nothing about the cached results - SQL select statements
are hashed and, on a match, replaced by the cached statement/results.
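
For concreteness, the knobs look roughly like this in MySQL 4.x (the table
name is invented for illustration):

    -- size the cache at 32MB; only cache statements that ask for it
    SET GLOBAL query_cache_size = 33554432;
    SET GLOBAL query_cache_type = DEMAND;

    -- per-statement hints to opt in or out of the cache
    SELECT SQL_CACHE count(*) FROM orders WHERE status = 'open';
    SELECT SQL_NO_CACHE * FROM orders WHERE id = 42;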

In DB2 and Oracle result sets are not cached. They have rich sets of
materialized view features (that match your requirements). They allow a
materialized view to be synchronous with table updates or asynchronous.
Synchronous is often an unrealistic option, and asynchronous materialized
views are refreshed at a specified schedule. The optimizers allow "query
rewrite" (in Oracle it is a session option) so one can connect to the
database and specify that the optimizer is allowed to replace subexpressions
with data from (possibly stale) materialized views. SQL Server 2K has more
restrictive synchronous MVs, but I've never used them.

So, for your example use case in Oracle, you would need to define appropriate
MVs with a ½ hour refresh frequency and hope that the planner uses them in
your queries. The only change in the app is that on connection you would
allow the use of asynchronous (stale) data.
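
For illustration, such an MV might look like this in Oracle (schema and
names invented):

    CREATE MATERIALIZED VIEW order_summary_mv
      BUILD IMMEDIATE
      REFRESH COMPLETE START WITH SYSDATE NEXT SYSDATE + 1/48  -- every 1/2 hour
      ENABLE QUERY REWRITE
    AS
    SELECT region, trunc(order_date) day, count(*) n, sum(amount) total
    FROM orders
    GROUP BY region, trunc(order_date);

    -- on connection, allow the optimizer to rewrite against stale MVs
    ALTER SESSION SET QUERY_REWRITE_ENABLED = TRUE;
    ALTER SESSION SET QUERY_REWRITE_INTEGRITY = STALE_TOLERATED;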

You're suggesting an alternative involving identifying common, but
expensive, subexpressions and generating MVs for them. This is a pretty
sophisticated undertaking, and probably requires some theory research to
determine if it's viable.


>
> The other half of that problem is that an entire query is cached, rather
> than just the relevant data to uniquely identify the request to the
> application.   This is bad in two respects; one that the entire query
> needs to be parsed to see if a new query is materially equivalent, and
> that two materially different queries which could utilize overlapping
> ranges of the same underlying result set must instead cache their results
> separately, eating up yet more memory.

There are two separate issues: the cost of parse/optimization and the cost
of results retrieval. Other dbmses hash statement text. This is a good
thing, and probably 3 orders of magnitude faster than parse and
optimization. (Oracle also has options to replace literals with parameters
and match parse trees instead of text, expecting parse costs to be less than
planning costs.) On a match, MySQL simply returns the result set. Oracle and
DB2 attempt to rewrite queries to use the DBA-selected extracts. The MySQL
approach seems to be almost what you're describing: all it needs is the
statement hash, statement, and result set. The rest of your wish list -
identifying and caching data to satisfy multiple requests - is what query
rewrite does, as long as you've created the appropriate MV.
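
As a footnote, the literal-replacement behavior I mentioned is just a
session setting in Oracle (table name invented):

    ALTER SESSION SET CURSOR_SHARING = FORCE;
    -- these now hash to one shared cursor, roughly
    -- SELECT * FROM orders WHERE id = :"SYS_B_0"
    SELECT * FROM orders WHERE id = 42;
    SELECT * FROM orders WHERE id = 99;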

>
> To explain what I'm talking about, let me give you a counter-example of
> another approach.
>
> I have a data-warehousing application with a web front-end.    The data
> in the application is quite extensive and complex, and only a summary is
> presented to the public users -- but that summary is a query involving
> about 30 lines and 16 joins.  This summary information is available in 3
> slightly different forms.  Further, the client has indicated that an up
> to 1/2 hour delay in data "freshness" is acceptable.

This sounds like a requirement for a summary table (if the data can be
summarized appropriately) plus a regular refresh process.
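
Concretely, that can be as simple as the following sketch (schema invented;
the refresh block would be run from cron every 1/2 hour):

    -- one-time build
    CREATE TABLE order_summary AS
    SELECT region, date_trunc('day', order_date) AS day,
           count(*) AS n, sum(amount) AS total
    FROM orders
    GROUP BY region, date_trunc('day', order_date);

    -- cron'd refresh; under MVCC, readers see the old rows until commit
    BEGIN;
    DELETE FROM order_summary;
    INSERT INTO order_summary
    SELECT region, date_trunc('day', order_date), count(*), sum(amount)
    FROM orders
    GROUP BY region, date_trunc('day', order_date);
    COMMIT;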

>
> The first step is forcing that "materialized" view of the data into
> memory.  Right now I'm working on a reliable way to do that without using
> Memcached, which won't install on our Solaris servers.  Temporary tables
> have the annoying property of being per-connection, which doesn't work in
> a pool of 60 connections.

I'm not clear on your desire to keep the data in memory. If it is because of
the I/O cost of reading the summary table, the database buffers should be
caching it. If you want to store calculated results, again - why not use a
summary table? The cons of summary tables are the customization /
denormalization of the data and the need to have programs use them instead
of the source data - you seem willing to do both.


>
> The second step, which I completed first due to the lack of technical
> obstacles, is to replace all queries against this data with calls to a
> Set-Returning Function (SRF).   This allowed me to re-direct where the
> data was coming from -- presumably the same thing could be done through
> RULES, but it would have been considerably harder to implement.
>
> The first thing the SRF does is check the criteria passed to it against a
> set of cached (in a table) criteria with that user's permission level
> which is < 1/2 hour old.   If the same criteria are found, then the SRF
> is returned a set of row identifiers for the materialized view (MV), and
> looks up the rows in the MV and returns those to the web client.
>
> If no identical set of criteria are found, then the query is run to get a
> set of identifiers which are then cached, and the SRF returns the queried
> rows.
>
> Once I surmount the problem of storing all the caching information in
> protected memory, the advantages of this approach over SQC are several:

You are creating summary data on demand. I have had problems with this
approach, mostly because it tends to cost more than doing it in batch and
adds latency, unfortunately at peak load - so I tend to prefer periodic
extract/summarize programs. In either approach, why don't you want pg to
cache the data? The result also feels more like persisted object data than
typical rdbms processing.
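
If I've read you right, the lookup half of that SRF reduces to something
like the sketch below - every name in it is invented, and the freshness
window and permission column are only stand-ins for whatever you actually
use:

    CREATE TABLE criteria_cache (
        criteria   text,
        perm_level integer,
        row_id     integer,
        cached_at  timestamptz DEFAULT now()
    );

    -- returns cached MV row ids if fresh; an empty result means the
    -- caller runs the real query and inserts the ids it finds
    CREATE FUNCTION cached_ids(text, integer) RETURNS SETOF integer AS '
        SELECT row_id
        FROM criteria_cache
        WHERE criteria = $1
          AND perm_level = $2
          AND cached_at > now() - interval ''30 minutes'';
    ' LANGUAGE sql;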

>
> 1) The materialized data is available in 3 different forms; a list, a
> detail view, and a spreadsheet.  Each form has somewhat different columns
> and different rules about ordering, which would likely confuse an SQC
> planner.  In this implementation, all 3 forms are able to share the same
> cache.

I'm not clear what the issue here is. Are you summarizing data differently
or using some business rules to identify orthogonal queries?


>
> 2) The application is comparing only sets of unambiguous criteria rather
> than long queries which would need to be compared in planner form in
> order to determine query equivalence.

>
> 3) With the identifier sets, we are able to cache other information as
> well, such as a count of rows, further limiting the number of queries we
> must run.
>
> 4) This approach is ideally suited to the pagination and re-sorting
> common to a web result set.  As only the identifiers are cached, the
> results can be re-sorted and broken into pages after the cache read, a
> fast, all-in-memory operation.
>
> In conclusion, what I'm saying is that while forms of transparent query
> caching (plan, materialized or whatever) may be desirable for other
> reasons, it's quite possible to achieve a superior level of "query
> caching" through tight integration with the front-end application.
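
On point 4, agreed - paging from a set of cached identifiers is a cheap
operation. Reusing the invented names from my sketch above, a re-sorted
page fetch is just:

    SELECT m.*
    FROM summary_mv m
    WHERE m.id IN (SELECT * FROM cached_ids('status=open', 1))
    ORDER BY m.day
    LIMIT 20 OFFSET 40;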

This looks like you're building an object store to support a custom app that
periodically, or on demand, pulls rdbms data mart data. The described use
seems either static, suggesting summary tables, or dynamic, suggesting that
you're mimicking some function of a periodically extracted OLAP cube.

>
> If people are interested in this, I'd love to see some suggestions on
> ways to force the materialized view into dedicated memory.

Can you identify your objections to summarizing the data and letting pg
buffer it?

/Aaron
