Thread: Database Caching

Database Caching

From
"Greg Sabino Mullane"
Date:


I had to make a relatively long drive yesterday, so I had lots of free time 
to do some thinking...and my thoughts were turning to caching and databases. 
The following is what I came up with: forgive me if it seems to be just an 
obvious ramble...

Why does a database need caching?

Normally, when one thinks of a database (or to be precise, a RDBMS) the 
ACID acronym comes up. This is concerned with having a stable database that 
can reliably be used by many users at the same time. Caching a query is 
unintuitive because it involves sharing information from transactions that 
may be separated by a great amount of time and/or by different users. 
However, from the standpoint of the database server, caching increases 
efficiency enormously. If 800 users all make the same query, then caching 
can help the database server backend (hereafter simply "database") to 
save part or all of the work it performs so it doesn't have to repeat the 
entire sequence of steps 800 times.

What is caching?

Caching basically means that we want to save frequently-used information 
into an easy to get to area. Usually, this means storing it into memory. 
Caching has three main goals: reducing disk access, reducing computation 
(i.e. CPU utilization), and speeding up response time as measured by how long 
it takes a user to see a result. It does all this at the expense of RAM, 
and the tradeoff is almost always worth it.

In a database, there are three basic types of caching: query results, 
query plans, and relations.

The first, query result caching, simply means that we store into memory 
the exact output of a SELECT query for the next time that somebody performs 
that exact same SELECT query. Thus, if 800 people do a "SELECT * FROM foo", 
the database runs it for the first person, saves the results, and simply 
reads the cache for the next 799 requests. This saves the database from doing 
any disk access, practically removes CPU usage, and speeds up the query.
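As a rough illustration (a Python sketch with invented names; nothing here is actual PostgreSQL code), the fast path of such a cache might look like:

```python
# Hypothetical sketch of a query-result cache: the first request runs the
# query, every identical request afterwards is served from memory.
class ResultCache:
    def __init__(self):
        self.store = {}    # exact query text -> cached result rows
        self.hits = 0
        self.misses = 0

    def select(self, query, run_query):
        if query in self.store:        # the next 799 requests land here
            self.hits += 1
            return self.store[query]
        self.misses += 1
        rows = run_query(query)        # only the first request does real work
        self.store[query] = rows
        return rows

cache = ResultCache()
rows = [("vanilla",), ("chocolate",)]
for _ in range(800):
    cache.select("SELECT * FROM foo", lambda q: rows)
```

After the loop, the underlying query has run once and been served from cache 799 times.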

The second, query plan caching, involves saving the results of the optimizer, 
which is responsible for figuring out exactly "how" the database is going to 
fetch the requested data. This type of caching usually involves a "prepared" 
query, which has almost all of the information needed to run the query with 
the exception of one or more "placeholders" (spots that are populated with 
variables at a later time). Non-prepared statements can be cached this 
way as well. Thus, if someone prepares the query "SELECT flavor FROM 
foo WHERE size=?", and then executes it by sending in 300 different values 
for "size", the prepared statement is run through the optimizer, the 
resulting path is stored into the query plan cache, and the stored path is 
used for the 300 execute requests. Because the path is already known, the 
optimizer does not need to be called, which saves the database CPU and time.

The third, relation caching, simply involves putting the entire relation 
(usually a table or index) into memory so that it can be read quickly. 
This saves disk access, which basically means that it saves time. (This type 
of caching also can occur at the OS level, which caches files, but that will 
not be discussed here).

Those are the three basic types of caching, ways of implementing each are 
discussed below. Each one should complement the other, and a query may be 
able to use one, two, or all three of the caches.

I. Query result caching:

A query result cache is only used for SELECT queries that involve a 
relation (i.e. not for "SELECT version"). Each cache entry has the following 
fields: the query itself, the actual results, a status, an access time, an 
access number, and a list of all included columns. (The column list actually 
tells as much information as needed to uniquely identify it, i.e. schema, 
database, table, and column). The status is merely an indicator of whether or 
not this cached query is valid. It may not be, because it may be invalidated 
for a user within a transaction but still be of use to others. 

When a select query is processed, it is first parsed apart into a basic common 
form, stripping whitespace, standardizing case, etc., in order to facilitate 
an accurate match. Note that no other pre-processing is really required, 
since we are only interested in exact matches that produce the exact same 
output. An advanced version of this would ideally be able to use the cached 
output of "SELECT bar,baz FROM foo" when it receives the query "SELECT 
baz,bar FROM foo", but that will require some advanced parsing. Possible, 
but probably not something to attempt in the first iteration of a query 
caching function. :) If there *is* a match (via a simple strcmp at first), 
and the status is marked as "valid", then the database simply uses the 
stored output, updates the access time and count, and exits. This should be 
extremely fast, as no disk access is needed, and almost no CPU. The 
complexity of the query will not matter either: a simple query will run just 
as fast as something with 12 sorts and 28 joins.
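A toy version of that normalization step (Python, purely illustrative; note it naively collapses whitespace even inside string literals, which a real implementation must not do):

```python
import re

def normalize(query):
    """Collapse whitespace and standardize case so trivially different
    spellings of the same query map to one cache key.  Case inside
    single-quoted literals is preserved; a real version would also need
    to protect the whitespace inside them."""
    q = re.sub(r"\s+", " ", query.strip())
    parts = q.split("'")
    parts[::2] = [p.upper() for p in parts[::2]]   # uppercase outside literals
    return "'".join(parts)
```

Column order still matters: "SELECT bar,baz FROM foo" and "SELECT baz,bar FROM foo" normalize to different keys, exactly as described above.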

If a query is *not* already in the cache, then after the results are found 
and delivered to the user, the database will try and store them for the 
next appearance of that query. First, the size of the cache will be compared 
to the size of the query+output, to see if there is room for it. If there 
is, the query will be saved, with a status of valid, a time of 'now', a count 
of 1, a list of all affected columns found by parsing the query, and the total 
size of the query+output. If there is no room, then it will try to delete one 
or more to make room. Deleting can be done based on the oldest access time, 
smallest access count, or size of the query+output. Some balance of the first 
two would probably work best, with the access time being the most important. 
Everything will be configurable, of course.
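One possible weighting of those deletion rules (a sketch; the weights and field names are invented, and a real implementation would expose them as the configurable knobs mentioned above):

```python
import time

def eviction_score(entry, now):
    """Lower score = better candidate for deletion.  Old entries score
    low regardless of count; recent, frequently-used entries score high."""
    age = now - entry["atime"]           # seconds since last access
    return entry["count"] / (1.0 + age)

def make_room(cache, needed, capacity):
    """Evict lowest-scoring entries until `needed` bytes of query+output fit."""
    now = time.time()
    used = sum(e["size"] for e in cache.values())
    for key in sorted(cache, key=lambda k: eviction_score(cache[k], now)):
        if used + needed <= capacity:
            break
        used -= cache[key]["size"]
        del cache[key]
    return used + needed <= capacity
```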

Whenever a table is changed, the cache must be checked as well. A list of 
all columns that were actually changed is computed and compared against 
the list of columns for each query. At the first sign of a match, the 
query is marked as "invalid." This should happen before the changes are made 
to the table itself. We do not delete the query immediately since this may 
be inside of a transaction, and subject to rollback. However, we do need 
to mark it as invalid for the current user inside the current transaction: 
thus, the status flag. When the transaction is committed, all queries that have 
an "invalid" flag are deleted, then the tables are changed. Since the only 
time a query can be flagged as "invalid" is inside your own transaction, 
the deletion can be done very quickly.
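The mark-then-delete dance could be sketched like this (a single-session Python simplification; the per-user visibility of the "invalid" flag described above is ignored here):

```python
# Sketch of column-based invalidation with transaction semantics:
# entries are only *marked* invalid until COMMIT, so a ROLLBACK can
# cheaply revalidate them.
class CacheEntry:
    def __init__(self, columns):
        self.columns = set(columns)    # columns this query depends on
        self.valid = True

def invalidate_for_change(cache, changed_columns, pending):
    changed = set(changed_columns)
    for key, entry in cache.items():
        if entry.valid and entry.columns & changed:
            entry.valid = False        # invalid inside this transaction only
            pending.append(key)

def commit(cache, pending):
    for key in pending:
        cache.pop(key, None)           # change is permanent: drop the entry
    pending.clear()

def rollback(cache, pending):
    for key in pending:
        if key in cache:
            cache[key].valid = True    # the change never happened
    pending.clear()
```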


II. Query plan caching

If a query is not cached, then it "falls through" to the next level of 
caching, the query plan. This can either be automatic or strictly on a 
user-requested basis (i.e. through the prepare-execute paradigm). The latter 
is probably better, but it also would not hurt much to store non-explicitly 
prepared queries in this cache as long as there is room. This cache has a 
field for the query itself, the plan to be followed (i.e. scan this table, 
that index, sort the results, then group them), the columns used, the access 
time, the access count, and the total size. It may also want a simple flag 
of "prepared or non-prepared", where prepared indicates an explicitly 
prepared statement that has placeholders for future values. A good optimizer 
will actually change the plan based on the values plugged in to the prepared 
queries, so that information should become a part of the query itself as 
needed, and multiple queries may exist to handle different inputs. In 
general, most of the inputs will be similar enough to use the same path (e.g. 
"SELECT flavor FROM foo WHERE size=?" will most usually result in a simple 
numeric value for the executes). If a match *is* found, then the database 
can use the stored path, and not have to bother calling up the optimizer 
to figure it out. It then updates the access time, the access count, and
continues as normal. If a match was *not* found, then it might possibly 
want to be cached. Certainly, explicit prepares should always be cached. 
Non-explicitly prepared queries (those without placeholders) can also be 
cached. In theory, some of this will also be in the result cache, so that 
should be checked as well: if it is there, there is no reason to put it here. Prepared 
queries should always have priority over non-prepared, and the rest of the 
rules above for the result query should also apply, with a caveat that things 
that would affect the output of the optimizer (e.g. vacuuming) should also 
be taken into account when deleting entries.
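The prepare/execute payoff can be sketched as follows (Python; `optimize` is a stand-in for the real planner, assumed to be expensive):

```python
class PlanCache:
    """Sketch of a query-plan cache: the optimizer runs once per prepared
    statement, and every EXECUTE reuses the stored plan."""
    def __init__(self, optimize):
        self.optimize = optimize       # the (expensive) planner function
        self.plans = {}                # statement text -> stored plan
        self.planner_calls = 0

    def prepare(self, stmt):
        if stmt not in self.plans:
            self.planner_calls += 1
            self.plans[stmt] = self.optimize(stmt)
        return self.plans[stmt]

    def execute(self, stmt, value):
        plan = self.prepare(stmt)      # cached after the first call
        return (plan, value)           # a real backend would run the plan here

pc = PlanCache(lambda s: "IndexScan(%s)" % s)
for size in range(300):
    pc.execute("SELECT flavor FROM foo WHERE size=?", size)
```

Three hundred executes, one planner call.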


III. Relation caching

The final cache is the relation itself, and simply involves putting the entire 
relation into memory. This cache has a field for the name of the relation, 
the table info itself, the type (indexes should ideally be cached more than 
tables, for example), the access time, and the access number. Loading could 
be done automatically, but most likely should be done according to a flag 
on the table itself or as an explicit command by the user.


Notes:

The "strcmp" used may seem rather crude, as it misses all but the exact 
same query, but it does cover most of the everyday cases. Queries are 
usually issued through an application that keeps them in the same format, 
time after time, so the queries are very often exactly identical. A better 
parser would help, of course, but it would get rather complicated quickly.
Two quick examples: a web page that is read from a database is a query that 
is called many times with exactly the same syntax; a user doing a "refresh" 
to constantly check if something has changed since they last looked.

Sometimes a query may jump back to a previous type of cache, especially 
for things like subselects. The entire subselect query may not match, 
but the inner query should also be checked against the query result cache.

Each cache should have some configurable parameters, including the size in 
RAM, the maximum number of entries, and rules for adding and deleting.
They should also be directly viewable through a system table, so a DBA 
can quickly see exactly which queries are being cached and how often 
they are being used. There should be a command to quickly flush the cache, 
remove "old" entries, and to populate the query plan cache via a prepare 
statement. It should also be possible to do table changes without stopping 
to check the cache: perhaps flushing the cache and setting a global 
"cache is empty" flag would suffice.

Another problem is the prepare and execute: you are not always guaranteed 
to get a cached prepare if you do an execute, as it may have expired or 
there may simply be no more room. Those prepared statements inside a 
transaction should probably be flagged as "non-deletable" until the 
transaction is ended.

Storing the results of an execute in the query result cache is another 
problem. When a prepare is made, the database returns a link to that 
exact prepared statement in cache, so that all the client has to say is 
"run the query at 0x4AB3 with the value 12". Ideally, the database 
should be able to check these against the query result cache as well. It 
can do this by reconstructing the resulting query (by plugging the value into 
the prepared statement) or it can store the execute request as a type of 
query itself; instead of "SELECT baz FROM bar WHERE size=12" it would 
store "p0x4aB3:12".
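The second option amounts to building the cache key from the statement handle and the bound values, e.g. (Python sketch; the handle format is invented):

```python
def execute_cache_key(stmt_handle, params):
    """Key a result-cache entry for an EXECUTE without reconstructing the
    full SQL text: the prepared-statement handle plus the bound values."""
    return "p%s:%s" % (stmt_handle, ":".join(str(p) for p in params))
```

So EXECUTE on statement 0x4AB3 with the value 12 is cached under "p0x4AB3:12", and any later execute of the same statement with the same value is a cache hit.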


The result cache would probably be the easiest to implement, and also gives 
the most "bang for the buck." The path cache may be a bit harder, but is 
a very necessary feature. I don't know about the relation caching: it looks 
to be fairly easy, and I don't trust that, so I am guessing it is actually 
quite difficult.

Greg Sabino Mullane  greg@turnstep.com
PGP Key: 0x14964AC8 200202281132





Re: Database Caching

From
Tom Lane
Date:
"Greg Sabino Mullane" <greg@turnstep.com> writes:
> III. Relation caching

> The final cache is the relation itself, and simply involves putting the entire 
> relation into memory. This cache has a field for the name of the relation, 
> the table info itself, the type (indexes should ideally be cached more than 
> tables, for example), the access time, and the access number. Loading could 
> be done automatically, but most likely should be done according to a flag 
> on the table itself or as an explicit command by the user.

This would be a complete waste of time; the buffer cache (both Postgres'
own, and the kernel's disk cache) serves the purpose already.

As I've commented before, I have deep misgivings about the idea of a
query-result cache, too.
        regards, tom lane


Re: Database Caching

From
"Dann Corbit"
Date:
-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: Thursday, February 28, 2002 3:27 PM
To: Greg Sabino Mullane
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Database Caching


"Greg Sabino Mullane" <greg@turnstep.com> writes:
> III. Relation caching

> The final cache is the relation itself, and simply involves putting the entire
> relation into memory. This cache has a field for the name of the relation,
> the table info itself, the type (indexes should ideally be cached more than
> tables, for example), the access time, and the access number. Loading could
> be done automatically, but most likely should be done according to a flag
> on the table itself or as an explicit command by the user.

This would be a complete waste of time; the buffer cache (both Postgres'
own, and the kernel's disk cache) serves the purpose already.

As I've commented before, I have deep misgivings about the idea of a
query-result cache, too.
>>
I certainly agree with Tom on both counts.

Think of the extra machinery that would be needed to retain full
relational integrity with a result cache...

Then think of how easy it is to write your own application that caches
results if that is what you are after and you know (for some reason)
that it won't matter if the database gets updated.

I don't see how result caching can be a win, since it can be done when
needed anyway, without adding complexity to the database engine.  Just
have the application cache the result set.  Certainly a web server could
do this, if needed.

If there were a way to mark a database as read only (and this could not
be changed unless the entire database is shut down and restarted in
read/write mode) then there might be some utility to result set cache.
Otherwise, I think it will be wasted effort.  It might be worthwhile to
do the same for individual tables (with the same sort of restrictions).
But think of all the effort that would be needed to do this properly,
and what sort of payback would be received from it?

Again, the same goals can easily be accomplished without having to
perform major surgery on the database system.  I suspect that there is
some logical reason that Oracle/Sybase/IBM/Microsoft have not bothered
with it.

I am ready to be convinced otherwise if I see a logical reason for it.
But with the current evidence, I don't see any compelling reason to put
effort in that direction.
<<


Re: Database Caching

From
Karel Zak
Date:
On Thu, Feb 28, 2002 at 10:23:46PM -0000, Greg Sabino Mullane wrote:

> The first, query result caching, simply means that we store into memory 
> the exact output of a SELECT query for the next time that somebody performs 
> that exact same SELECT query. Thus, if 800 people do a "SELECT * FROM foo", 
> the database runs it for the first person, saves the results, and simply 
> reads the cache for the next 799 requests. This saves the database from doing 
> any disk access, practically removes CPU usage, and speeds up the query.
How expensive is it to keep the cache in a consistent state? Maintaining the
result cache may cost as much as reading the raw data from the disk/buffer
cache.

The result cache may speed up SELECT, but it probably slows down
UPDATE/INSERT.

> The second, query plan caching, involves saving the results of the optimizer, 
> which is responsible for figuring out exactly "how" the databse is going to 
> fetch the requested data. This type of caching usually involves a "prepared" 
> query, which has almost all of the information needed to run the query with 
> the exception of one or more "placeholders" (spots that are populated with 
> variables at a later time). The query could also involve non-prepared 
> statments as well. Thus, if someone prepares the query "SELECT flavor FROM 
> foo WHERE size=?", and then executes it by sending in 300 different values 
> for "size", the prepared statement is run through the optimizer, the r
> esulting path is stored into the query plan cache, and the stored path is 
> used for the 300 execute requests. Because the path is already known, the 
> optimizer does not need to be called, which saves the database CPU and time.
IMHO a query plan cache maintained by the user's PREPARE/EXECUTE/DEALLOCATE
statements is sufficient, because the user knows well what changes in the DB
schema (drop function, relation...).

A "transparent" query cache for each query that goes into the backend will
(IMHO) be too expensive, because you must check/analyze each query. I think
it is more effective to keep fragments of the query in memory, for example
operator and relation descriptions -- PostgreSQL already has a cache like
this (syscache).
 

> The third, relation caching, simply involves putting the entire relation 
> (usually a table or index) into memory so that it can be read quickly. 
Already done by buffers :-)
       Karel

-- 
Karel Zak  <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz


Re: Database Caching

From
mlw
Date:
Tom Lane wrote:
> 
> "Greg Sabino Mullane" <greg@turnstep.com> writes:
> > III. Relation caching
> 
> > The final cache is the relation itself, and simply involves putting the entire
> > relation into memory. This cache has a field for the name of the relation,
> > the table info itself, the type (indexes should ideally be cached more than
> > tables, for example), the access time, and the access number. Loading could
> > be done automatically, but most likely should be done according to a flag
> > on the table itself or as an explicit command by the user.
> 
> This would be a complete waste of time; the buffer cache (both Postgres'
> own, and the kernel's disk cache) serves the purpose already.
> 
> As I've commented before, I have deep misgivings about the idea of a
> query-result cache, too.

I appreciate your position, and I can see your point, however, where caching is
a huge win is when you have a database which is largely static, something like
the back end of a web server.

The content is updated regularly, but not constantly. There is a window, say 4
hours, where the entire database is static, and all it is doing is running the
same 100 queries, over and over again.

My previous company, www.dmn.com, has a music database system. We logged all
the backend info; most of the queries were duplicated many times. This can be
explained by multiple users interested in the same thing or the same user
hitting "next page".

If you could cache the "next page" or similar hit results, you could really
increase the throughput and capacity of a website.


Re: Database Caching

From
Jan Wieck
Date:
Tom Lane wrote:
> "Greg Sabino Mullane" <greg@turnstep.com> writes:
> > III. Relation caching
>
> > The final cache is the relation itself, and simply involves putting the entire
> > relation into memory. This cache has a field for the name of the relation,
> > the table info itself, the type (indexes should ideally be cached more than
> > tables, for example), the access time, and the access number. Loading could
> > be done automatically, but most likely should be done according to a flag
> > on the table itself or as an explicit command by the user.
>
> This would be a complete waste of time; the buffer cache (both Postgres'
> own, and the kernel's disk cache) serves the purpose already.
>
> As I've commented before, I have deep misgivings about the idea of a
> query-result cache, too.
    I wonder how this sort of query result caching could work in
    our MVCC and visibility world at all.  Multiple concurrent
    running transactions see different snapshots of the table,
    hence different result sets for exactly one and the same
    querystring at the same time ... er ... yeah, one cache set
    per query/snapshot combo, great!

    To really gain some speed with this sort of query cache, we'd
    have to adopt the #1 MySQL design rule "speed over precision"
    and ignore MVCC for query-cached relations, or what?
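Jan's objection can be made concrete with a sketch (Python, invented names): under MVCC a correct result cache would have to include the snapshot in the key, so concurrent transactions with different snapshots cannot share entries.

```python
# Per-snapshot result caching: one entry per (query, snapshot) combo.
cache = {}

def cached_select(query, snapshot_id, run_query):
    key = (query, snapshot_id)         # the snapshot must be part of the key
    if key not in cache:
        cache[key] = run_query(query, snapshot_id)
    return cache[key]

# Two concurrent transactions, same query text, different snapshots:
cached_select("SELECT * FROM foo", 100, lambda q, s: ["old rows"])
cached_select("SELECT * FROM foo", 101, lambda q, s: ["new rows"])
```

One query string, two cache entries: much of the sharing that made the cache attractive is gone.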
 


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



_________________________________________________________
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com



Re: Database Caching

From
"Marc G. Fournier"
Date:
On Fri, 1 Mar 2002, Jan Wieck wrote:

> Tom Lane wrote:
> > "Greg Sabino Mullane" <greg@turnstep.com> writes:
> > > III. Relation caching
> >
> > > The final cache is the relation itself, and simply involves putting the entire
> > > relation into memory. This cache has a field for the name of the relation,
> > > the table info itself, the type (indexes should ideally be cached more than
> > > tables, for example), the access time, and the acccess number. Loading could
> > > be done automatically, but most likely should be done according to a flag
> > > on the table itself or as an explicit command by the user.
> >
> > This would be a complete waste of time; the buffer cache (both Postgres'
> > own, and the kernel's disk cache) serves the purpose already.
> >
> > As I've commented before, I have deep misgivings about the idea of a
> > query-result cache, too.
>
>     I  wonder how this sort of query result caching could work in
>     our MVCC and visibility world  at  all.  Multiple  concurrent
>     running  transactions  see  different snapshots of the table,
>     hence different result sets for  exactly  one  and  the  same
>     querystring  at the same time ... er ...  yeah, one cache set
>     per query/snapshot combo, great!
>
>     To really gain some speed with this sort of query cache, we'd
>     have to adopt the #1 MySQL design rule "speed over precision"
>     and ignore MVCC for query-cached relations, or what?

Actually, you are missing, I think, as is everyone, the 'semi-static'
database ... you know?  the one where data gets dumped to it by a script
every 5 minutes, but between dumps, there are hundreds of queries per
second/minute between the updates that are the same query repeated each
time ...

As soon as there is *any* change to the data set, the query cache should
be marked dirty and reloaded ... mark it dirty on any update, delete or
insert ...

So, if I have 1000 *pure* SELECTs, the cache is fine ... as soon as one
U/I/D pops up, it's invalidated ...





Re: Database Caching

From
Tom Lane
Date:
mlw <markw@mohawksoft.com> writes:
> My previous company, www.dmn.com, has a music database system. We logged all
> the backend info; most of the queries were duplicated many times. This can be
> explained by multiple users interested in the same thing or the same user
> hitting "next page"

> If you could cache the "next page" or similar hit results, you could really
> increase the throughput and capacity of a website.

Sure, but the most appropriate place to do that sort of thing is in the
application (in this case, probably a cgi/php-ish layer).  Only the
application can know what its requirements are.  In the case you
describe, it'd be perfectly okay for a "stale" cache result to be
delivered that's a few minutes out of date.  Maybe a few hours out of
date would be good enough too, or maybe not.  But if we do this at the
database level then we have to make sure it won't break *any*
applications, and that means the most conservative validity assumptions.
(Thus all the angst about how to invalidate cache entries on-the-fly.)

Likewise, the application has a much better handle than the database on
the issue of which query results are likely to be worth caching.

I think that reports of "we sped up this application X times by caching
query results on the client side" are interesting, but they are not good
guides to what would happen if we tried to put a query-result cache into
the database.
        regards, tom lane


Re: Database Caching

From
Jan Wieck
Date:
Marc G. Fournier wrote:
> On Fri, 1 Mar 2002, Jan Wieck wrote:
>
> > Tom Lane wrote:
> > > "Greg Sabino Mullane" <greg@turnstep.com> writes:
> > > > III. Relation caching
> > >
> > > > The final cache is the relation itself, and simply involves putting the entire
> > > > relation into memory. This cache has a field for the name of the relation,
> > > > the table info itself, the type (indexes should ideally be cached more than
> > > > tables, for example), the access time, and the acccess number. Loading could
> > > > be done automatically, but most likely should be done according to a flag
> > > > on the table itself or as an explicit command by the user.
> > >
> > > This would be a complete waste of time; the buffer cache (both Postgres'
> > > own, and the kernel's disk cache) serves the purpose already.
> > >
> > > As I've commented before, I have deep misgivings about the idea of a
> > > query-result cache, too.
> >
> >     I  wonder how this sort of query result caching could work in
> >     our MVCC and visibility world  at  all.  Multiple  concurrent
> >     running  transactions  see  different snapshots of the table,
> >     hence different result sets for  exactly  one  and  the  same
> >     querystring  at the same time ... er ...  yeah, one cache set
> >     per query/snapshot combo, great!
> >
> >     To really gain some speed with this sort of query cache, we'd
> >     have to adopt the #1 MySQL design rule "speed over precision"
> >     and ignore MVCC for query-cached relations, or what?
>
> Actually, you are missing, I think, as is everyone, the 'semi-static'
> database ... you know?  the one where data gets dumped to it by a script
> every 5 minutes, but between dumps, there are hundreds of queries per
> second/minute between the updates that are the same query repeated each
> time ...
>
> As soon as there is *any* change to the data set, the query cache should
> be marked dirty and reloaded ... mark it dirty on any update, delete or
> insert ...
>
> So, if I have 1000 *pure* SELECTs, the cache is fine ... as soon as one
> U/I/D pops up, its invalidated ...
    But in that case, why not cache the entire HTML result for
    the URL or search request? That'd save some wasted cycles in
    Tomcat as well.
 


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #






Re: Database Caching

From
mlw
Date:
Tom Lane wrote:
> 
> mlw <markw@mohawksoft.com> writes:
> > My previous company, www.dmn.com, has a music database system. We logged all
> > the backend info; most of the queries were duplicated many times. This can be
> > explained by multiple users interested in the same thing or the same user
> > hitting "next page"
> 
> > If you could cache the "next page" or similar hit results, you could really
> > increase the throughput and capacity of a website.
> 
> Sure, but the most appropriate place to do that sort of thing is in the
> application (in this case, probably a cgi/php-ish layer).  Only the
> application can know what its requirements are.  In the case you
> describe, it'd be perfectly okay for a "stale" cache result to be
> delivered that's a few minutes out of date.  Maybe a few hours out of
> date would be good enough too, or maybe not.  But if we do this at the
> database level then we have to make sure it won't break *any*
> applications, and that means the most conservative validity assumptions.
> (Thus all the angst about how to invalidate cache entries on-the-fly.)
> 
> Likewise, the application has a much better handle than the database on
> the issue of which query results are likely to be worth caching.
> 
> I think that reports of "we sped up this application X times by caching
> query results on the client side" are interesting, but they are not good
> guides to what would happen if we tried to put a query-result cache into
> the database.

I would like to respectfully differ with you here. If query results are cached
in an ACID safe way, then many things could be improved.

The problem with applications caching is that they do not have intimate
knowledge of the database, and thus do not know when their cache is invalid. On
top of that, many web sites have multiple web servers connected to a single
database. The caching must sit between the web software and the DB. The logical
place for caching is in the database.

If we went even further, and cached multiple levels of query, i.e. the result
of the sub-select within the whole query, then things like views and more
complex queries could get an increase in performance.

Take this query:

select * from (select * from T1 where field = 'fubar') as Z right outer join
  (select alt from T2, (select * from T1 where field = 'fubar') as X
   where T2.key = X.key) as Y on T3.key = Y.key) on (Z.key = Y.alt)
where Z.key = NULL;


Forgive this query, it is probably completely wrong, the actual query it is
intended to represent is quite a bit larger. The intention is to select a set
of alternate values based on a set of initial values, but also eliminating any
alternate values which may also be in the initial set. Anyway, we have to query
"Select * from T1 where field = 'fubar'" twice.

If that subselect could be cached, it could speed up the query a bit. Right now
I use a temp table, which is a hassle.

Caching results can and does speed up duplicate queries; there can really be no
argument about it. The argument is about the usefulness of the feature and the
cost of implementing it. If maintaining the cache costs more than the benefit
of having it, obviously it is a loser. If implementing it takes up the
biological CPU cycles of the development team that would be spent doing more
important things, then it is also a loser. If however, it is relatively "easy"
(hehe) to do, and doesn't affect performance greatly, is there any harm in
doing so?


Re: Database Caching

From
Stephan Szabo
Date:
> >     I  wonder how this sort of query result caching could work in
> >     our MVCC and visibility world  at  all.  Multiple  concurrent
> >     running  transactions  see  different snapshots of the table,
> >     hence different result sets for  exactly  one  and  the  same
> >     querystring  at the same time ... er ...  yeah, one cache set
> >     per query/snapshot combo, great!
> >
> >     To really gain some speed with this sort of query cache, we'd
> >     have to adopt the #1 MySQL design rule "speed over precision"
> >     and ignore MVCC for query-cached relations, or what?
>
> Actually, you are missing, I think, as is everyone, the 'semi-static'
> database ... you know?  the one where data gets dumped to it by a script
> every 5 minutes, but between dumps, there are hundreds of queries per
> second/minute between the updates that are the same query repeated each
> time ...
>
> As soon as there is *any* change to the data set, the query cache should
> be marked dirty and reloaded ... mark it dirty on any update, delete or
> insert ...
>
> So, if I have 1000 *pure* SELECTs, the cache is fine ... as soon as one
> U/I/D pops up, its invalidated ...

The question is, when it's invalidated, how does it become valid again?
I don't see that there's a way to do it only by query string that doesn't
result in meaning that the cache cannot cache a query again until any
transactions that can see the prior state are finished since otherwise
you'd be providing the incorrect results to that transaction. But I
haven't spent much time thinking about it either.



Re: Database Caching

From
Justin Clift
Date:
Hi guys,

Stephan Szabo wrote:
<snip> 
> The question is, when it's invalidated, how does it become valid again?
> I don't see that there's a way to do it only by query string that doesn't
> result in meaning that the cache cannot cache a query again until any
> transactions that can see the prior state are finished since otherwise
> you'd be providing the incorrect results to that transaction. But I
> haven't spent much time thinking about it either.

It seems like a good idea to me, but only if it's optional.  It could
get in the way for systems that don't need it, but would be really
beneficial for some types of systems which are read-only or mostly
read-only (with consistent queries) in nature.

E.g., let's take a web page where clients can look up which of 10,000
records are .biz, .org, .info, or .com.

So, we have a database query of simply:

SELECT name FROM sometable WHERE tld = 'biz';

And let's say 2,000 records come back, which are cached.

Then the next query comes in, which is:

SELECT name FROM sometable WHERE tld = 'info';

And let's say 3,000 records come back, which are also cached.

Now, both of these queries are FULLY cached.  So, if either query
happens again, it's a straight memory read and dump, no disk activity
involved, etc. (very fast in comparison).

Now, let's say a transaction that involves a change to "sometable"
COMMITs.  This should invalidate these results in the cache, as the
viewpoint of the transaction could now be incorrect (there might now be
fewer, more, or different results for .info or .biz).  The next queries
will be cached too, and will keep being cached until the next
transaction involving a change to "sometable" COMMITs.

In this type of database access, this looks like a win.

But caching results in this manner could be a memory killer for those
applications whose queries aren't so predictable, or that are not so
read-only.  That's why I feel it should be optional, but I also feel it
should be added, given what look like massive wins without data
integrity or reliability issues.
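A toy sketch of the invalidate-on-COMMIT scheme described above (class and names invented here; this is not backend code): cache SELECT results by query string, and when a write to a table commits, drop every cached entry that reads from that table.

```python
# Hypothetical illustration of the scheme above: a string-keyed result
# cache that is flushed, per table, whenever a write to that table COMMITs.
class QueryCache:
    def __init__(self):
        self._cache = {}              # query string -> (tables read, rows)

    def lookup(self, query):
        hit = self._cache.get(query)
        return hit[1] if hit else None

    def store(self, query, tables, rows):
        self._cache[query] = (set(tables), rows)

    def on_commit(self, table):       # a change to `table` just COMMITted
        self._cache = {q: v for q, v in self._cache.items()
                       if table not in v[0]}

qc = QueryCache()
qc.store("SELECT name FROM sometable WHERE tld = 'biz'", ["sometable"], ["x.biz"])
qc.store("SELECT name FROM sometable WHERE tld = 'info'", ["sometable"], ["y.info"])
assert qc.lookup("SELECT name FROM sometable WHERE tld = 'biz'") == ["x.biz"]

qc.on_commit("sometable")             # a write to "sometable" COMMITs
assert qc.lookup("SELECT name FROM sometable WHERE tld = 'biz'") is None
assert qc.lookup("SELECT name FROM sometable WHERE tld = 'info'") is None
```

Both cached result sets vanish on the first commit, exactly as in the .biz/.info example.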

Hope this helps.

:-)

Regards and best wishes,

Justin Clift


-- 
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."  - Indira Gandhi


Re: Database Caching

From
Stephan Szabo
Date:
On Sat, 2 Mar 2002, Justin Clift wrote:

> Hi guys,
>
> Stephan Szabo wrote:
> <snip>
> > The question is, when it's invalidated, how does it become valid again?
> > I don't see that there's a way to do it only by query string that doesn't
> > result in meaning that the cache cannot cache a query again until any
> > transactions that can see the prior state are finished since otherwise
> > you'd be providing the incorrect results to that transaction. But I
> > haven't spent much time thinking about it either.
>
> E.g., let's take a web page where clients can look up which of 10,000
> records are .biz, .org, .info, or .com.
>
> So, we have a database query of simply:
>
> SELECT name FROM sometable WHERE tld = 'biz';
>
> And let's say 2,000 records come back, which are cached.
>
> Then the next query comes in, which is:
>
> SELECT name FROM sometable WHERE tld = 'info';
>
> And let's say 3,000 records come back, which are also cached.
>
> Now, both of these queries are FULLY cached.  So, if either query
> happens again, it's a straight memory read and dump, no disk activity
> involved, etc. (very fast in comparison).
>
> Now, let's say a transaction that involves a change to "sometable"
> COMMITs.  This should invalidate these results in the cache, as the
> viewpoint of the transaction could now be incorrect (there might now be
> fewer, more, or different results for .info or .biz).  The next queries
> will be cached too, and will keep being cached until the next
> transaction involving a change to "sometable" COMMITs.

But, if there's a transaction that started before the change committed,
then you may have two separate sets of possible results for the same query
string so query string doesn't seem unique enough to describe a set of
results.  Maybe I haven't read carefully enough, but most of the proposals
seem to gloss over this point.
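Stephan's point can be made concrete with a toy snapshot model (pure Python, invented here for illustration; nothing below is PostgreSQL code): two concurrent transactions legitimately get different rows for the same query string, so the string alone cannot key the cache.

```python
# Toy MVCC model: each row carries the transaction id (xmin) that created
# it, and a snapshot sees only rows created at or before it. A cache keyed
# on the query string alone serves one transaction the other's answer;
# keying on (query string, snapshot) restores correctness.
rows = [("a", 1), ("b", 1)]      # (value, xmin): visible when xmin <= snapshot
def run_query(snapshot):
    return [v for v, xmin in rows if xmin <= snapshot]

old_txn_snapshot = 1             # transaction that started before the change
rows.append(("c", 2))            # a later transaction inserts and commits
new_txn_snapshot = 2

q = "SELECT v FROM t"
string_cache = {q: run_query(new_txn_snapshot)}   # cached by newer snapshot
# The older transaction must NOT see row "c", but a string-keyed cache has it:
assert string_cache[q] != run_query(old_txn_snapshot)

snapshot_cache = {(q, s): run_query(s)
                  for s in (old_txn_snapshot, new_txn_snapshot)}
assert snapshot_cache[(q, old_txn_snapshot)] == ["a", "b"]
assert snapshot_cache[(q, new_txn_snapshot)] == ["a", "b", "c"]
```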




Re: Database Caching

From
Greg Copeland
Date:
I'm sneaking out of my cave here.. ;)

mlw wrote:
> Tom Lane wrote:
> 
>>mlw <markw@mohawksoft.com> writes:
>>
>>>My previous company, www.dmn.com, has a music database system. We logged all
>>>the backend info; most of the queries were duplicated many times. This can be
>>>explained by multiple users interested in the same thing, or the same user
>>>hitting "next page".
>>>
>>>If you could cache the "next page" or similar hit results, you could really
>>>increase throughput and capacity of a website.

Well, I'm going to assume that the records in question have been 
completely cached in cases like this.  So isn't the primary source of 
improvement the query parse and plan generation?  As Tom seems to think, 
wouldn't it make more sense to optimize the parse/plan generation rather 
than caching the result set?  After all, if the plan can be pinned, how
much of a performance boost do you expect to get from processing a
cached plan versus returning a cached result set?

Seriously, I am curious as to what the expected return is.  Still a
multiple, or simply some minor percentage?


>>>
>>Sure, but the most appropriate place to do that sort of thing is in the
>>application (in this case, probably a cgi/php-ish layer).  Only the
>>application can know what its requirements are.  In the case you
>>describe, it'd be perfectly okay for a "stale" cache result to be
>>delivered that's a few minutes out of date.  Maybe a few hours out of
>>date would be good enough too, or maybe not.  But if we do this at the
>>database level then we have to make sure it won't break *any*
>>applications, and that means the most conservative validity assumptions.
>>(Thus all the angst about how to invalidate cache entries on-the-fly.)
>>
>>Likewise, the application has a much better handle than the database on
>>the issue of which query results are likely to be worth caching.
>>
>>I think that reports of "we sped up this application X times by caching
>>query results on the client side" are interesting, but they are not good
>>guides to what would happen if we tried to put a query-result cache into
>>the database.
>>
> 
> I would like to respectfully differ with you here. If query results are cached
> in an ACID safe way, then many things could be improved.
> 
> The problem with applications caching is that they do not have intimate
> knowledge of the database, and thus do not know when their cache is invalid. On
> top of that, many web sites have multiple web servers connected to a single
> database. The caching must sit between the web software and the DB. The logical
> place for caching is in the database.
> 


But hybrid application cache designs can mostly, if not completely,
address this, and they also get some added benefits in many cases.  If
you have a "cache" table which records the tables involved in the
cached results that you desire, you can then update its state via
triggers (or even exposed procedures) to reflect whether the client
side cache has been invalidated.  This means that a client need only
query the cache table first to determine whether its cache is clean or
dirty.  When it's dirty, it merely needs to run the query again.
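The hybrid design above can be sketched with a trigger-maintained state table. SQLite via Python is used here for brevity, and every name (`cache_state`, `bump`, the version column) is hypothetical:

```python
import sqlite3

# Sketch of the hybrid scheme: a small cache_state table tracks a version
# per base table; a trigger bumps it on writes, so a client can tell with
# one cheap query whether its cached result set is still clean.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sometable (tld TEXT, name TEXT);
    CREATE TABLE cache_state (tab TEXT PRIMARY KEY, version INTEGER);
    INSERT INTO cache_state VALUES ('sometable', 0);
    CREATE TRIGGER bump AFTER INSERT ON sometable
        BEGIN UPDATE cache_state SET version = version + 1
              WHERE tab = 'sometable'; END;
""")

def version():
    return conn.execute(
        "SELECT version FROM cache_state WHERE tab = 'sometable'"
    ).fetchone()[0]

cached_at = version()
cached_rows = conn.execute("SELECT name FROM sometable").fetchall()
assert version() == cached_at          # clean: safe to reuse cached_rows

conn.execute("INSERT INTO sometable VALUES ('biz', 'example')")
assert version() != cached_at          # dirty: the client must requery
```

The cheap single-row check against `cache_state` is the whole point: the client avoids refetching the big result set until the trigger says it must.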

Let's also not forget that client side caching can yield significant
networking performance improvements over a result set that is cached
on the server.  Why?  Well, let's say a query has a result set of
10,000 rows which are being cached on the server.  A remote client
queries and fetches 10,000 results over the network.  Even though the
result set is cached by the database, it is still being transferred
over the wire for each and every query.  Now let's assume that 10
other people perform this same query.  That's 100,000 rows which get
transferred across the wire.  With the client side caching scheme, you
have 10,010 rows (the initial 10,000-row result set plus a single-row
result set per repeat query indicating the status of the cache)
returned across the wire, telling each client whether its cache is
clean or dirty.
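The traffic arithmetic above, spelled out (assuming, as the paragraph does, that the ten repeat queries come from clients that already hold the initial result set):

```python
# Back-of-envelope version of the argument above: resending a cached
# result set over the wire versus a one-row clean/dirty check per query.
rows_per_result = 10_000
repeat_queries = 10

# Server-side result cache: the full set is resent for every repeat.
server_cache_rows_on_wire = rows_per_result * repeat_queries        # 100,000

# Client-side cache: one full fetch, then a 1-row status check per repeat.
client_cache_rows_on_wire = rows_per_result + repeat_queries * 1    # 10,010

print(server_cache_rows_on_wire, client_cache_rows_on_wire)
```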

Let's face it: in order for the cache to make sense, the same result
set needs to be used over and over again.  In real-world situations of
that kind, a strong client-side hybrid caching scheme would seem to
win in most cases.

I'd also like to toss out that I'd expect a trade-off somewhere
between the data cache and the result set cache.  On systems without
infinite memory, where's the line of delineation?  It seems that
somewhere you may be limiting the size of the generalized cache at the
expense of the cached result sets.  If this happens, a cached result
set may be improved, but refreshing that result set may be hindered,
as may all other queries on the system.


> If we went even further, and cached multiple levels of query, i.e. the result
> of the sub-select within the whole query, then things like views and more
> complex queries would could get an increase in performance.
> 
> Take this query:
> 
> select * from (select * from T1 where field = 'fubar') as Z right outer join
>     (select alt from T2,  (select * from T1 where field = 'fubar') as X where
> T2.key = X.key) as Y 
>     on T3.key = Y.key) on (Z.key = Y.alt) where Z.key = NULL;
> 
> 
> Forgive this query; it is probably completely wrong. The actual query it is
> intended to represent is quite a bit larger. The intention is to select a set
> of alternate values based on a set of initial values, while also eliminating
> any alternate values which may also be in the initial set. Anyway, we have to
> run "SELECT * FROM T1 WHERE field = 'fubar'" twice.
> 
> If that subselect could be cached, it could speed up the query a bit. Right now
> I use a temp table, which is a hassle.
> 

It's funny you say that, because I was thinking that should server
side result set caching truly be desired, wouldn't the use of
triggers, a procedure for the client interface, and a temporary table
be a poor man's implementation yielding almost the same results?
Though, I must say I'm assuming that the queries will *be* the same
and not just *nearly* the same.  But in the web world, isn't this
really the situation we're trying to address?  That is, "give me the
front page"?


> Caching results can and does speed up duplicate queries; there can really be
> no argument about that. The argument is about the usefulness of the feature
> and the cost of implementing it. If maintaining the cache costs more than the
> benefit of having it, it is obviously a loser. If implementing it takes up
> biological CPU cycles of the development team that would be better spent
> doing more important things, then it is also a loser. If, however, it is
> relatively "easy" (hehe) to do, and doesn't affect performance greatly, is
> there any harm in doing so?
> 


Are any of the ideas that I put forth a viable substitute?

Greg



Re: Database Caching

From
Hannu Krosing
Date:
On Fri, 2002-03-01 at 04:44, Dann Corbit wrote:
> As I've commented before, I have deep misgivings about the idea of a
> query-result cache, too.
> >>
> I certainly agree with Tom on both counts.
> 
> Think of the extra machinery that would be needed to retain full
> relational integrity with a result cache...
> 
> Then think of how easy it is to write your own application that caches
> results if that is what you are after and you know (for some reason)
> that it won't matter if the database gets updated.

That would be trivial indeed.

The tricky case is when you don't know when and how the database will be
updated. That would need an insert/update/delete trigger on each and
every table that contributes to the query, either explicitly or through
rule expansion. Doing that from the client side would a) be difficult and
b) probably be too slow to be of any use. To do it in a general fashion
would also need a way to get the expanded query tree for a query, to see
which tables the query depends on.
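The dependency question is the hard part: to invalidate correctly, something must know every table a query reads. A regex scrape (a deliberately crude sketch, invented here) misses views and rule expansion, which is exactly why the server's expanded query tree is the reliable source:

```python
import re

# Naive illustration of the dependency problem: scrape table names after
# FROM/JOIN. A view hides its base tables from this approach entirely,
# so client-side invalidation built on it would silently miss changes.
def referenced_tables(sql):
    return set(re.findall(r"\b(?:from|join)\s+([A-Za-z_][A-Za-z0-9_]*)",
                          sql, re.IGNORECASE))

q = "SELECT alt FROM T2 JOIN T1 ON T2.key = T1.key"
print(sorted(referenced_tables(q)))   # → ['T1', 'T2']

v = "SELECT * FROM some_view"         # the view's base tables stay invisible
print(sorted(referenced_tables(v)))   # → ['some_view'], not the real tables
```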

> I don't see how result caching can be a win, since it can be done when
> needed anyway, without adding complexity to the database engine.  Just
> have the application cache the result set.  Certainly a web server could
> do this, if needed.

A more advanced application server can do it, all right. But you still
need sound cache invalidation mechanisms.

> If there were a way to mark a database as read only (and this could not
> be changed unless the entire database is shut down and restarted in
> read/write mode) then there might be some utility to result set cache.
> Otherwise, I think it will be wasted effort.  It might be worthwhile to
> do the same for individual tables (with the same sort of restrictions).
> But think of all the effort that would be needed to do this properly,
> and what sort of payback would be received from it?

The payback will be blazingly fast Slashdot-type applications with very
little effort from the end application programmer.

> Again, the same goals can easily be accomplished without having to
> perform major surgery on the database system.  I suspect that there is
> some logical reason that Oracle/Sybase/IBM/Microsoft have not bothered
> with it.

I think they were designed/developed when the WWW was nonexistent, and
client-server meant a system where client and server were separated by
a (slow) network connection that would negate most of the benefit of
server-side caching. In today's application server scenario the client
(AS) and server (DB) are usually very close, if not on the same computer,
and thus an effectively managed cache can be kept in the DB to avoid all
the cache invalidation logic going through the DB-AS link.

> I am ready to be convinced otherwise if I see a logical reason for it.
> But with the current evidence, I don't see any compelling reason to put
> effort in that direction.

In what direction are _you_ planning to put your effort?

------------
Hannu



Re: Database Caching

From
"Rod Taylor"
Date:
> The tricky case is when you don't know when and how the database will be
> updated. That would need an insert/update/delete trigger on each and
> every table that contributes to the query, either explicitly or through
> rule expansion. Doing that from the client side would a) be difficult and
> b) probably be too slow to be of any use. To do it in a general fashion
> would also need a way to get the expanded query tree for a query, to see
> which tables the query depends on.

Rather than result caching, I'd much rather see an asynchronous NOTICE
telling my webservers which have RULES firing them off when a table is
modified.

Let the webserver hold the cache (as it does now in my case, and in
Slashdot's), but this gives the database a way to tell all those
involved to drop the cache and rebuild.  Currently I accomplish this
with a timestamp on a single-row table.  I could probably accomplish it
with a periodic SELECT TRUE and watch for the notice -- but in my case
I need to support other dbs as well.



Re: Database Caching

From
Tom Lane
Date:
"Rod Taylor" <rbt@zort.ca> writes:
> Sorry, NOTIFY -- not NOTICE  (damn keyboard...)
> Right now we're required to do a select against the database
> periodically

Certainly not; I fixed that years ago (I think it was the first
nontrivial thing I ever did with the PostgreSQL code).  You can
execute PQnotifies without any PQexecs, and you can even have your
application sleep waiting for a notify to come in.  See the libpq
docs under the heading of "Asynchronous Notification".
        regards, tom lane


Re: Database Caching

From
Tom Lane
Date:
"Rod Taylor" <rbt@zort.ca> writes:
> Rather than result caching, I'd much rather see an asynchronous NOTICE
> telling my webservers which have RULES firing them off when a table is
> modified.

LISTEN/NOTIFY?
        regards, tom lane


Re: Database Caching

From
"Rod Taylor"
Date:
Sorry, NOTIFY -- not NOTICE  (damn keyboard...)

Right now we're required to do a select against the database
periodically, which means a test is required before hitting any cached
elements.  By the time you select true, you might as well do the real
select anyway (normally a simple index lookup).

The ability to receive one without making a query first would be very
advantageous.
--
Rod Taylor

This message represents the official view of the voices in my head

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Rod Taylor" <rbt@zort.ca>
Cc: "Hannu Krosing" <hannu@krosing.net>; "Dann Corbit"
<DCorbit@connx.com>; "Greg Sabino Mullane" <greg@turnstep.com>;
<pgsql-hackers@postgresql.org>
Sent: Monday, March 04, 2002 4:50 PM
Subject: Re: [HACKERS] Database Caching


> "Rod Taylor" <rbt@zort.ca> writes:
> > Rather than result caching, I'd much rather see an asynchronous NOTICE
> > telling my webservers which have RULES firing them off when a table is
> > modified.
>
> LISTEN/NOTIFY?
>
> regards, tom lane
>



Cache invalidation notification (was: Database Caching)

From
Hannu Krosing
Date:
On Mon, 2002-03-04 at 23:50, Tom Lane wrote:
> "Rod Taylor" <rbt@zort.ca> writes:
> > Rather than result caching, I'd much rather see an asynchronous NOTICE
> > telling my webservers which have RULES firing them off when a table is
> > modified.
> 
> LISTEN/NOTIFY?

But is there an easy way to see which tables affect the query result,
something like a machine-readable EXPLAIN?

Another thing that I have thought about is adding a parameter to NOTIFY,
so that you can be told _what_ changed (there is a big difference
between being told that "somebody called" and that "Bob called").

There are two ways of doing it:

1) the "wire protocol compatible" way, where the argument to LISTEN is
interpreted as a regular expression (or LIKE expression), so that you
can do

LISTEN 'ITEM_INVALID:%';

and then receive all notifies for

NOTIFY 'ITEM_INVALID:' || ITEM_ID;

and

NOTIFY 'ITEM_INVALID:ALL';

where the notify comes in as one string

2) the more general way, where you listen on an exact "relation" and
NOTIFY has an argument at both the syntax and protocol level, i.e.

LISTEN ITEM_INVALID;

and

NOTIFY 'ITEM_INVALID',ITEM_ID;

NOTIFY 'ITEM_INVALID','ALL';
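Option 1 can be mocked up outside the server to show the intended matching behaviour. This is pure Python with invented names; `fnmatch`-style patterns stand in for the regex/LIKE matching, and nothing here touches libpq:

```python
import fnmatch

# Toy mock of pattern-based LISTEN: a listener registers a channel
# pattern, and a NOTIFY fires every listener whose pattern matches
# the channel string.
class NotifyBus:
    def __init__(self):
        self.listeners = []               # (pattern, callback) pairs

    def listen(self, pattern, callback):
        self.listeners.append((pattern, callback))

    def notify(self, channel):
        for pattern, cb in self.listeners:
            if fnmatch.fnmatch(channel, pattern):
                cb(channel)

seen = []
bus = NotifyBus()
bus.listen("ITEM_INVALID:*", seen.append)  # LISTEN 'ITEM_INVALID:%'
bus.notify("ITEM_INVALID:42")              # NOTIFY 'ITEM_INVALID:' || ITEM_ID
bus.notify("OTHER_EVENT")                  # not delivered to this listener
print(seen)   # → ['ITEM_INVALID:42']
```

The listener learns not just that *something* was invalidated but *which* item, which is the whole "Bob called" point.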

------------------
Hannu




Re: Cache invalidation notification (was: Database Caching)

From
"Rod Taylor"
Date:
You could accomplish that with rules.

Make a rule with the where clause you like, then NOTIFY the client.
--
Rod Taylor

This message represents the official view of the voices in my head

----- Original Message -----
From: "Hannu Krosing" <hannu@tm.ee>
To: "Tom Lane" <tgl@sss.pgh.pa.us>
Cc: "Rod Taylor" <rbt@zort.ca>; "Hannu Krosing" <hannu@krosing.net>;
"Dann Corbit" <DCorbit@connx.com>; "Greg Sabino Mullane"
<greg@turnstep.com>; <pgsql-hackers@postgresql.org>
Sent: Tuesday, March 05, 2002 2:45 AM
Subject: [HACKERS] Cache invalidation notification (was: Database
Caching)


> On Mon, 2002-03-04 at 23:50, Tom Lane wrote:
> > "Rod Taylor" <rbt@zort.ca> writes:
> > > Rather than result caching, I'd much rather see an asynchronous NOTICE
> > > telling my webservers which have RULES firing them off when a table is
> > > modified.
> >
> > LISTEN/NOTIFY?
>
> But is there an easy way to see which tables affect the query result,
> something like a machine-readable EXPLAIN?
>
> Another thing that I have thought about is adding a parameter to NOTIFY,
> so that you can be told _what_ changed (there is a big difference
> between being told that "somebody called" and that "Bob called")
>
> There are two ways of doing it:
>
> 1) the "wire protocol compatible" way, where the argument to LISTEN is
> interpreted as a regular expression (or LIKE expression), so that you
> can do
>
> LISTEN 'ITEM_INVALID:%';
>
> and then receive all notifies for
>
> NOTIFY 'ITEM_INVALID:' || ITEM_ID;
>
> and
>
> NOTIFY 'ITEM_INVALID:ALL';
>
> where the notify comes in as one string
>
> 2) the more general way, where you listen on an exact "relation" and
> NOTIFY has an argument at both the syntax and protocol level, i.e.
>
> LISTEN ITEM_INVALID;
>
> and
>
> NOTIFY 'ITEM_INVALID',ITEM_ID;
>
> NOTIFY 'ITEM_INVALID','ALL';
>
> ------------------
> Hannu
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
> http://archives.postgresql.org
>



Re: Database Caching

From
F Harvell
Date:
On 01 Mar 2002 10:24:39 +0500, Hannu Krosing wrote:
> ...
> 
> > I don't see how result caching can be a win, since it can be done when
> > needed anyway, without adding complexity to the database engine.  Just
> > have the application cache the result set.  Certainly a web server could
> > do this, if needed.
> 
> More advanced application server can do it all right. But you still need
> sound cache invalidation mechanisms.
> 
> > If there were a way to mark a database as read only (and this could not
> > be changed unless the entire database is shut down and restarted in
> > read/write mode) then there might be some utility to result set cache.
> > Otherwise, I think it will be wasted effort.  It might be worthwhile to
> > do the same for individual tables (with the same sort of restrictions).
> > But think of all the effort that would be needed to do this properly,
> > and what sort of payback would be received from it?
> 
> The payback will be blazingly fast slashdot type applications with very
> little effort from end application programmer.
> 
> > Again, the same goals can easily be accomplished without having to
> > perform major surgery on the database system.  I suspect that there is
> > some logical reason that Oracle/Sybase/IBM/Microsoft have not bothered
> > with it.
> 
> I think they were designed/developed when the WWW was nonexistent, and
> client-server meant a system where client and server were separated by
> a (slow) network connection that would negate most of the benefit of
> server-side caching. In today's application server scenario the client
> (AS) and server (DB) are usually very close, if not on the same computer,
> and thus an effectively managed cache can be kept in the DB to avoid all
> the cache invalidation logic going through the DB-AS link.
> 
> ...

You have identified one of the most valuable reasons for query/result
caching.  As I read the threads going on about caching, it reminds me
of the arguments within MySQL about the need for referential integrity
and transactions.  Yes, the application can do it; however, by having
the database do it, the application programmer is freed to perform
more important tasks (e.g., more features).

Your insights about the caching, etc. in the older, legacy databases
are very likely correct.  With the development of high-speed
networking, etc., it is now very feasible to move the caching closer
to the sources of change/invalidation.

Quite frankly, while there are literally millions of applications
that would find minimal value in query/results caching, I know that,
at least for web applications, query/results caching _would_ be very
valuable.

Certainly, as has been pointed out several times, the application can
do the caching; however, utilizing today's "generic" tools such as
Apache and PHP, it is relatively difficult.  By moving caching to the
database server, there is little that needs to be done by the
application to realize the benefits.

For example, with my typical application, as a straight dynamic
application, the process would be approximately:
 1) receive the request "http://www.xyz.com/index.php"
 2) query the current page contents from the database:
select * from table where dateline = '2002-03-05';
 3) begin the HTML output
 4) loop through the select results printing the contents
 5) complete the HTML output

With caching within the database, this would likely achieve
performance good enough to serve millions of requests per day.  (This
is the case for some of our sites using Intersystems Cache which does
do query caching and serves over 2 million page views per day.)

Without the database caching, it is necessary to have the application
perform this function.  Because of the short life of an Apache/PHP
process, caching needs to be performed externally to the application:
 1) receive the request "http://www.xyz.com/index.php"
 2) check the age of the cache file (1min, 5min ???):
 3) if the cache is not fresh:
 3.1) lock the cache file
 3.2) query the database
 3.3) begin HTML output to the cache file
 3.4) loop through select results
 3.5) complete the HTML output to the cache file
 3.6) unlock the cache file
 4) open the cache file (i.e., wait for locked file)
 5) read/passthrough cache contents

(Please note that this is one, simplified way to do the caching.  It
assumes that the data becomes stale over time and needs to be
refreshed.  It also uses a file cache instead of a memory cache.
Certainly things could be made more efficient through the use of
notifications from the database and the use of shared memory.  Another
path would be to have an external program generating the pages and
then placing the "static" pages into service (which requires changes
to Apache and/or the OS due to cache file invalidation during the
generation process).  Both paths make the application more complex,
requiring more programming time.)
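The numbered file-cache loop above can be sketched in a few lines (Python rather than Apache/PHP, with a hypothetical cache path, lifetime, and page content):

```python
import fcntl
import os
import tempfile
import time

CACHE = os.path.join(tempfile.mkdtemp(), "index.cache")  # hypothetical path
MAX_AGE = 60                                             # "1min, 5min ???"

def query_database():
    # Stand-in for step 3.2:
    #   select * from table where dateline = '2002-03-05';
    return "<html>rows for dateline 2002-03-05</html>"

def serve_page():
    stale = (not os.path.exists(CACHE)
             or time.time() - os.path.getmtime(CACHE) > MAX_AGE)
    if stale:
        with open(CACHE, "w") as f:
            fcntl.flock(f, fcntl.LOCK_EX)   # 3.1 lock the cache file
            f.write(query_database())       # 3.2-3.5 rebuild the HTML
            fcntl.flock(f, fcntl.LOCK_UN)   # 3.6 unlock
    with open(CACHE) as f:                  # 4-5 read/passthrough contents
        return f.read()

page = serve_page()           # first hit: cache miss, rebuilds the file
assert serve_page() == page   # second hit: served from the fresh cache
```

As the note says, this is the time-based variant; a notification-driven or shared-memory variant would replace the age check.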

It is possible to have an application server do this caching for you.
Of course, that in turn has its own complications.  For applications
that do not need the added "features" of an application server,
database caching can be a big win in both processing/response time as
well as programmer time.

Thanks,
F Harvell




Re: Database Caching

From
"Rod Taylor"
Date:
> Certainly, as has been pointed out several times, the application
can
> do the caching, however, utilizing todays "generic" tools such as
> Apache and PHP, it is relatively difficult.  By moving caching to
the
> database server, there is little that needs to be done by the
> application to realize the benefits.

PHP has an amazingly easy to use shared memory resource.

Create a semaphore, and an index hash in position 0 which points to
the rest of the resources.

On query, lock memory, look up in the index, and pull the values if
they exist; otherwise run the db query and stuff the values in.

The tricky part is expiring the cache.  I store a time in position 1
and a table in the database.  To see if the cache should be expired I
do a select against the 'timer' table in the db.  If it's expired, I
clear the cache and let it be rebuilt.  A trigger updates the time on
change.

Doesn't work well at all for frequent updates -- but that's not what
it's for.


Took about 15 minutes to write the caching mechanism in a generic
fashion for all my applications.  Step 2 of my process was to cache
the generated HTML page itself so I generate it once.  I store it gzip
compressed, and serve it directly to browsers which support gzip
compression (most).  Pages are stored in shared memory using the above
mechanism.  This took about 40 minutes (PHP output buffer) and is
based on the uniqueness of the pages requested.  Transparent to the
application too (prepend and append code elements).  (Zend Cache does
this too, I think.)
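The page-cache idea translates roughly as follows (a Python dict standing in for PHP shared memory; the paths and page content are invented):

```python
import gzip

# Sketch of the scheme above: generate each page once, store it
# gzip-compressed, and hand the compressed bytes straight to clients
# that accept gzip (decompressing only for those that don't).
page_cache = {}

def render(path):
    return f"<html>generated page for {path}</html>".encode()

def get_page(path, accepts_gzip):
    if path not in page_cache:
        page_cache[path] = gzip.compress(render(path))  # generate once
    body = page_cache[path]
    return body if accepts_gzip else gzip.decompress(body)

first = get_page("/index.php", accepts_gzip=False)
assert first == render("/index.php")
assert gzip.decompress(get_page("/index.php", True)) == first
```

One generation, then every subsequent hit is a memory read, compressed on the wire for most browsers.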

Anyway, I don't feel like rewriting it as non-private code, but the
concept is simple enough.  Perhaps Zend or PEAR could write the above
data lookup wrappers to shared memory.  Although you don't even have
to worry about structures or layout, just the ID that represents the
location of your object in memory.

It can serve a couple million per hour using this on a E220 with lots
of ram -- bandwidth always runs out first.


Anyway, first suggestion is to buy Zend Cache (if available) to cache
entire HTML pages, not to cache db queries.  Even the great Slashdot
(which everyone appears to be comparing this to) uses a page cache to
reduce load and serves primarily static HTML pages which were
pregenerated.

I agree with the solution; I don't agree with the proposed location, as
caching DB queries only solves about 1/4 of the whole problem.  The
other parts being the network (millions of queries across 100mbit isn't
fast either) and generation of the non-static page from the static
(cached) query.  Build a few large (10k row) tables to see what I mean
about where the slowdown really is.



Re: Database Caching

From
"Steve Wolfe"
Date:
> > > I don't see how result caching can be a win, since it can be done when
> > > needed anyway, without adding complexity to the database engine.  Just
> > > have the application cache the result set.  Certainly a web server
could
> > > do this, if needed.
 There are a couple of catches with that idea.  First, we have thirty-plus
applications that we've written, and trying to go back and patch the caching
into all of them would be much more work than just doing it in the DB.
Secondly, changes to the data can be made from psql, other apps, or from
triggers, and hence there is no reliable way to deal with cache expiration.
Not only that, but if you have a pool of web servers, applications on each
one can be inserting/updating data, and none of the other machines have any
clue about that.
 Really, doing it in PG makes a lot of sense.  Doing it outside of PG has a
lot of problems.  At one point, I was resolved to do it in middleware.  I sat
down and planned out the entire thing, including a really nifty hash
structure that would make cache expiration and invalidation lightning-fast...
but I had focused entirely on the implementation, and when I realized all of
the drawbacks to doing it outside of the backend, I scrapped it.  That's the
first time I've ever really wished that I was a C programmer....
  In the worst-case scenario (never repeating a query), result caching
would have a very small overhead.  In any semi-realistic scenario, the
benefits are likely to be significant to extraordinary.

steve






Re: Database Caching

From
F Harvell
Date:
On Tue, 05 Mar 2002 12:46:27 EST, "Rod Taylor" wrote:
> 
> PHP has an amazingly easy to use Shared Memory resource.
> 
> ...
> 
> Anyway, first suggestion is to buy Zend Cache (if available) to cache
> entire HTML pages, not to cache db queries.  Even the great Slashdot
> (which everone appears to be comparing this to) uses a page cache to
> reduce load and serves primarily static HTML pages which were
> pregenerated.
 Thanks for the information about PHP.  It looks very useful.
 Just an FYI, Zend cache (which we use) only caches the prepared PHP
"code" in a shared memory buffer.  The code is still executed (just
not parsed) for each request.  "Finished" HTML pages are not cached.
This is still a terrific gain as the majority of the overhead with PHP
is the parsing/preparation.  We have seen execution rates jump from 8
pages/second to 40 pages/second.

> I agree with the solution, I don't agree with the proposed location as
> caching DB queries only solves about 1/4 of the whole problem.  The
> other being network (millions of queries across 100mbit isn't fast
either), and generation of the non-static page from the static
> (cached) query.  Build a few large (10k row) tables to see what I mean
> about where the slowdown really is.
 Certainly I agree that query caching does not provide the most
efficient solution for webpage caching.  Neither does DB based
referential integrity provide the most efficient solution for
integrity either.  Application driven referential integrity can have
orders of magnitude better performance (application, i.e., programmer,
knows what needs to happen, etc.).
 What query caching does (potentially) provide is a direct, data
driven solution to caching that relieves the programmer from having to
deal with the issue.  The solution brings the caching down to the data
layer where invalidation/caching occurs automatically and "correctly".
 This works well for most applications, at least until the provided
"solution" becomes/drives the next bottleneck.  This can occur with
referential integrity, transactions, data typing, etc. even now.  When
it does, it becomes incumbent upon the programmer to find a higher
level solution to the issue.
 It also likely would provide a measured level of benefit to most
applications.  I know that many, many people on this list have said
that after thinking about it they didn't think so, but, it has been my
experience (as a database engine user not a database engine
programmer) that people using any application tend to ask the same
query over and over and over again.  This has been true for both human
interaction and application-code interaction.  Because I
have been buried in the web application domain for the last 5 years, I
will defer to others about applicability to other domains.

Thanks,
F Harvell




Re: Database Caching

From
Bruce Momjian
Date:
Do we want to add "query caching" to the TODO list, perhaps with a
question mark?

---------------------------------------------------------------------------

Greg Sabino Mullane wrote:
[ There is text before PGP section. ]
> 
[ PGP not available, raw data follows ]
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
> 
> 
> I had to make a relatively long drive yesterday, so I had lots of free time 
> to do some thinking...and my thoughts were turning to caching and databases. 
> The following is what I came up with: forgive me if it seems to be just an 
> obvious ramble...
> 
> Why does a database need caching?
> 
> Normally, when one thinks of a database (or to be precise, a RDBMS) the 
> ACID acronym comes up. This is concerned with having a stable database that 
> can reliably be used by many users at the same time. Caching a query is 
> unintuitive because it involves sharing information from transactions that 
> may be separated by a great amount of time and/or by different users. 
> However, from the standpoint of the database server, caching increases 
> efficiency enormously. If 800 users all make the same query, then caching 
> can help the database server backend (hereafter simply "database") to 
> save part or all of the work it performs so it doesn't have to repeat the 
> entire sequence of steps 800 times.
> 
> What is caching?
> 
> Caching basically means that we want to save frequently-used information 
> into an easy to get to area. Usually, this means storing it into memory. 
> Caching has three main goals: reducing disk access, reducing computation 
> (i.e. CPU utilization), and speeding up the time as measured by how long it 
> takes a user to see a result.  It does all this at the expense of RAM, 
> and the tradeoff is almost always worth it.
> 
> In a database, there are three basic types of caching: query results, 
> query plans, and relations.
> 
> The first, query result caching, simply means that we store into memory 
> the exact output of a SELECT query for the next time that somebody performs 
> that exact same SELECT query. Thus, if 800 people do a "SELECT * FROM foo", 
> the database runs it for the first person, saves the results, and simply 
> reads the cache for the next 799 requests. This saves the database from doing 
> any disk access, practically removes CPU usage, and speeds up the query.
> 
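The result-caching behavior described above can be sketched as a small Python model (illustrative only, not PostgreSQL code; the table contents are made up):

```python
# Hypothetical model of a query-result cache: the first execution pays
# the full cost, later identical queries are served from memory.
class ResultCache:
    def __init__(self):
        self.store = {}          # query text -> cached rows

    def fetch(self, query, run_query):
        rows = self.store.get(query)
        if rows is None:         # miss: run the real query once
            rows = run_query(query)
            self.store[query] = rows
        return rows              # hit: no disk access, almost no CPU

cache = ResultCache()
calls = []

def run_query(q):
    calls.append(q)              # stands in for real query execution
    return [("vanilla",), ("mint",)]

for _ in range(800):
    rows = cache.fetch("SELECT * FROM foo", run_query)
# the database ran the query once; the other 799 requests hit the cache
```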
> The second, query plan caching, involves saving the results of the optimizer, 
> which is responsible for figuring out exactly "how" the database is going to 
> fetch the requested data. This type of caching usually involves a "prepared" 
> query, which has almost all of the information needed to run the query with 
> the exception of one or more "placeholders" (spots that are populated with 
> variables at a later time). The query could also involve non-prepared 
> statements as well. Thus, if someone prepares the query "SELECT flavor FROM 
> foo WHERE size=?", and then executes it by sending in 300 different values 
> for "size", the prepared statement is run through the optimizer, the r
> esulting path is stored into the query plan cache, and the stored path is 
> used for the 300 execute requests. Because the path is already known, the 
> optimizer does not need to be called, which saves the database CPU and time.
> 
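The prepare-and-execute flow in the paragraph above can be modeled in a few lines of Python (a sketch, not the actual optimizer; the "plan" here is a stand-in value):

```python
# Hypothetical sketch of a query-plan cache: the optimizer runs once per
# prepared statement, and each EXECUTE reuses the stored plan.
optimizer_calls = 0
plan_cache = {}                  # prepared query text -> stored plan

def optimize(query):
    global optimizer_calls
    optimizer_calls += 1
    return ("seqscan", query)    # stands in for a real access path

def execute(prepared, value):
    plan = plan_cache.get(prepared)
    if plan is None:             # plan the query only on first use
        plan = optimize(prepared)
        plan_cache[prepared] = plan
    return plan, value           # run the stored plan with the value

for size in range(300):
    execute("SELECT flavor FROM foo WHERE size=?", size)
# 300 executes, but the optimizer was called only once
```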
> The third, relation caching, simply involves putting the entire relation 
> (usually a table or index) into memory so that it can be read quickly. 
> This saves disk access, which basically means that it saves time. (This type 
> of caching also can occur at the OS level, which caches files, but that will 
> not be discussed here).
> 
> Those are the three basic types of caching, ways of implementing each are 
> discussed below. Each one should complement the other, and a query may be 
> able to use one, two, or all three of the caches.
> 
> I. Query result caching:
> 
> A query result cache is only used for SELECT queries that involve a 
> relation (i.e. not for "SELECT version"). Each cache entry has the following 
> fields: the query itself, the actual results, a status, an access time, an 
> access number, and a list of all included columns. (The column list actually 
> tells as much information as needed to uniquely identify it, i.e. schema, 
> database, table, and column). The status is merely an indicator of whether or 
> not this cached query is valid. It may not be, because it may be invalidated 
> for a user within a transaction but still be of use to others. 
> 
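The cache-entry record described above might look like the following (field names are illustrative, not PostgreSQL internals):

```python
# A sketch of a query-result cache entry: the query, its output, a
# validity flag, access statistics, and the columns it depends on.
import time
from dataclasses import dataclass, field

@dataclass
class CacheEntry:
    query: str                   # normalized query text
    results: list                # stored output rows
    valid: bool = True           # status flag used for invalidation
    access_time: float = field(default_factory=time.time)
    access_count: int = 1
    columns: frozenset = frozenset()  # (schema, table, column) tuples

entry = CacheEntry("SELECT bar FROM foo",
                   [("x",)],
                   columns=frozenset({("public", "foo", "bar")}))
```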
> When a select query is processed, it is first parsed apart into a basic common 
> form, stripping whitespace, standardizing case, etc., in order to facilitate 
> an accurate match. Note that no other pre-processing is really required, 
> since we are only interested in exact matches that produce the exact same 
> output. An advanced version of this would ideally be able to use the cached 
> output of "SELECT bar,baz FROM foo" when it receives the query "SELECT 
> baz,bar FROM foo", but that will require some advanced parsing. Possible, 
> but probably not something to attempt in the first iteration of a query 
> caching function. :) If there *is* a match (via a simple strcmp at first), 
> and the status is marked as "valid", then the database simply uses the 
> stored output, updates the access time and count, and exits. This should be 
> extremely fast, as no disk access is needed, and almost no CPU. The 
> complexity of the query will not matter either: a simple query will run just 
> as fast as something with 12 sorts and 28 joins.
> 
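The normalization step described above can be approximated like this (a deliberately naive version: lowercasing the whole string would also fold string literals, which a real implementation must avoid):

```python
# Minimal query normalization: collapse whitespace and fold case so
# textually identical queries compare equal with a plain string compare.
import re

def normalize(query):
    query = re.sub(r"\s+", " ", query.strip())  # collapse whitespace
    return query.lower()                        # standardize case

a = normalize("SELECT   bar,baz\nFROM foo")
b = normalize("select bar,baz from foo")
# the two spellings now match via a simple string comparison
```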
> If a query is *not* already in the cache, then after the results are found 
> and delivered to the user, the database will try and store them for the 
> next appearance of that query. First, the size of the cache will be compared 
> to the size of the query+output, to see if there is room for it. If there 
> is, the query will be saved, with a status of valid, a time of 'now', a count 
> of 1, a list of all affected columns found by parsing the query, and the total 
> size of the query+output. If there is no room, then it will try to delete one 
> or more to make room. Deleting can be done based on the oldest access time, 
> smallest access count, or size of the query+output. Some balance of the first 
> two would probably work best, with the access time being the most important. 
> Everything will be configurable, of course.
> 
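One way to combine access time and access count into a single eviction rule, as suggested above, is a weighted score (the weight here is an arbitrary assumption, and would be one of the configurable parameters):

```python
# Possible eviction rule for a full cache: weight recency most heavily,
# then access count; the entry with the highest score is evicted first.
def eviction_score(entry, now):
    age = now - entry["access_time"]            # seconds since last use
    return age - 10.0 * entry["access_count"]   # weight is an assumption

entries = [
    {"query": "q1", "access_time": 100.0, "access_count": 1},
    {"query": "q2", "access_time": 190.0, "access_count": 5},
]
victim = max(entries, key=lambda e: eviction_score(e, now=200.0))
# q1 is both older and less used, so it is chosen for eviction
```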
> Whenever a table is changed, the cache must be checked as well. A list of 
> all columns that were actually changed is computed and compared against 
> the list of columns for each query. At the first sign of a match, the 
> query is marked as "invalid." This should happen before the changes are made 
> to the table itself. We do not delete the query immediately since this may 
> be inside of a transaction, and subject to rollback. However, we do need 
> to mark it as invalid for the current user inside the current transaction: 
> thus, the status flag. When the transaction is commited, all queries that have 
> an "invalid" flag are deleted, then the tables are changed. Since the only 
> time a query can be flagged as "invalid" is inside your own transaction, 
> the deletion can be done very quickly.
> 
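The invalidation protocol in the paragraph above (mark invalid inside the transaction, delete on commit, restore on rollback) can be sketched as:

```python
# Hypothetical model of transactional cache invalidation: a write marks
# matching entries invalid; COMMIT deletes them, ROLLBACK restores them.
class TxnCache:
    def __init__(self):
        self.entries = {}    # query -> {"columns": set, "valid": bool}
        self.pending = set() # queries invalidated in this transaction

    def put(self, query, columns):
        self.entries[query] = {"columns": set(columns), "valid": True}

    def on_write(self, changed_columns):
        for query, e in self.entries.items():
            if e["valid"] and e["columns"] & set(changed_columns):
                e["valid"] = False      # invalid inside this transaction
                self.pending.add(query)

    def commit(self):
        for query in self.pending:      # drop invalidated entries for good
            del self.entries[query]
        self.pending.clear()

    def rollback(self):
        for query in self.pending:      # the write never happened
            self.entries[query]["valid"] = True
        self.pending.clear()

cache = TxnCache()
cache.put("SELECT bar FROM foo", {("foo", "bar")})
cache.on_write({("foo", "bar")})        # UPDATE touches foo.bar
cache.commit()                          # the stale entry is now gone
```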
> 
> II. Query plan caching
> 
> If a query is not cached, then it "falls through" to the next level of 
> caching, the query plan. This can either be automatic or strictly on a 
> user-requested format (i.e. through the prepare-execute paradigm). The latter 
> is probably better, but it also would not hurt much to store non-explicitly 
> prepared queries in this cache as long as there is room. This cache has a 
> field for the query itself, the plan to be followed (i.e. scan this table, 
> that index, sort the results, then group them), the columns used, the access 
> time, the access count, and the total size. It may also want a simple flag 
> of "prepared or non-prepared", where prepared indicates an explicitly 
> prepared statement that has placeholders for future values. A good optimizer 
> will actually change the plan based on the values plugged in to the prepared
> queries, so that information should become a part of the query itself as 
> needed, and multiple queries may exist to handle different inputs. In 
> general, most of the inputs will be similar enough to use the same path (e.g. 
> "SELECT flavor FROM foo WHERE size=?" will most usually result in a simple 
> numeric value for the executes). If a match *is* found, then the database 
> can use the stored path, and not have to bother calling up the optimizer 
> to figure it out. It then updates the access time, the access count, and 
> continues as normal. If a match was *not* found, then it might possibly 
> want to be cached. Certainly, explicit prepares should always be cached. 
> Non-explicitly prepared queries (those without placeholders) can also be 
> cached. In theory, some of this will also be in the result cache, so that 
> should be checked as well: if it is there, there is no reason to put it here. Prepared 
> queries should always have priority over non-prepared, and the rest of the 
> rules above for the result query should also apply, with a caveat that things 
> that would affect the output of the optimizer (e.g. vacuuming) should also 
> be taken into account when deleting entries.
> 
> 
> III. Relation caching
> 
> The final cache is the relation itself, and simply involves putting the entire 
> relation into memory. This cache has a field for the name of the relation, 
> the table info itself, the type (indexes should ideally be cached more than 
> tables, for example), the access time, and the access number. Loading could 
> be done automatically, but most likely should be done according to a flag 
> on the table itself or as an explicit command by the user.
> 
> 
> Notes:
> 
> The "strcmp" used may seem rather crude, as it misses all but the exact 
> same query, but it does cover most of the everyday cases. Queries are 
> usually called through an application that keeps it in the same format, 
> time after time, so the queries are very often exactly identical. A better 
> parser would help, of course, but it would get rather complicated quickly.
> Two quick examples: a web page that is read from a database is a query that 
> is called many times with exactly the same syntax; a user doing a "refresh" 
> to constantly check if something has changed since they last looked.
> 
> Sometimes a query may jump back to a previous type of cache, especially 
> for things like subselects. The entire subselect query may not match, 
> but the inner query should also be checked against the query result cache.
> 
> Each cache should have some configurable parameters, including the size in 
> RAM, the maximum number of entries, and rules for adding and deleting.
> They should also be directly viewable through a system table, so a DBA 
> can quickly see exactly which queries are being cached and how often 
> they are being used. There should be a command to quickly flush the cache, 
> remove "old" entries, and to populate the query plan cache via a prepare 
> statement. It should also be possible to do table changes without stopping 
> to check the cache: perhaps flushing the cache and setting a global 
> "cache is empty" flag would suffice.
> 
> Another problem is the prepare and execute: you are not always guaranteed 
> to get a cached prepare if you do an execute, as it may have expired or 
> there may simply be no more room. Those prepared statements inside a 
> transaction should probably be flagged as "non-deletable" until the 
> transaction is ended.
> 
> Storing the results of an execute in the query result cache is another 
> problem. When a prepare is made, the database returns a link to that 
> exact prepared statement in cache, so that all the client has to say is 
> "run the query at 0x4AB3 with the value "12". Ideally, the database 
> should be able to check these against the query result cache as well. It 
> can do this by reconstructing the resulting query (by plugging the value into 
> the prepared statement) or it can store the execute request as a type of 
> query itself; instead of "SELECT baz FROM bar WHERE size=12" it would 
> store "p0x4aB3:12".
> 
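Storing the execute request as a cache key in its own right, as described above, could look like this (the handle value is taken from the example in the text and is purely illustrative):

```python
# Sketch: treat an EXECUTE request as a result-cache key built from the
# prepared-statement handle plus its parameter values, so the full SQL
# text never needs to be reconstructed.
def execute_cache_key(handle, params):
    return handle + ":" + ",".join(repr(p) for p in params)

key = execute_cache_key("p0x4AB3", (12,))
# the key "p0x4AB3:12" identifies this exact execute in the result cache
```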
> 
> The result cache would probably be the easiest to implement, and also gives 
> the most "bang for the buck." The path cache may be a bit harder, but is 
> a very necessary feature. I don't know about the relation caching: it looks 
> to be fairly easy, and I don't trust that, so I am guessing it is actually 
> quite difficult.
> 
> Greg Sabino Mullane  greg@turnstep.com
> PGP Key: 0x14964AC8 200202281132
> 
> -----BEGIN PGP SIGNATURE-----
> Comment: http://www.turnstep.com/pgp.html
> 
> iD8DBQE8fqybvJuQZxSWSsgRAps9AKDwCkIH7GKSBjflyYSA0F7mQqD1MwCeJLCw
> hqE1SxJ2Z7RxFGCu3UwIBrI=
> =jlBy
> -----END PGP SIGNATURE-----
> 
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
> 
> http://www.postgresql.org/users-lounge/docs/faq.html
> 
[ Decrypting message... End of raw data. ]

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,    |  13 Roberts Road
  +  Christ can be your backup.       |  Newtown Square, Pennsylvania 19073


Re: Database Caching

From
"J. R. Nield"
Date:
I'm not sure about query result caching or 'relation caching', since the
first would seem to run into problems with concurrent updates, and the
second is sort-of what the buffer cache does.

Query plan caching sounds like a really good idea though. Neil Conway's
PREPARE patch already does this for an individual backend. Do you think
it would be hard to make it use shared memory, and check if a query has
already been prepared by another backend? Maybe it could use something
like a whitespace insensitive checksum for a shared hash key.
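A whitespace-insensitive checksum of the kind suggested here might be built by hashing the query after collapsing whitespace (a rough sketch; the hash choice and case folding are assumptions, not part of the proposal):

```python
# Sketch of a whitespace-insensitive hash key for a shared plan cache:
# cosmetically different spellings of a query map to the same key.
import hashlib
import re

def plan_hash(query):
    canonical = re.sub(r"\s+", " ", query.strip()).lower()
    return hashlib.sha1(canonical.encode()).hexdigest()

# "SELECT  1\n" and "select 1" now share one shared-memory hash key
```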

Regards,
John Nield

On Sun, 2002-08-25 at 20:15, Bruce Momjian wrote:
> 
> Do we want to add "query caching" to the TODO list, perhaps with a
> question mark?
> 
> ---------------------------------------------------------------------------
> 
> Greg Sabino Mullane wrote:
[snip]
> 
-- 
J. R. Nield
jrnield@usol.com





Re: Database Caching

From
Curt Sampson
Date:
On Sun, 25 Aug 2002, Bruce Momjian wrote:

> Do we want to add "query caching" to the TODO list, perhaps with a
> question mark?

I'd love to have query plans cached, preferably across backends.

cjs
-- 
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC
 



Re: Database Caching

From
Karel Zak
Date:
On Sun, Aug 25, 2002 at 09:35:24PM -0400, J. R. Nield wrote:
> I'm not sure about query result caching or 'relation caching', since the
> first would seem to run into problems with concurrent updates, and the
> second is sort-of what the buffer cache does.
> 
> Query plan caching sounds like a really good idea though. Neil Conway's
> PREPARE patch already does this for an individual backend. Do you think
> it would be hard to make it use shared memory, and check if a query has
> already been prepared by another backend? Maybe it could use something
> like a whitespace insensitive checksum for a shared hash key.
The original version of the query plan cache allows exactly this, but
after some discussion the shared memory usage in qcache was removed.
 
I think a better and more robust solution is to store cached plans in
backend memory and allow the backend to run persistently (meaning no
startup/stop for each client connection).
 
   Karel

--
Karel Zak  <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz