Re: Database Caching - Mailing list pgsql-hackers

From mlw
Subject Re: Database Caching
Date
Msg-id 3C7FA268.519257AF@mohawksoft.com
In response to Database Caching  ("Greg Sabino Mullane" <greg@turnstep.com>)
List pgsql-hackers
Tom Lane wrote:
> 
> mlw <markw@mohawksoft.com> writes:
> > My previous company, www.dmn.com, has a music database system. We logged all
> > the backend info; most of the queries were duplicated many times. This can be
> > explained by multiple users interested in the same thing, or the same user
> > hitting "next page".
> 
> > If you could cache the "next page" or similar hit results, you could really
> > increase the throughput and capacity of a website.
> 
> Sure, but the most appropriate place to do that sort of thing is in the
> application (in this case, probably a cgi/php-ish layer).  Only the
> application can know what its requirements are.  In the case you
> describe, it'd be perfectly okay for a "stale" cache result to be
> delivered that's a few minutes out of date.  Maybe a few hours out of
> date would be good enough too, or maybe not.  But if we do this at the
> database level then we have to make sure it won't break *any*
> applications, and that means the most conservative validity assumptions.
> (Thus all the angst about how to invalidate cache entries on-the-fly.)
> 
> Likewise, the application has a much better handle than the database on
> the issue of which query results are likely to be worth caching.
> 
> I think that reports of "we sped up this application X times by caching
> query results on the client side" are interesting, but they are not good
> guides to what would happen if we tried to put a query-result cache into
> the database.

I would like to respectfully differ with you here. If query results are cached
in an ACID-safe way, many things could be improved.

The problem with applications caching is that they do not have intimate
knowledge of the database, and thus do not know when their cache is invalid. On
top of that, many web sites have multiple web servers connected to a single
database. The caching must sit between the web software and the DB. The logical
place for caching is in the database.
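To illustrate the point, here is a minimal sketch (hypothetical class and table names, using SQLite as a stand-in) of a result cache that lives with the database rather than with the application: because every write goes through the same layer, the cache can be flushed the moment a table changes, which a cache out in the web tier cannot see.

```python
import sqlite3

class CachingDB:
    """A toy database wrapper that caches read-query results."""

    def __init__(self):
        self.con = sqlite3.connect(":memory:")
        self.cache = {}  # SQL text -> cached result rows

    def query(self, sql):
        # Serve repeated reads from the cache.
        if sql not in self.cache:
            self.cache[sql] = self.con.execute(sql).fetchall()
        return self.cache[sql]

    def execute(self, sql):
        # Any write invalidates everything: the "most conservative
        # validity assumption", and only the database layer sees the write.
        self.cache.clear()
        self.con.execute(sql)

db = CachingDB()
db.execute("CREATE TABLE t (x INTEGER)")
db.execute("INSERT INTO t VALUES (1)")
db.query("SELECT x FROM t")   # computed and cached
db.query("SELECT x FROM t")   # served from the cache
db.execute("INSERT INTO t VALUES (2)")
db.query("SELECT x FROM t")   # cache was invalidated, so recomputed
```

A real implementation would invalidate per table rather than wholesale, but even this crude version stays correct under writes, which is exactly what an application-side cache spread across multiple web servers cannot guarantee.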

If we went even further and cached multiple levels of a query, i.e. the result
of a sub-select within the whole query, then views and more complex queries
could see a performance increase as well.

Take this query:

select * from (select * from T1 where field = 'fubar') as Z
  right outer join
    (select alt from T2,
            (select * from T1 where field = 'fubar') as X
     where T2.key = X.key) as Y
  on (Z.key = Y.alt)
where Z.key is NULL;


Forgive this query; it is probably not quite right, and the actual query it is
intended to represent is quite a bit larger. The intention is to select a set
of alternate values based on a set of initial values, while eliminating any
alternate values that are also in the initial set. Either way, we have to run
"select * from T1 where field = 'fubar'" twice.

If that subselect could be cached, it could speed up the query a bit. Right now
I use a temp table, which is a hassle.
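For the record, the temp-table workaround looks roughly like this (hypothetical schema: T1 with key/field, T2 with key/alt; sketched with SQLite for illustration). The repeated sub-select is materialized once and then referenced twice:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
    CREATE TABLE T1 (key INTEGER, field TEXT);
    CREATE TABLE T2 (key INTEGER, alt INTEGER);
    INSERT INTO T1 VALUES (1, 'fubar'), (2, 'fubar'), (3, 'other');
    INSERT INTO T2 VALUES (1, 10), (2, 2), (3, 30);
""")

# Materialize the repeated sub-select once instead of evaluating it twice.
cur.execute("CREATE TEMP TABLE X AS SELECT * FROM T1 WHERE field = 'fubar'")

# Alternates for the initial set, excluding alternates that are themselves
# part of the initial set -- both references now hit the temp table.
rows = cur.execute("""
    SELECT T2.alt FROM T2
    JOIN X ON T2.key = X.key
    WHERE T2.alt NOT IN (SELECT key FROM X)
""").fetchall()
```

The hassle is that the temp table has to be created, used, and dropped by hand on every request; a sub-select cache inside the database would do the same materialization transparently.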

Caching results can and does speed up duplicate queries; there can really be no
argument about that. The argument is about the usefulness of the feature and the
cost of implementing it. If maintaining the cache costs more than the benefit
of having it, it is obviously a loser. If implementing it takes up the
biological CPU cycles of the development team that would be better spent on more
important things, then it is also a loser. If, however, it is relatively "easy"
(hehe) to do, and doesn't hurt performance greatly, is there any harm in
doing it?

