Re: Database Caching - Mailing list pgsql-hackers

From Greg Copeland
Subject Re: Database Caching
Msg-id 3C80096C.8050302@copelandconsulting.net
In response to Database Caching  ("Greg Sabino Mullane" <greg@turnstep.com>)
List pgsql-hackers
I'm sneaking out of my cave here.. ;)

mlw wrote:
> Tom Lane wrote:
> 
>>mlw <markw@mohawksoft.com> writes:
>>
>>>My previous company, www.dmn.com, has a music database system. We logged all
>>>the backend info; most of the queries were duplicated many times. This can be
>>>explained by multiple users interested in the same thing, or the same user
>>>hitting "next page".
>>>
>>>If you could cache the "next page" or similar hit results, you could really
>>>increase the throughput and capacity of a website.

Well, I'm going to assume that the records in question have been 
completely cached in cases like this.  So isn't the primary source of 
improvement the query parse and plan generation?  As Tom seems to think, 
wouldn't it make more sense to optimize parse/plan generation than to 
cache the result set?  After all, if the plan can be pinned, how much of 
a performance boost do you expect from processing a cached plan versus 
returning a cached result set?
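
To make the contrast concrete, here's a rough sketch of what reusing a
parse/plan (rather than a result set) looks like from the client side.
The table, column, and statement names are invented for illustration:

```sql
-- Pay the parse/plan cost once...
PREPARE next_page (integer) AS
    SELECT * FROM t1 ORDER BY t1.key LIMIT 20 OFFSET $1;

-- ...then reuse the plan for each "next page" hit.  The rows are still
-- fetched every time; only the parse/plan work is saved.
EXECUTE next_page(0);
EXECUTE next_page(20);
```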

Seriously, I am curious what the expected return is.  Is it still a 
multiple, or just some minor percentage?


>>>
>>Sure, but the most appropriate place to do that sort of thing is in the
>>application (in this case, probably a cgi/php-ish layer).  Only the
>>application can know what its requirements are.  In the case you
>>describe, it'd be perfectly okay for a "stale" cache result to be
>>delivered that's a few minutes out of date.  Maybe a few hours out of
>>date would be good enough too, or maybe not.  But if we do this at the
>>database level then we have to make sure it won't break *any*
>>applications, and that means the most conservative validity assumptions.
>>(Thus all the angst about how to invalidate cache entries on-the-fly.)
>>
>>Likewise, the application has a much better handle than the database on
>>the issue of which query results are likely to be worth caching.
>>
>>I think that reports of "we sped up this application X times by caching
>>query results on the client side" are interesting, but they are not good
>>guides to what would happen if we tried to put a query-result cache into
>>the database.
>>
> 
> I would like to respectfully differ with you here. If query results are cached
> in an ACID safe way, then many things could be improved.
> 
> The problem with applications caching is that they do not have intimate
> knowledge of the database, and thus do not know when their cache is invalid. On
> top of that, many web sites have multiple web servers connected to a single
> database. The caching must sit between the web software and the DB. The logical
> place for caching is in the database.
> 


But hybrid application cache designs can mostly, if not completely, 
address this, and also get some added benefits in many cases.  If you 
have a "cache" table which records the tables involved in the cached 
results you care about, you can update its state via triggers (or even 
exposed procedures) to reflect whether the client-side cache has been 
invalidated.  This means a client need only query the cache table first 
to determine whether its cache is clean or dirty.  When it's dirty, it 
merely needs to run the query again.
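
As a rough sketch of what that might look like (all names here are
hypothetical, not an actual implementation):

```sql
-- One row per base table, bumped whenever that table changes.
CREATE TABLE cache_status (
    table_name  text PRIMARY KEY,
    version     integer NOT NULL DEFAULT 0
);

CREATE FUNCTION bump_cache_version() RETURNS trigger AS $$
BEGIN
    UPDATE cache_status SET version = version + 1
     WHERE table_name = TG_TABLE_NAME;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER t1_cache_invalidate
    AFTER INSERT OR UPDATE OR DELETE ON t1
    FOR EACH STATEMENT EXECUTE PROCEDURE bump_cache_version();

-- A client remembers the version it cached against, and re-runs the
-- expensive query only when the version has moved:
--   SELECT version FROM cache_status WHERE table_name = 't1';
```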

Let's also not forget that client-side caching can yield significant 
networking performance improvements over a result set that is cached on 
the server.  Why?  Well, say a query has a result set of 10,000 rows 
which are cached on the server.  A remote client queries and fetches 
10,000 results over the network.  Even though the result set is cached 
by the database, it is still transferred over the wire for each and 
every query.  Now assume 10 people in all perform this same query.  
That's 100,000 rows transferred across the wire.  With the client-side 
caching scheme, you have 10,010 rows (the initial 10,000-row result set 
plus a single-row result per query which indicates the status of the 
cache) returned across the wire, telling each client whether its cache 
is clean or dirty.
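
Spelling the arithmetic out (purely illustrative):

```sql
-- 10 clients each fetching the full 10,000-row result, vs. one full
-- fetch plus a one-row cache-status check per query:
SELECT 10 * 10000     AS rows_server_cached_only,  -- 100000
       10000 + 10 * 1 AS rows_with_client_cache;   --  10010
```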

Let's face it: for the cache to make sense, the same result set needs 
to be used over and over again.  In those real-world situations, a 
strong client-side hybrid caching scheme would seem to win most of the 
time.

I'd also like to toss out that I'd expect a trade-off somewhere between 
the general data cache and a result-set cache.  On systems without 
infinite memory, where's the line of delineation?  It seems you may end 
up limiting the size of the general cache at the expense of the cached 
result sets.  If that happens, returning a cached result set may be 
faster, but refreshing that result set may be hindered, as may every 
other query on the system.


> If we went even further and cached multiple levels of query, i.e. the result
> of the sub-select within the whole query, then things like views and more
> complex queries could get an increase in performance.
> 
> Take this query:
> 
> select * from (select * from T1 where field = 'fubar') as Z right outer join
>     (select alt from T2,  (select * from T1 where field = 'fubar') as X where
> T2.key = X.key) as Y 
>     on T3.key = Y.key) on (Z.key = Y.alt) where Z.key = NULL;
> 
> 
> Forgive this query, it is probably completely wrong, the actual query it is
> intended to represent is quite a bit larger. The intention is to select a set
> of alternate values based on a set of initial values, but also eliminating any
> alternate values which may also be in the initial set. Anyway, we have to query
> "Select * from T1 where field = 'fubar'" twice.
> 
> If that subselect could be cached, it could speed up the query a bit. Right now
> I use a temp table, which is a hassle.
> 
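
For what it's worth, the temp-table workaround might be sketched like
this (table and column names are just placeholders taken from the query
above):

```sql
-- Materialize the repeated subselect once per session...
CREATE TEMP TABLE x AS
    SELECT * FROM t1 WHERE field = 'fubar';

-- ...then reference "x" everywhere the original query repeated
-- "(SELECT * FROM T1 WHERE field = 'fubar')".  The extra setup and
-- cleanup per session is the hassle being complained about.
DROP TABLE x;
```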

It's funny you say that, because I was thinking that should server-side 
result set caching truly be desired, wouldn't the use of triggers, a 
procedure as the client interface, and a temporary table be a poor 
man's implementation yielding almost the same results?  Though I must 
say I'm assuming that the queries will *be* the same, not merely 
*nearly* the same.  But in the web world, isn't this really the 
situation we're trying to address?  That is, "give me the front page"?


> Caching results can and does speed up duplicate queries; there can really be no
> argument about it. The argument is about the usefulness of the feature and the
> cost of implementing it. If maintaining the cache costs more than the benefit
> of having it, obviously it is a loser. If implementing it takes up the
> biological CPU cycles of the development team that would be spent doing more
> important things, then it is also a loser. If, however, it is relatively "easy"
> (hehe) to do, and doesn't affect performance greatly, is there any harm in
> doing so?
> 


Are any of the ideas that I put forth a viable substitute?

Greg


