Re: Is There Any Way .... - Mailing list pgsql-performance

From Kevin Grittner
Subject Re: Is There Any Way ....
Date
Msg-id s3432970.018@gwmta.wicourts.gov
In response to Is There Any Way ....  ("Lane Van Ingen" <lvaningen@esncc.com>)
List pgsql-performance
First off, Mr. Trainor's response proves nothing about anyone or
anything except Mr. Trainor.

I'm going to offer an opinion on the caching topic.  I don't have
any benchmarks; I'm offering a general sense of the issue based on
decades of experience, so I'll give a short summary of that.

I've been earning my living by working with computers since 1972,
and am the architect and primary author of a little-known
database product (developed in 1984) which saw tens of thousands
of installations in various vertical markets.  (Last I checked, a
couple of years ago, it was still being used statewide by one state
government after a multi-million-dollar attempt to replace it with a
popular commercial database product failed.)  I've installed and
tuned many other database products over the years.  I'm just getting
to know PostgreSQL, and am pretty excited about it.

Now on to the meat of it.

My experience is that a DBMS can improve performance by caching
certain types of data.  In the product I developed, we had a fairly
small cache which used a weighting algorithm for what to keep
(rather than simply relying on LRU).  Index pages got higher weight
than data pages; the higher in the index, the higher the weight.
Recent access got higher weight than older access, although it took
quite a while for the older access to age out entirely. This
improved performance quite a bit over a generalized caching
product alone.
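
To make that concrete, here is a minimal sketch of the idea in
Python.  The weights, decay factor, and structure are invented for
illustration; this is not the original implementation.

def base_weight(page_type, index_level=0):
    # Static weight: data pages lowest; index pages higher, and the
    # higher in the index tree, the higher the weight.
    if page_type == "data":
        return 1.0
    return 2.0 + index_level   # leaf level = 2.0, root level highest

class WeightedCache:
    def __init__(self, capacity, decay=0.99):
        self.capacity = capacity
        self.decay = decay    # near 1.0, so old access ages out slowly
        self.pages = {}       # page_id -> page contents
        self.scores = {}      # page_id -> current weight

    def access(self, page_id, contents, page_type, index_level=0):
        if page_id not in self.pages and len(self.pages) >= self.capacity:
            self._evict()
        self.pages[page_id] = contents
        # Each touch adds the page's static weight to its score, so
        # hot, high-level index pages accumulate weight fastest.
        self.scores[page_id] = (self.scores.get(page_id, 0.0)
                                + base_weight(page_type, index_level))

    def age(self):
        # Called periodically: decay every score, so recent access
        # counts most but old popularity fades only gradually.
        for page_id in self.scores:
            self.scores[page_id] *= self.decay

    def _evict(self):
        # Discard the page with the lowest combined weight, rather
        # than simply the least recently used page.
        victim = min(self.scores, key=self.scores.get)
        del self.pages[victim]
        del self.scores[victim]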

However, there was a point of diminishing return.  My sense is that
every 10% you add to a "smart" cache yields less benefit at a
higher cost, so beyond a certain point, taking RAM from the general
cache to expand the smart cache degrades performance.  Clever
programming techniques can shift the break-even point, but I very
much doubt it can be eliminated entirely, unless the ratio of
performance between CPU+RAM and persistent storage is much
more extreme than I've ever seen.
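
The shape of that argument can be shown with a toy model.  Every
number below is invented for illustration; the only point is that
the optimum split lands well short of giving all the RAM to the
smart cache.

import math

RAM_MB = 1000.0               # total RAM budget to split
DISK_MS, RAM_MS = 8.0, 0.05   # assumed access times

def smart_hit_rate(mb):
    # Saturates quickly: most of the benefit is in the first chunk.
    return 0.5 * (1.0 - math.exp(-mb / 50.0))

def general_hit_rate(mb):
    # Grows more slowly, but keeps growing much longer.
    return 0.9 * (1.0 - math.exp(-mb / 400.0))

def avg_ms(smart_mb):
    # Look in the smart cache first, then the general cache, then disk.
    p1 = smart_hit_rate(smart_mb)
    p2 = general_hit_rate(RAM_MB - smart_mb)
    return p1 * RAM_MS + (1 - p1) * (p2 * RAM_MS + (1 - p2) * DISK_MS)

for smart_mb in range(0, 1001, 100):
    print("%4d MB smart cache -> %.3f ms average access"
          % (smart_mb, avg_ms(smart_mb)))

With these curves the average access time bottoms out with only a
modest slice of RAM in the smart cache and climbs again past that
point -- the break-even the paragraph above describes.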

There is another issue, which has been raised from time to time in
these lists, but not enunciated clearly enough in my view.  These
discussions about caching generally address maximum throughput,
while there are times when it is important that certain tables can
be queried very quickly, even if it hurts overall throughput.  As an
example, there can be tables which are accessed as a user types in
a window and tabs around from one GUI control to another.  The
user's perception of the application's performance is going to depend
PRIMARILY on how quickly the GUI renders the results of these
queries; if the run time of a large report goes up by 10%, they
will probably not notice.  This is a situation where removing RAM
from a generalized cache, or one based on database internals, to
create an "application specific" cache can yield big rewards.
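
A minimal sketch of that application-side approach, using psycopg2;
the table and column names are invented:

import psycopg2

LOOKUP_TABLES = ["state_codes", "product_types", "status_codes"]

def preload(conn):
    # Read each small, hot lookup table into a plain dict at startup.
    cache = {}
    cur = conn.cursor()
    for table in LOOKUP_TABLES:
        cur.execute("SELECT code, description FROM %s" % table)
        cache[table] = dict(cur.fetchall())
    cur.close()
    return cache

conn = psycopg2.connect("dbname=appdb")
lookups = preload(conn)

# From here on the GUI resolves codes locally, so tabbing from one
# control to the next never waits on the server, however loaded it is.
print(lookups["state_codes"].get("WI", "unknown"))

Note that there is no eviction and no access tracking at all -- the
"relaxed LRU" idea below, taken to its limit.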

One client has addressed this in a commercial product by defining
a named cache large enough to hold these tables, and binding those
tables to the cache.  One side benefit is that such caches can be
defined as "relaxed LRU" -- meaning that they eliminate the
overhead of tracking accesses, since they can assume that data will
rarely, if ever, be discarded from the cache.

It seems to me that in the PostgreSQL world, this would currently
be addressed by binding the tables to a tablespace where the file
system, controller, or drive(s) would cache the data, although this
is somewhat less flexible than the "named cache" approach -- unless
there is a file system that can layer a cache on top of a reference to
some other file system's space.  (And let's not forget the many OS
environments in which people use PostgreSQL.)  So I do see that
there would be benefit to adding a feature to PostgreSQL to define
caches and bind tables or indexes to them.
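
That workaround needs nothing beyond stock commands.  A minimal
sketch via psycopg2, with the path and object names invented:

import psycopg2

conn = psycopg2.connect("dbname=appdb user=postgres")
conn.autocommit = True   # CREATE TABLESPACE cannot run in a transaction
cur = conn.cursor()

# The directory must already exist, be empty, and be owned by the
# postgres user; here it sits on a fast, aggressively cached volume.
cur.execute("CREATE TABLESPACE fast_lookup LOCATION '/mnt/fastdisk/pg'")

# Bind the hot table to that tablespace; whatever caching the file
# system, controller, or drive behind it provides now covers the table.
cur.execute("ALTER TABLE state_codes SET TABLESPACE fast_lookup")

cur.close()
conn.close()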

So I do think that it is SMART of PostgreSQL to rely on the
increasingly sophisticated file systems to provide the MAIN cache.
I suspect that a couple of types of smaller "smart" caches in front of
this could boost performance, and it might be a significant boost.
I'm not sure what the current shared memory is used for; perhaps
this is already caching specific types of structures for the DBMS.
I'm pretty sure that programmers of GUI apps would appreciate the
named cache feature, so they could tune the database for snappy
GUI response, even under heavy load.

I realize this is short on specifics -- I'm shooting for perspective.
For the record, I don't consider myself particularly religious on the
topic, but I do pull back a little at arguments which sound strictly
academic -- I've found that most of what I've drawn from those
circles has needed adjustment in solving real-world problems.
(Particularly when algorithms optimize for best worst-case
performance.  I've found users are much happier with best
typical-case performance, as long as the product of worst-case
performance and worst-case frequency is low.)

Like many others who have posted on the topic, I am quite
prepared to alter my views in the face of relevant evidence.

Feel free to laugh at the old fart who decided to sip his Bushmill's
while reading through this thread and try to run with the young lions.
As someone else recently requested, though, please don't point while
you laugh -- that's just rude.   :-)

-Kevin


>>> Ron Peacetree <rjpeace@earthlink.net> 10/04/05 10:06 PM >>>
Unfortunately, no matter what I say or do, I'm not going to please
or convince anyone who has already made up their mind to the
extent that they post comments like Mr. Trainor's below.
His response style pretty much proves my earlier point that this
is presently a religious issue within the pg community.

