Caching (was Re: choosing the right platform) - Mailing list pgsql-performance

From: Tom Lane
Subject: Caching (was Re: choosing the right platform)
Msg-id: 11732.1049934037@sss.pgh.pa.us
In response to: Re: choosing the right platform ("Jim C. Nasby" <jim@nasby.net>)
Responses: Re: Caching (was Re: choosing the right platform)
List: pgsql-performance

"Jim C. Nasby" <jim@nasby.net> writes:
> That seems odd... shouldn't pgsql be able to cache information better
> since it would be cached in whatever format is best for it, rather than
> the raw page format (or maybe that is the best format). There's also the
> issue of having to go through more layers of software if you're relying
> on the OS caching. All the tuning info I've seen for every other
> database I've worked with specifically recommends giving the database as
> much memory as you possibly can, the theory being that it will do a much
> better job of caching than the OS will.

There are a number of reasons why that's a dubious policy for PG (I
won't take a position on whether these apply to other databases...).

One is that because we sit on top of the OS' filesystem, we can't
(portably) prevent the OS from caching blocks.  So it's quite easy to
get into a situation where the same data is cached twice, once in PG
buffers and once in kernel disk cache.  That's clearly a waste of RAM
however you slice it, and it's worst when you set the PG shared buffer
size to be about half of available RAM.  You can minimize the
duplication by skewing the allocation one way or the other: either set
PG's allocation relatively small, relying heavily on the OS to do the
caching; or make PG's allocation most of RAM and hope to squeeze out
the OS' cache.  There are partisans for both approaches on this list.
I lean towards the first policy because I think that starving the kernel
for RAM is a bad idea.  (Especially if you run on Linux, where this
policy tempts the kernel to start kill -9'ing random processes ...)
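
To put rough numbers on that (purely illustrative, and keep in mind that
shared_buffers is a count of 8K buffers, not bytes), the two policies
might look something like this in postgresql.conf on a box with 1GB of RAM:

    # skew towards the kernel cache (the policy I lean towards)
    shared_buffers = 4096        # 4096 * 8KB = 32MB in PG's own buffers

    # skew towards PG, hoping to squeeze out the kernel's cache
    shared_buffers = 98304       # 98304 * 8KB = 768MB in PG's own buffers

Either one wastes less RAM on double-caching than splitting the machine
down the middle does.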

Another reason is that PG uses a simplistic fixed-number-of-buffers
internal cache, and therefore it can't adapt on-the-fly to varying
memory pressure, whereas the kernel can and will give up disk cache
space to make room when it's needed for processes.  Since PG isn't
even aware of the total memory pressure on the system as a whole,
it couldn't do as good a job of trading off cache vs process workspace
as the kernel can do, even if we had a variable-size cache scheme.

A third reason is that on many (most?) Unixen, SysV shared memory is
subject to swapping, and the bigger you make the shared_buffer arena,
the more likely it gets that some of the arena will be touched seldom
enough to make it a candidate for swapping.  A disk buffer that gets
swapped to disk is worse than useless (if it's dirty, the swapping
is downright counterproductive, since an extra read and write cycle
will be needed before the data can make it to its rightful place).

PG is *not* any smarter about the usage patterns of its disk buffers
than the kernel is; it uses a simple LRU algorithm that is surely no
brighter than what the kernel uses.  (We have looked at smarter buffer
recycling rules, but failed to see any performance improvement.)  So the
notion that PG can do a better job of cache management than the kernel
is really illusory.  About the only advantage you gain from having data
directly in PG buffers rather than kernel buffers is saving the CPU
effort needed to move data across the userspace boundary --- which is
not zero, but it's sure a lot less than the time spent for actual I/O.

So my take on it is that you want shared_buffers fairly small, and let
the kernel do the bulk of the heavy lifting for disk cache.  That's what
it does for a living, so let it do what it does best.  You only want
shared_buffers big enough so you don't spend too many CPU cycles shoving
data back and forth between PG buffers and kernel disk cache.  The
default shared_buffers setting of 64 is surely too small :-(, but my
feeling is that values in the low thousands are enough to get past the
knee of that curve in most cases.
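
For instance (illustrative only, again counting 8K buffers):

    shared_buffers = 64          # the default: 64 * 8KB = 512KB
    shared_buffers = 2048        # "low thousands": 2048 * 8KB = 16MB

Beyond that knee, additional PG buffers mostly just duplicate data the
kernel is already caching for you.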

            regards, tom lane

