On Tue, Oct 15, 2013 at 10:26 AM, Julien Cigar <jcigar@ulb.ac.be> wrote:
for two reasons:
- some of the overhead of bgwriter and checkpoints is more or less linear in the size of shared_buffers, for example it could be possible that a large quantity of data could be dirty when a checkpoint occurs).
- the OS cache is also being used for reads and writes, the larger shared_buffers is, the more you risk double buffering (same blocks in the OS cache and in the database buffer cache).
Excellent. Thank you for the information. My suspicion has always been that the shared_buffers are "level 1 cache", so it seems to me that you'd want that to be large enough to hold your entire database if you could. However, I'm now realizing that I was _also_ assuming the IPC shared memory was also being locked via mlock to prevent swapping it out, and I'm now getting the feeling that this isn't true, which means the double buffering could lead to swap space use on buffer cache pressure...which I do occasionally see in ways I had not expected.
We do streaming replication and also store them for snapshot PITR backups, so I am intimately familiar with our write load, and I can say it is pretty low (we ship a 16MB WAL file about every 10-15 minutes during our busiest times).
That said, I can see how an import that is doing a bunch of writes could possibly spread those over a large area that would then consume a lot of CPU on the writer and checkpoint; however, I do not see how either of those would cause 40-60 different postgres backgroud processes (all running a normally "light query") to spin off into oblivion unless the write work load is somehow threaded into the background workers (which I'm sure it isn't). So, I think we're still dealing with a spinlock issue.
We're going to upgrade to 9.1.10 (with debug symbols) Thursday night and add another 64GB of RAM. I'll tune shared_buffers down to 2GB at that time and bump effective_cache_size up at the same time. My large history of sar data will make it apparent pretty quickly if that is a win/lose/tie.
If we have another spike in production, we'll be ready to measure it more accurately.