Thread: PG 8.3 and large shared buffer settings
Is there any practical limit to the number of shared buffers PG 8.3.7 can handle before more becomes counter-productive? I remember the buffer management algorithm used to get unhappy with too many buffers, and past a certain point performance dropped with extra memory pitched at Postgres.

My production DB's around 200G, and the box hosting it has 192G of memory on it, running a 64 bit AIX build of 8.3.7. I'm currently running with 15G of shared buffers, and while performance is pretty good, things still hit the disk more than I'd like. I can easily bump the shared buffer setting up to 40G or so without affecting anything else that matters.

The box runs other things as well as the database, so the OS buffer cache tends to get effectively flushed -- permanently pinning more of the database in memory would be an overall win for DB performance, assuming bad things don't happen because of buffer management. (Unfortunately I've only got a limited window to bounce the server, so I can't do too much in the way of experimentation with buffer sizing)

--
Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
dan@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk
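For context, a minimal sketch of checking and changing the setting under discussion. The 40GB figure is just the number floated above, not a recommendation; in 8.3 changing shared_buffers requires a server restart.

    -- What the running server is using right now:
    SHOW shared_buffers;
    SELECT name, setting, unit FROM pg_settings WHERE name = 'shared_buffers';

    -- The change itself goes in postgresql.conf (8.3 has no ALTER SYSTEM;
    -- a full restart is needed, and the kernel's shared memory limits may
    -- also have to be raised to match):
    --     shared_buffers = 40GB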
Dan Sugalski <dan@sidhe.org> writes:
> Is there any practical limit to the number of shared buffers PG 8.3.7
> can handle before more becomes counter-productive?

Probably, but I've not heard any definitive measurements showing an upper limit. The traditional wisdom of limiting it to 1G or so dates from before the last rounds of revisions to the bufmgr logic.

> My production DB's around 200G, and the box hosting it has 192G of
> memory on it, running a 64 bit AIX build of 8.3.7.

Yowza. You might be able to do measurements that no one has done before. Let us know what you find out.

BTW, does AIX have any provision for locking shared memory into RAM? One of the gotchas for large shared memory has always been the risk that the kernel would decide to swap some of it out.

			regards, tom lane
At 12:36 AM -0400 9/25/09, Tom Lane wrote:
>Dan Sugalski <dan@sidhe.org> writes:
>> Is there any practical limit to the number of shared buffers PG 8.3.7
>> can handle before more becomes counter-productive?
>
>Probably, but I've not heard any definitive measurements showing an
>upper limit. The traditional wisdom of limiting it to 1G or so dates
>from before the last rounds of revisions to the bufmgr logic.

Excellent.

>> My production DB's around 200G, and the box hosting it has 192G of
>> memory on it, running a 64 bit AIX build of 8.3.7.
>
>Yowza. You might be able to do measurements that no one has done
>before. Let us know what you find out.

:) It's a machine of non-trivial size, to be sure. I'll give the buffer setting a good bump and see how it goes. I may be able to take one of the slony replicas off-line the next holiday and run some performance tests, but that won't be for a while.

>BTW, does AIX have any provision for locking shared memory into RAM?
>One of the gotchas for large shared memory has always been the risk
>that the kernel would decide to swap some of it out.

I'll have to go check, but I think it does. This box hasn't actually hit swap since it started -- a good chunk of that RAM is used as semi-permanent disk cache but unfortunately the regular day-to-day use of this box (they won't let me have it as a dedicated DB-only machine. Go figure :) doing other stuff the cache tends to turn over pretty quickly.

--
Dan

--------------------------------------it's like this-------------------
Dan Sugalski                          even samurai
dan@sidhe.org                         have teddy bears and even
                                      teddy bears get drunk
* Dan Sugalski <dan@sidhe.org> [090925 06:06]:

> I'll have to go check, but I think it does. This box hasn't actually hit
> swap since it started -- a good chunk of that RAM is used as
> semi-permanent disk cache but unfortunately the regular day-to-day use of
> this box (they won't let me have it as a dedicated DB-only machine. Go
> figure :) doing other stuff the cache tends to turn over pretty quickly.

All the more reason to find a way to use it all as shared buffers and lock it into ram...

Oh, sorry, you expect the DB to play nice with everything else?

;-)

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
That won't work well anyway because the postgres shared_buffers does not cache things that are sequentially scanned (it uses a ring buffer for each scan). So, for any data that is only accessed by sequential scan, you're relying on the OS and the disks. If you access a table via index scan though, all its pages will go through shared_buffers.

Size shared_buffers to no more than the 'hot' space of index and randomly accessed data.
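One hedged way to gauge that 'hot' set on a running 8.3 server, assuming the contrib module pg_buffercache is installed; the usage-count cutoff of 3 below is just an illustrative threshold, not anything suggested in this thread.

    -- Which relations currently occupy shared_buffers, and roughly how much
    -- of each looks "hot" (usagecount >= 3 is an arbitrary cutoff).
    SELECT c.relname,
           count(*) AS buffers,
           sum(CASE WHEN b.usagecount >= 3 THEN 1 ELSE 0 END) AS hot_buffers
    FROM pg_buffercache b
         JOIN pg_class c
           ON b.relfilenode = c.relfilenode
          AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                                    WHERE datname = current_database()))
    GROUP BY c.relname
    ORDER BY buffers DESC
    LIMIT 20;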
* Scott Carey <scott@richrelevance.com> [090925 11:57]:

> That won't work well anyway because the postgres shared_buffers does not
> cache things that are sequentially scanned (it uses a ring buffer for each
> scan). So, for any data that is only accessed by sequential scan, you're
> relying on the OS and the disks. If you access a table via index scan
> though, all its pages will go through shared_buffers.

In older versions too, or only since synchronized scans got in?

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.
On Fri, Sep 25, 2009 at 8:53 AM, Scott Carey <scott@richrelevance.com> wrote:
> That won't work well anyway because the postgres shared_buffers does not
> cache things that are sequentially scanned (it uses a ring buffer for each
> scan). So, for any data that is only accessed by sequential scan, you're
> relying on the OS and the disks. If you access a table via index scan
> though, all its pages will go through shared_buffers.

Does it do this even if the block was already in shared_buffers? That seems like a serious no-no to me, to read the same block into different buffers. I thought that the sequential scan would have to break stride when it encountered a block already in the buffer cache. But I haven't looked at the code; maybe I am over-analogizing to other software I'm familiar with.

Jeff
On Fri, 25 Sep 2009, Jeff Janes wrote:

> Does it do this even if the block was already in shared_buffers?

Usually not. The buffer ring algorithm is used to manage pages that are read in specifically to satisfy a sequential scan (there's a slightly different ring method used for VACUUM too). If the buffer you need is already available and not "pinned" (locked by someone else), it's not read from disk again. Instead, its usage count is incremented only if it's at zero (this doesn't count as a use unless it's about to be evicted as unused), and it's returned without being added to the ring.

There's a section about this ("Buffer Ring Replacement Strategy") in the source code:
http://git.postgresql.org/gitweb?p=postgresql.git;a=blob_plain;f=src/backend/storage/buffer/README;hb=HEAD

The commit that added the feature is at
http://git.postgresql.org/gitweb?p=postgresql.git;a=commit;h=ebf3d5b66360823edbdf5ac4f9a119506fccd4c0

The basic flow of this code is that backends ask for buffers using BufferAlloc, which calls StrategyGetBuffer (where the ring list is managed) only if it doesn't first find the page in the buffer cache. You get what you'd hope for here: a sequential scan will use blocks when they're already available in the cache, while reading the less popular blocks that weren't cached into the temporary ring area.

There's always the OS cache backing the PostgreSQL one to handle cases where the working set you're using is just a bit larger than shared_buffers. The ring read requests may very well be satisfied by that too if there was a recent sequential scan the OS is still caching.

You can read a high-level summary of the algorithm used for ring management (with an intro to buffer management in general) in my "Inside the PostgreSQL Buffer Cache" presentation at http://www.westnet.com/~gsmith/content/postgresql/ on P10, "Optimizations for problem areas". That doesn't specifically cover the "what if it's in the cache already?" case though.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Thu, 24 Sep 2009, Dan Sugalski wrote:

> Is there any practical limit to the number of shared buffers PG 8.3.7 can
> handle before more becomes counter-productive?

There are actually two distinct questions here you should consider, because the popular wisdom here and what makes sense for your case might be different.

The biggest shared_buffers tests I've seen come from Sun, where Jignesh there saw around 10GB was the largest amount of RAM you could give to the database before it stopped improving performance. As you guessed, there is a certain amount of overhead to managing the buffers involved, and as the size grows the chance you'll run into locking issues or similar resource contention grows too.

Another problem spot is checkpoints. If you dirty a very large buffer cache, that whole thing will have to get dumped to disk eventually, and on some workloads people have found they have to reduce shared_buffers specifically to keep this from being too painful.

That's not answering your question though; what it answers is "how large can shared_buffers get before it's counterproductive compared with giving the memory to the OS to manage?" The basic design of PostgreSQL presumes that the OS buffer cache exists as a second-chance source for cached buffers. The OS cache tends to be optimized to handle large numbers of buffers well, but without very much memory of what's been used recently to optimize allocations and evictions. The symmetry there is one reason why shared_buffers shouldn't be most of the RAM on your system; splitting things up so that PG has a cut and the OS has at least as large a space of its own lets the two cache management schemes complement each other.

> The box runs other things as well as the database, so the OS buffer cache
> tends to get effectively flushed -- permanently pinning more of the database
> in memory would be an overall win for DB performance, assuming bad things
> don't happen because of buffer management.

This means that the question you want an answer to is "if the OS cache isn't really available, where does giving memory to shared_buffers become less efficient than not caching things at all?" My guess is that this number is much larger than 10GB, but I don't think anyone has done any tests to try to quantify exactly where it is. Typically when people are talking about systems as large as yours, they're dedicated database servers at that point, so the OS cache gets considered at the same time. If it's effectively out of the picture, the spot where caching still helps even when it's somewhat inefficient due to buffer contention isn't well explored.

It would depend on the app too. If you're heavily balanced toward reads that don't need locks, you can certainly support a larger shared_buffers than someone who is writing a lot (just the checkpoint impact alone makes this true, and there are other sources of locking contention).

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Sat, 26 Sep 2009, Greg Smith wrote:

> On Fri, 25 Sep 2009, Jeff Janes wrote:
>
>> Does it do this even if the block was already in shared_buffers?
>
> Usually not. The buffer ring algorithm is used to manage pages that are read
> in specifically to satisfy a sequential scan (there's a slightly different
> ring method used for VACUUM too). If the buffer you need is already
> available and not "pinned" (locked by someone else), it's not read from disk
> again. Instead, its usage count is incremented only if it's at zero (this
> doesn't count as a use unless it's about to be evicted as unused), and it's
> returned without being added to the ring.

Hello Greg,

What happens when a postmaster dies (e.g. core dump, kill -9, etc.)? How is the reference counting cleaned up and the lock removed?

Thnx.

Ciao,
Gerhard

--
http://www.wiesinger.com/
>> Is there any practical limit to the number of shared buffers PG 8.3.7
>> can handle before more becomes counter-productive?

It is more efficient to have the page in shared buffers, rather than doing a context switch to the OS, copying the entire page from the OS's cache into shared buffers, and coming back to postgres. Shared buffers use less CPU. However, this is totally negligible versus the disk wait time of an uncached IO.

The same page may be cached once in shared_buffers, and once in the OS cache, so if your shared buffers is half your RAM and the other half is disk cache, perhaps it won't be optimal: if stuff is cached twice, you can cache half as much stuff. If your entire database can fit in shared buffers, good for you though.

But then a checkpoint comes, and postgres will write all dirty buffers to disk in the order it finds them in Shared Buffers, which might be totally different from the on-disk order. If you have enough OS cache left to absorb these writes, the OS will reorder them. If not, lots of random writes are going to occur. On a RAID5 this can be a lot of fun.
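If pg_buffercache is installed (see the earlier query), a rough way to see how much a checkpoint would currently have to flush out of shared_buffers; this is only a sketch, and the count says nothing about what's additionally sitting dirty in the OS cache.

    -- Dirty buffers currently sitting in shared_buffers (each is 8kB by default).
    SELECT count(*) AS dirty_buffers
    FROM pg_buffercache
    WHERE isdirty;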
On Sat, Sep 26, 2009 at 9:57 AM, Gerhard Wiesinger <lists@wiesinger.com> wrote:
> On Sat, 26 Sep 2009, Greg Smith wrote:
>
>> On Fri, 25 Sep 2009, Jeff Janes wrote:
>>
>>> Does it do this even if the block was already in shared_buffers?
>>
>> Usually not. The buffer ring algorithm is used to manage pages that are
>> read in specifically to satisfy a sequential scan (there's a slightly
>> different ring method used for VACUUM too). If the buffer you need is
>> already available and not "pinned" (locked by someone else), it's not read
>> from disk again. Instead, its usage count is incremented only if it's at
>> zero (this doesn't count as a use unless it's about to be evicted as
>> unused), and it's returned without being added to the ring.
>
> Hello Greg,
>
> What happens when a postmaster dies (e.g. core dump, kill -9, etc.). How is
> reference counting cleaned up and the lock removed?

If a backend dies in disgrace, the master detects this and the whole cluster is taken down and brought back up.

WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and repeat your command.

(The DETAIL is technically accurate, but somewhat misleading. If the crash of another backend happens while your backend is waiting on the commit record WAL fsync to return, then while the postmaster may have commanded your session to roll back, it is too late to actually do so, and when the server comes back up and finishes recovery, you will probably find that your transaction has indeed committed, assuming you have some way to accurately deduce this.)

Jeff
On Sat, Sep 26, 2009 at 8:19 AM, Greg Smith <gsmith@gregsmith.com> wrote:
>
> Another problem spot is checkpoints. If you dirty a very large buffer
> cache, that whole thing will have to get dumped to disk eventually, and on
> some workloads people have found they have to reduce shared_buffers
> specifically to keep this from being too painful.

Hi Greg,

Is this the case even if checkpoint_completion_target is set close to 1.0? If you dirty buffers fast enough to dirty most of a huge shared_buffers area between checkpoints, then it seems like lowering the shared_buffers wouldn't reduce the amount of I/O needed; it would just shift the I/O from checkpoints to the backends themselves.

It looks like checkpoint_completion_target was introduced in 8.3.0.

Cheers,

Jeff
On Sat, 26 Sep 2009, Jeff Janes wrote:

> On Sat, Sep 26, 2009 at 8:19 AM, Greg Smith <gsmith@gregsmith.com> wrote:
>>
>> Another problem spot is checkpoints. If you dirty a very large buffer
>> cache, that whole thing will have to get dumped to disk eventually, and on
>> some workloads people have found they have to reduce shared_buffers
>> specifically to keep this from being too painful.
>
> Is this the case even if checkpoint_completion_target is set close to 1.0?

Sure. checkpoint_completion_target aims to utilize more of the space between each checkpoint by spreading the writes out over more of that space, but it alone doesn't change the fact that the interval between checkpoints is only so long. By default, you're going to get one every five minutes, and on active systems they can come every few seconds if you're not aggressive with increasing checkpoint_segments.

Some quick math gives an idea of the scale of the problem. A single cheap disk can write random I/O (which checkpoint writes often are) at 1-2MB/s; let's call it 100MB/minute. That means that in 5 minutes, a single disk system might be hard pressed to write even 500MB of data out. But you can easily dirty 500MB in seconds nowadays. Now imagine shared_buffers is 40GB and you've dirtied a bunch of it; how long will that take to clear even on a fast RAID system? It won't be quick, and the whole system will grind to a halt at the end of the checkpoint as all the buffered writes queued up are forced out.

> If you dirty buffers fast enough to dirty most of a huge shared_buffers
> area between checkpoints, then it seems like lowering the shared_buffers
> wouldn't reduce the amount of I/O needed, it would just shift the I/O
> from checkpoints to the backends themselves.

What's even worse is that backends can be writing data and filling the OS buffer cache in between checkpoints too, but all of that is also forced to complete before the checkpoint can finish. You can easily start the checkpoint process with the whole OS cache filled with backend writes that will slow the checkpoint ones if you're not careful.

Because disks are slow, you need to get writes headed to disk as soon as feasible, so the OS has more time to work on them, reorder them for efficient writing, etc. Ultimately, the sooner you get I/O to the OS cache to write, the better, *unless* you're going to write that same block over again before it must go to disk. Normally you want buffers that aren't accessed often to get written out to disk early rather than linger until checkpoint time; there's nothing wrong with a backend doing a write if that block wasn't going to be used again soon. The ideal setup from a latency perspective is that you size shared_buffers just large enough to hold the things you write to regularly, but not so big that it caches every write.

> It looks like checkpoint_completion_target was introduced in 8.3.0

Correct. Before then, you had no hope for reducing checkpoint overhead but to use very small settings for shared_buffers, particularly if you cranked the old background writer up so that it wrote lots of redundant information too (that was the main result of "tuning" it on versions before 8.3 as well).

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
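A minimal sketch of checking the settings discussed above on a running 8.3 server; the values shown in the comments are purely illustrative, not recommendations from this thread.

    -- Current checkpoint-related settings:
    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('shared_buffers',
                   'checkpoint_segments',
                   'checkpoint_timeout',
                   'checkpoint_completion_target');

    -- Corresponding postgresql.conf lines (illustrative values only):
    --     checkpoint_segments          = 64     -- default is 3
    --     checkpoint_timeout           = 5min   -- the default interval mentioned above
    --     checkpoint_completion_target = 0.9    -- spread checkpoint writes out (8.3+)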
On 9/26/09 8:19 AM, Greg Smith wrote:
> This means that the question you want an answer to is "if the OS cache
> isn't really available, where does giving memory to shared_buffers
> become less efficient than not caching things at all?" My guess is
> that this number is much larger than 10GB, but I don't think anyone has
> done any tests to try to quantify exactly where it is. Typically when
> people are talking about systems as large as yours, they're dedicated
> database servers at that point, so the OS cache gets considered at the
> same time. If it's effectively out of the picture, the spot where
> caching still helps even when it's somewhat inefficient due to buffer
> contention isn't well explored.

It also depends on the filesystem. In testing at Sun and on this list, people have found that very large s_b (60% of RAM) plus directIO was actually a win on Solaris UFS, partly because UFS isn't very aggressive or smart about readahead and caching. On Linux/Ext3, however, it was never a win.

I don't know what AIX's filesystems are like.

--
Josh Berkus
PostgreSQL Experts Inc.
www.pgexperts.com
On Mon, 2009-09-28 at 10:36 -0700, Josh Berkus wrote:
> On 9/26/09 8:19 AM, Greg Smith wrote:
>> This means that the question you want an answer to is "if the OS cache
>> isn't really available, where does giving memory to shared_buffers
>> become less efficient than not caching things at all?" My guess is
>> that this number is much larger than 10GB, but I don't think anyone has
>> done any tests to try to quantify exactly where it is. Typically when
>> people are talking about systems as large as yours, they're dedicated
>> database servers at that point, so the OS cache gets considered at the
>> same time. If it's effectively out of the picture, the spot where
>> caching still helps even when it's somewhat inefficient due to buffer
>> contention isn't well explored.
>
> It also depends on the filesystem. In testing at Sun and on this list,
> people have found that very large s_b (60% of RAM) plus directIO was
> actually a win on Solaris UFS, partly because UFS isn't very aggressive
> or smart about readahead and caching. On Linux/Ext3, however, it was
> never a win.

Again, it depends. In my recent testing of a simple seqscan on a 1-int table that is entirely in cache (either syscache or shared buffers), the shared-buffers-only scan was 6% to 10% faster than when the relation was entirely in system cache and each page had to be switched in via a syscall / context switch. This was on Linux/Ext3, but I suspect it to be mostly independent of the file system.

Also, in ancient times, when I used Slony, and an early version of Slony at that, which did not drop and recreate indexes around the initial copy, the copy time could be 2 to 3 _times_ slower for large tables with lots of indexes when the indexes were in system cache vs. when they were in shared buffers (if I remember correctly, it was 1G shared buffers vs. 3G on a 4G machine). It was probably due to all kinds of index page splits etc. which shuffled index pages back and forth a lot between userspace and syscache.

So this is not an entirely read-only thing either.

--
Hannu Krosing   http://www.2ndQuadrant.com
PostgreSQL Scalability and Availability
   Services, Consulting and Training
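A minimal sketch of the kind of test described above, assuming a one-integer-column table and repeated timed sequential scans; the table name and row count are invented for illustration, and it does not by itself reproduce the shared_buffers-vs-syscache placement being compared.

    -- Hypothetical test table: 10 million rows of a single int column.
    CREATE TABLE seqscan_test AS
        SELECT g AS i FROM generate_series(1, 10000000) AS g;
    VACUUM ANALYZE seqscan_test;

    -- Warm the cache, then time repeated scans (in psql: \timing on).
    SELECT count(*) FROM seqscan_test;
    SELECT count(*) FROM seqscan_test;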