Thread: checkpoints/bgwriter tuning verification
On my primary DB I'm observing random slowness which just doesn't make
sense to me. The I/O system can easily do 40MB/sec writes, but I'm only
seeing a sustained 5MB/sec, even as the application is stalling waiting on
the DB.

My only guess is that I'm getting hit by checkpoints too often, and this is
causing some delays. I did a lot of reading and came across this most
excellent article by Greg Smith
(http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm).
Reading that and the Pg manual and looking at the statistics, here is what
I think I need to do. (Greg, do you have a performance tuning book? If so,
I really want to buy it! Your articles are awesome.)

Current settings for checkpoints:

checkpoint_segments = 32
checkpoint_timeout = 900
checkpoint_completion_target = 0.9
log_checkpoints = on

In my logs I see this pretty consistently during times of slowness:

Oct 29 14:17:38 d04 postgres[54902]: [10990-1] LOG: checkpoint starting: xlog
Oct 29 14:22:04 d04 postgres[54902]: [10991-1] LOG: checkpoint complete: wrote 52828 buffers (24.3%); 0 transaction log file(s) added, 0 removed, 32 recycled; write=265.881
Oct 29 14:22:57 d04 postgres[54902]: [10992-1] LOG: checkpoint starting: xlog
Oct 29 14:26:56 d04 postgres[54902]: [10993-1] LOG: checkpoint complete: wrote 52773 buffers (24.3%); 0 transaction log file(s) added, 0 removed, 32 recycled; write=234.846
Oct 29 14:28:32 d04 postgres[54902]: [10994-1] LOG: checkpoint starting: xlog
Oct 29 14:33:32 d04 postgres[54902]: [10995-1] LOG: checkpoint complete: wrote 53807 buffers (24.7%); 0 transaction log file(s) added, 0 removed, 32 recycled; write=299.170

Basically, the next checkpoint starts within a few seconds of the prior one
completing.

The stats show this:

# select * from pg_stat_bgwriter;
-[ RECORD 1 ]------+-----------
checkpoints_timed  | 3236
checkpoints_req    | 83044
buffers_checkpoint | 1376460896
buffers_clean      | 59124159
maxwritten_clean   | 304410
buffers_backend    | 285595787
buffers_alloc      | 6643047623

Based on Greg's article and the above numbers showing that most checkpoints
are triggered by running out of WAL segments, I should increase my
checkpoint_segments. Also, based on the article, I should increase the
bgwriter_lru_maxpages (currently at default 100).

My plan is to bump checkpoint_segments to 48, reduce
checkpoint_completion_target to 0.7, and bump bgwriter_lru_maxpages to 500.

Can the checkpoint operation actually cause the DB to stop responding for a
few seconds at a time? That seems to be what I observe. Sometimes for 5 or
more seconds one transaction will just stall.

Thanks for any ideas.
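For reference, a rough way to turn those cumulative counters into something
easier to read; the column names are just the ones from pg_stat_bgwriter
above, and the 8192-byte block size is an assumption based on the default
build:

select checkpoints_timed, checkpoints_req,
       round(100.0 * checkpoints_req
             / nullif(checkpoints_timed + checkpoints_req, 0), 1)
           as pct_requested,
       round(buffers_checkpoint * 8192.0 / 1024 / 1024
             / nullif(checkpoints_timed + checkpoints_req, 0), 1)
           as avg_mb_per_checkpoint
from pg_stat_bgwriter;

With the counters above, that works out to roughly 96% of checkpoints being
requested ("xlog") rather than timed, and an average of about 125MB written
per checkpoint over the life of the counters; the recent checkpoints in the
log are closer to 410MB each.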
On Thu, 2009-10-29 at 14:46 -0400, Vick Khera wrote:
> On my primary DB I'm observing random slowness which just doesn't make
> sense to me. The I/O system can easily do 40MB/sec writes, but I'm
> only seeing a sustained 5MB/sec, even as the application is stalling
> waiting on the DB.
>
> My only guess is that I'm getting hit by checkpoints too often, and
> this is causing some delays. I did a lot of reading and came across
> this most excellent article by Greg Smith
> (http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm).
> Reading that and the Pg manual and looking at the statistics, here is
> what I think I need to do. (Greg, do you have a performance tuning
> book? If so, I really want to buy it! Your articles are awesome.)
>
> current settings for checkpoints:
> checkpoint_segments = 32
> checkpoint_timeout = 900
> checkpoint_completion_target = 0.9
> log_checkpoints = on
>
> In my logs I see this pretty consistently during times of slowness:
>
> Oct 29 14:17:38 d04 postgres[54902]: [10990-1] LOG: checkpoint starting: xlog
> Oct 29 14:22:04 d04 postgres[54902]: [10991-1] LOG: checkpoint
> complete: wrote 52828 buffers (24.3%); 0 transaction log file(s)
> added, 0 removed, 32 recycled; write=265.881
> Oct 29 14:22:57 d04 postgres[54902]: [10992-1] LOG: checkpoint starting: xlog
> Oct 29 14:26:56 d04 postgres[54902]: [10993-1] LOG: checkpoint
> complete: wrote 52773 buffers (24.3%); 0 transaction log file(s)
> added, 0 removed, 32 recycled; write=234.846
> Oct 29 14:28:32 d04 postgres[54902]: [10994-1] LOG: checkpoint starting: xlog
> Oct 29 14:33:32 d04 postgres[54902]: [10995-1] LOG: checkpoint
> complete: wrote 53807 buffers (24.7%); 0 transaction log file(s)
> added, 0 removed, 32 recycled; write=299.170
>
> basically, the next checkpoint starts within a few seconds of the
> prior one completing.

That's most likely due to having checkpoint_completion_target set near the
maximum. You are telling it to stretch the checkpoint out as long as
possible.

> The stats show this:
>
> # select * from pg_stat_bgwriter;
> -[ RECORD 1 ]------+-----------
> checkpoints_timed | 3236
> checkpoints_req | 83044
> buffers_checkpoint | 1376460896
> buffers_clean | 59124159
> maxwritten_clean | 304410
> buffers_backend | 285595787
> buffers_alloc | 6643047623
>
> Based on Greg's article and the above numbers showing that most
> checkpoints are triggered by running out of WAL segments, I should
> increase my checkpoint_segments. Also, based on the article, I should
> increase the bgwriter_lru_maxpages (currently at default 100).
>
> My plan is to bump checkpoint_segments to 48 and reduce
> checkpoint_completion_target to 0.7,

That's what I would do. I think you're getting hit by checkpointing too
frequently or stretching the checkpoints out too long (or a combo of both).
I'd also keep bumping checkpoint_segments up until you are checkpointing on
the timeout. This will give you a longer window between checkpoints, which
gives more time to smooth the checkpoint out.

> and bump the
> bgwriter_lru_maxpages to 500.

I would probably start by increasing the number of WAL segments and
decreasing the checkpoint_completion_target and see if that solves the
problem. I wouldn't change bgwriter_lru_maxpages at the same time, and
wouldn't necessarily increase it by a factor of 5 when doing so. It does
look like you need to increase it, though.

> Can the checkpoint operation actually cause the DB to stop responding
> for a few seconds at a time? That seems to be what I observe.
> Sometimes for 5 or more seconds one transaction will just stall.

Absolutely. The smoothing is supposed to alleviate that, but I'm not sure
how well it works with aggressive, activity-driven checkpoints like you
have. Keep us posted, I'm curious.

> Thanks for any ideas.

--
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.
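A low-tech way to check whether a checkpoint_segments bump is enough to get
back to timed checkpoints is to snapshot pg_stat_bgwriter and diff it after
a while. This is only a sketch; the snapshot table name is made up and
anything works:

create table bgwriter_snapshot as
    select now() as snapped_at, * from pg_stat_bgwriter;

-- let the system run under normal load for an hour or so, then:

select c.checkpoints_timed - s.checkpoints_timed as timed_since_snapshot,
       c.checkpoints_req   - s.checkpoints_req   as req_since_snapshot,
       now() - s.snapped_at                      as elapsed
from pg_stat_bgwriter c, bgwriter_snapshot s;

drop table bgwriter_snapshot;

If the checkpoints since the snapshot are still almost all requested, either
keep raising checkpoint_segments or accept that the write volume is simply
too high for the current timeout.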
On Thu, Oct 29, 2009 at 12:46 PM, Vick Khera <vivek@khera.org> wrote:
> On my primary DB I'm observing random slowness which just doesn't make
> sense to me. The I/O system can easily do 40MB/sec writes, but I'm
> only seeing a sustained 5MB/sec, even as the application is stalling
> waiting on the DB.

Just one point on top of everything else you'll hear. 40 MB/sec sequential
throughput does not equal 40MB/sec random PLUS checkpoint throughput.
Random access is gonna lower that 40MB/sec way down real fast.

First step to speed things up is putting pg_xlog on its own disk(s). Since
pg_xlog is mostly sequentially accessed, it's much faster when it's not
competing with the rest of the db. THEN if you need faster disks you can
buy them and throw them at path/data/base/. However, often just splitting
things out like this will make a world of difference on a write-heavy
database.
On Thu, 29 Oct 2009, Vick Khera wrote:

> Greg, do you have a performance tuning book? If so, I really want to
> buy it! Your articles are awesome.

Give me a few more months...

> basically, the next checkpoint starts within a few seconds of the prior
> one completing.

That's the expected behavior with a high setting for the completion target.
Each checkpoint finishes just before the next one needs to start. That's
the most you can spread them out, and more spreading generally equals
better behavior. Usually; there are always exceptions, of course.

> Based on Greg's article and the above numbers showing that most
> checkpoints are triggered by running out of WAL segments, I should
> increase my checkpoint_segments. Also, based on the article, I should
> increase the bgwriter_lru_maxpages (currently at default 100).

You might actually want to decrease bgwriter_lru_maxpages to 0. The current
background writer design presumes you have some spare I/O capacity and want
to trade off some write overhead for possibly lower latency in client
backends. If you're at the point where the disks aren't keeping up with the
checkpoint disk I/O, this could turn counter-productive.

> My plan is to bump checkpoint_segments to 48 and reduce
> checkpoint_completion_target to 0.7, and bump the
> bgwriter_lru_maxpages to 500.

In your situation, I would increase checkpoint_segments to 64, keep the
completion target at 0.9, and decrease bgwriter_lru_maxpages to 0 so that
feature is turned off altogether. If you're still not happy, you can try
dropping the completion target too; I've heard one report of that logic not
working out so well in practice, where lower settings actually resulted in
less of an I/O spike. I would worry about the OS tuning first though, which
brings us to:

> Can the checkpoint operation actually cause the DB to stop responding
> for a few seconds at a time? That seems to be what I observe.
> Sometimes for 5 or more seconds one transaction will just stall.

There are two ways that writes can hang:

1) You've gotten to the point in the checkpoint cycle where it's calling
fsync to flush everything out of the filesystem. At this point you could
potentially have a big chunk of data that needs to go out to disk, and any
other client that needs to write something (or needs a resource locked by a
writer) is stuck behind that.

2) The entire OS write cache is filled. Once that happens, if a client
tries to write something else, the OS will force it to nap until there's
space again to hold that write.

How that all works really depends on the OS. I know that under Linux, the
way this is all handled is really bad, and can easily lead to multi-second
hangs. Generally you can watch the writes accumulate by looking at
/proc/meminfo. I wrote an example showing how the problem appears and what
I did to improve long hangs on one server at
http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html and
it refers to a longer theory page on Linux's pdflush on my web site.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
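To put some numbers behind the bgwriter_lru_maxpages question, here is a
rough sketch of what share of buffer writes each path is doing, again using
only the cumulative pg_stat_bgwriter counters:

select round(100.0 * buffers_checkpoint / total, 1) as pct_checkpoint,
       round(100.0 * buffers_clean      / total, 1) as pct_bgwriter_lru,
       round(100.0 * buffers_backend    / total, 1) as pct_backend
from (select buffers_checkpoint, buffers_clean, buffers_backend,
             buffers_checkpoint + buffers_clean + buffers_backend as total
      from pg_stat_bgwriter) s;

With the counters posted earlier that comes out to roughly 80% written at
checkpoint time, 3% by the background writer's LRU scan, and 17% by the
backends themselves, so the LRU scan is doing a fairly small share of the
work here.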
On 10/29/2009 04:42 PM, Scott Marlowe wrote:
> On Thu, Oct 29, 2009 at 12:46 PM, Vick Khera<vivek@khera.org> wrote:
>> On my primary DB I'm observing random slowness which just doesn't make
>> sense to me. The I/O system can easily do 40MB/sec writes, but I'm
>> only seeing a sustained 5MB/sec, even as the application is stalling
>> waiting on the DB.
>
> Just one point on top of everything else you'll hear. 40 MB/sec
> sequential throughput does not equal 40MB/sec random PLUS checkpoint
> throughput. Random access is gonna lower that 40MB/sec way down real
> fast.
>
> First step to speed things up is putting pg_xlog on its own disk(s).

Hi Scott,

How exactly do you do this? By creating a link to the new location, or is
there a config option somewhere that says where the pg_xlog resides?

Thanks,
Steve

> Since pg_xlog is mostly sequentially accessed, it's much faster when
> it's not competing with the rest of the db. THEN if you need faster
> disks you can buy them and throw them at path/data/base/. However,
> often just splitting things out like this will make a world of
> difference on a write-heavy database.
On Fri, 2009-10-30 at 07:15 -0400, Steve Clark wrote:
> On 10/29/2009 04:42 PM, Scott Marlowe wrote:
> > On Thu, Oct 29, 2009 at 12:46 PM, Vick Khera<vivek@khera.org> wrote:
> >> On my primary DB I'm observing random slowness which just doesn't make
> >> sense to me. The I/O system can easily do 40MB/sec writes, but I'm
> >> only seeing a sustained 5MB/sec, even as the application is stalling
> >> waiting on the DB.
> >
> > Just one point on top of everything else you'll hear. 40 MB/sec
> > sequential throughput does not equal 40MB/sec random PLUS checkpoint
> > throughput. Random access is gonna lower that 40MB/sec way down real
> > fast.
> >
> > First step to speed things up is putting pg_xlog on its own disk(s).
>
> Hi Scott,
>
> How exactly do you do this? By creating a link to the new location or
> is there a config option somewhere that says where the pg_xlog resides?

There is an option to do this during initdb. If you want to do it after the
DB is created, move the contents of pg_xlog/ (when the DB is shut down) and
make a symlink to the new directory.

--
Brad Nicholson 416-673-4106
Database Administrator, Afilias Canada Corp.
On Friday 30 October 2009 at 03:24:05, Greg Smith wrote:
> On Thu, 29 Oct 2009, Vick Khera wrote:
> > Greg, do you have a performance tuning book? If so, I really want to
> > buy it! Your articles are awesome.
>
> Give me a few more months...

Are you kidding, or is it a great teaser? I would love to know that it's
the latter. Consider that you have another reader of your book, if that's
really what you meant to say :)

--
Guillaume.
http://www.postgresqlfr.org
http://dalibo.com
On Thu, Oct 29, 2009 at 4:42 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> First step to speed things up is putting pg_xlog on its own disk(s).
> Since pg_xlog is mostly sequentially accessed, it's much faster when
>

Did that with a symlink. I always do that.

My frustration is that I have this really fast RAID subsystem connected via
fibre channel to a Sun X4100 with 20GB RAM and AMD Opterons. I don't seem
to be able to squeeze the update performance I really need out of it. I'm
wondering if it is all the FK checks that are slowing me down...

Thanks for responding to help me out.
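One way to test the FK theory directly is EXPLAIN ANALYZE on a
representative statement: it reports the time spent in each constraint
trigger separately from the statement itself. A sketch; the table and
column names here are made up, and the ROLLBACK keeps it from changing
anything:

begin;
explain analyze
    update orders set status = 'shipped' where order_id = 12345;
rollback;

If the per-trigger times dominate the total runtime, the FK checks (and
whether the referenced columns are indexed) are worth a closer look.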
On Thu, Oct 29, 2009 at 10:24 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> There are two ways that writes can hang:
>
> 1) You've gotten to the point in the checkpoint cycle where it's calling
> fsync to flush everything out of the filesystem. At this point you could
> potentially have a big chunk of data that needs to go out to disk, and any
> other client that needs to write something (or needs a resource locked by a
> writer) is stuck behind that.
>
> 2) The entire OS write cache is filled. Once that happens, if a client
> tries to write something else, the OS will force it to nap until there's
> space again to hold that write.

I'm on FreeBSD. I've got my shared_buffers set to 1200MB on a box with 20GB
of RAM, and fast external RAID on a fibre channel. Is that perhaps too low
to keep the "hot" bits of my DB in memory? I hear so much conflicting
advice on how big to set shared_buffers...

I've tweaked the checkpoint target to 0.8 and segments to 64. Is 15 minutes
reasonable for the timeout value?

I really don't think I'm saturating the disk bandwidth. I see spikes of
18MB/s or more, but it usually hovers at under 3MB/s according to iostat
output.
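On the shared_buffers question: if the contrib pg_buffercache module is
installed (it isn't by default, so treat this as an optional sketch), the
buffer usage counts give a rough feel for how much of the cache is actually
hot; the 8192-byte block size is again assumed from the default build:

select usagecount, count(*) as buffers,
       pg_size_pretty(count(*) * 8192) as approx_size
from pg_buffercache
group by usagecount
order by usagecount;

Roughly speaking, lots of buffers sitting at the highest usage counts
suggests the cache is working hard and might benefit from being bigger,
while mostly low counts suggests shared_buffers isn't the bottleneck.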
On Friday 30 October 2009 15:30:34 Vick Khera wrote:
> I really don't think I'm saturating the disk bandwidth. I see spikes
> of 18MB/s or more, but it usually hovers at under 3MB/s according to
> iostat output.

Looking at the iostat -x output is often more interesting - especially the
utilization and the average request latency.

Andres