Thread: Long Running Commits - Not Checkpoints
I'm having a problem with long running commits appearing in my database logs. It may be hardware related, as the problem appeared when we moved the database to a new server connected to a different disk array. The disk array is a lower class array, but still more than powerful enough to handle the IO requirements. One big difference though is that the old array had 16 GB of cache, the new one has 4 GB.

Running Postgres 8.1.8 on AIX 5.3

We have enough IO to spare that we have the bgwriter cranked up pretty high, dirty buffers are getting written out quickly. Vmstat indicates 0 io wait time, no swapping or anything nasty like that going on.

The long running commits do not line up with checkpoint times.

The postgresql.conf configs are identical except that wal_buffers was 8 on the old master, and it is set to 16 on the new one.

We have other installations of this product running on the same array (different servers though) and they are not suffering from this problem.

The only other thing of note is that the wal files sit on the same disk as the data directory. This has not changed between the old and new config, but the installs that are running fine do have their wal files on a separate partition.

Any ideas where the problem could lie? Could having the wal files on the same data partition cause long running commits when there is plenty of IO to spare?

--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.
On Thu, 2007-09-13 at 10:15 -0400, Brad Nicholson wrote:
> I'm having a problem with long running commits appearing in my database
> logs. It may be hardware related, as the problem appeared when we moved
> the database to a new server connected to a different disk array. The
> disk array is a lower class array, but still more than powerful enough
> to handle the IO requirements. One big difference though is that the
> old array had 16 GB of cache, the new one has 4 GB.
>
> Running Postgres 8.1.8 on AIX 5.3
>
> We have enough IO to spare that we have the bgwriter cranked up pretty
> high, dirty buffers are getting written out quickly. Vmstat indicates 0
> io wait time, no swapping or anything nasty like that going on.
>
> The long running commits do not line up with checkpoint times.
>
> The postgresql.conf configs are identical except that wal_buffers was 8
> on the old master, and it is set to 16 on the new one.
>
> We have other installations of this product running on the same array
> (different servers though) and they are not suffering from this
> problem.
>
> The only other thing of note is that the wal files sit on the same disk
> as the data directory. This has not changed between the old and new
> config, but the installs that are running fine do have their wal files
> on a separate partition.
>
> Any ideas where the problem could lie? Could having the wal files on
> the same data partition cause long running commits when there is plenty
> of IO to spare?

More on this - we also have long running commits on installations that
do have the wal files on a separate partition.

--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.
Brad Nicholson <bnichols@ca.afilias.info> writes:
> On Thu, 2007-09-13 at 10:15 -0400, Brad Nicholson wrote:
>> I'm having a problem with long running commits appearing in my database
>> logs. It may be hardware related, as the problem appeared when we moved
>> the database to a new server connected to a different disk array.

> More on this - we also have long running commits on installations that
> do have the wal files on a separate partition.

What's your definition of "long running commit" --- seconds? milliseconds?
Exactly what are you measuring? Can you correlate the problem with what
the transaction was doing?

			regards, tom lane
On Thu, 2007-09-13 at 11:10 -0400, Tom Lane wrote:
> Brad Nicholson <bnichols@ca.afilias.info> writes:
> > On Thu, 2007-09-13 at 10:15 -0400, Brad Nicholson wrote:
> >> I'm having a problem with long running commits appearing in my database
> >> logs. It may be hardware related, as the problem appeared when we moved
> >> the database to a new server connected to a different disk array.
>
> > More on this - we also have long running commits on installations that
> > do have the wal files on a separate partition.
>
> What's your definition of "long running commit" --- seconds? milliseconds?
> Exactly what are you measuring? Can you correlate the problem with what
> the transaction was doing?

log_min_duration_statement is set to 150ms. Commits are running over that,
up to 788ms. Here is what we see in the logs (with obfuscated dbname,
username and IP):

2007-09-13 10:01:49.787 CUT [782426] dbname username 1.2.3.171 LOG: duration: 224.286 ms statement: EXECUTE <unnamed> [PREPARE: commit]
2007-09-13 10:19:16.373 CUT [737404] dbname username 1.2.3.174 LOG: duration: 372.545 ms statement: EXECUTE <unnamed> [PREPARE: commit]
2007-09-13 10:19:24.437 CUT [1806498] dbname username 11.2.3.171 LOG: duration: 351.544 ms statement: EXECUTE <unnamed> [PREPARE: commit]
2007-09-13 10:33:11.204 CUT [962598] dbname username 1.2.3.170 LOG: duration: 504.057 ms statement: EXECUTE <unnamed> [PREPARE: commit]
2007-09-13 10:40:33.735 CUT [1282104] dbname username 1.2.3.174 LOG: duration: 250.127 ms statement: EXECUTE <unnamed> [PREPARE: commit]
2007-09-13 10:49:54.752 CUT [1188032] dbname username 1.2.3.170 LOG: duration: 382.781 ms statement: EXECUTE <unnamed> [PREPARE: commit]
2007-09-13 11:30:43.339 CUT [1589464] dbname username 1.2.3.172 LOG: duration: 408.463 ms statement: EXECUTE <unnamed> [PREPARE: commit]

--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.
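For reference, the kind of postgresql.conf logging setup that produces entries like those above would look roughly like this on 8.1; the log_line_prefix shown is only a guess chosen to resemble the sample lines, not taken from the actual configuration:

log_min_duration_statement = 150        # ms; any statement running longer than this is
                                        # logged together with its duration
log_line_prefix = '%t [%p] %d %u %h '   # timestamp, backend PID, database, user, client host
redirect_stderr = on                    # 8.1-era setting for capturing the server log to files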
On Thu, 13 Sep 2007, Brad Nicholson wrote:

> One big difference though is that the old array had 16 GB of cache, the
> new one has 4 GB.
>
> We have enough IO to spare that we have the bgwriter cranked up pretty
> high, dirty buffers are getting written out quickly.

If your system is very active, running the bgwriter very aggressively will
result in the cache on the disk array being almost continuously filled with
pending writes that then trickle their way onto real disk eventually. If
the typical working set on this system results in the amount of background
writer cached writes regularly being >4GB but <16GB, that would explain
what you're seeing.

The resolution is actually unexpected by most people: you make the
background writer less aggressive so that it's issuing fewer redundant
writes clogging the array's cache, leaving more cache to buffer the actual
commits so they don't block. This will increase the odds that there will be
a checkpoint block instead, but if you're seeing none of them right now you
may have some margin there to reduce the BGW activity without aggravating
checkpoints.

Since you're probably not monitoring I/O waits and similar statistics on
how the disk array's cache is being used, whether this is happening or not
to you won't be obvious from what the operating system is reporting. One or
two blocked writes every couple of minutes won't even show up on the gross
statistics for operating system I/O waits; on average, they'll still be 0.
So it's possible you may think you have plenty of I/O to spare, when in
fact you don't quite have enough--what you've got is enough cache that the
OS can't see where the real I/O bottleneck is.

I'd be curious to see how you've got your background writer configured, to
see if it matches situations like this I've seen in the past. The
parameters controlling the all scan are the ones you might consider
turning down, definitely the percentage and possibly the maxpages as well.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Thu, 2007-09-13 at 12:12 -0400, Greg Smith wrote:
> On Thu, 13 Sep 2007, Brad Nicholson wrote:
>
> I'd be curious to see how you've got your background writer configured, to
> see if it matches situations like this I've seen in the past. The
> parameters controlling the all scan are the ones you might consider
> turning down, definitely the percentage and possibly the maxpages as well.

bgwriter_delay = 50             # 10-10000 milliseconds between rounds
bgwriter_lru_percent = 20.0     # 0-100% of LRU buffers scanned/round
bgwriter_lru_maxpages = 300     # 0-1000 buffers max written/round
bgwriter_all_percent = 20       # 0-100% of all buffers scanned/round
bgwriter_all_maxpages = 600     # 0-1000 buffers max written/round

--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.
On Thu, 2007-09-13 at 12:19 -0400, Brad Nicholson wrote:
> On Thu, 2007-09-13 at 12:12 -0400, Greg Smith wrote:
> > On Thu, 13 Sep 2007, Brad Nicholson wrote:
> >
> > I'd be curious to see how you've got your background writer configured, to
> > see if it matches situations like this I've seen in the past. The
> > parameters controlling the all scan are the ones you might consider
> > turning down, definitely the percentage and possibly the maxpages as well.
>
> bgwriter_delay = 50             # 10-10000 milliseconds between rounds
> bgwriter_lru_percent = 20.0     # 0-100% of LRU buffers scanned/round
> bgwriter_lru_maxpages = 300     # 0-1000 buffers max written/round
> bgwriter_all_percent = 20       # 0-100% of all buffers scanned/round
> bgwriter_all_maxpages = 600     # 0-1000 buffers max written/round

I should add, there are 6 back ends running on this disk array
(different servers and different data partitions) with these bgwriter
settings.

--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.
Brad Nicholson wrote:
> On Thu, 2007-09-13 at 12:19 -0400, Brad Nicholson wrote:
> > On Thu, 2007-09-13 at 12:12 -0400, Greg Smith wrote:
> > > On Thu, 13 Sep 2007, Brad Nicholson wrote:
> > >
> > > I'd be curious to see how you've got your background writer configured, to
> > > see if it matches situations like this I've seen in the past. The
> > > parameters controlling the all scan are the ones you might consider
> > > turning down, definitely the percentage and possibly the maxpages as well.
> >
> > bgwriter_delay = 50             # 10-10000 milliseconds between rounds
> > bgwriter_lru_percent = 20.0     # 0-100% of LRU buffers scanned/round
> > bgwriter_lru_maxpages = 300     # 0-1000 buffers max written/round
> > bgwriter_all_percent = 20       # 0-100% of all buffers scanned/round
> > bgwriter_all_maxpages = 600     # 0-1000 buffers max written/round
>
> I should add, there are 6 back ends running on this disk array
> (different servers and different data partitions) with these bgwriter
> settings.

Maybe it is running deferred triggers or something?

--
Alvaro Herrera http://www.advogato.org/person/alvherre
"I suspect most samba developers are already technically insane...
Of course, since many of them are Australians, you can't tell." (L. Torvalds)
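If deferred triggers are a suspect, one rough way to check is to look for deferrable constraint triggers in the catalogs. A sketch against the 8.1 system tables (untested here; deferrable foreign key constraints will also show up in the result) might be:

SELECT tgrelid::regclass AS table_name,
       tgname,
       tginitdeferred AS initially_deferred
FROM   pg_catalog.pg_trigger
WHERE  tgdeferrable
ORDER  BY 1, 2;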
On Thu, 2007-09-13 at 12:12 -0400, Greg Smith wrote:
> Since you're probably not monitoring I/O waits and similar statistics on
> how the disk array's cache is being used, whether this is happening or not
> to you won't be obvious from what the operating system is reporting.

A sysadmin looked at cache usage on the disk array. The read cache is
being used heavily, and the write cache is not.

--
Brad Nicholson  416-673-4106
Database Administrator, Afilias Canada Corp.
On Thu, 13 Sep 2007, Brad Nicholson wrote:

> A sysadmin looked at cache usage on the disk array. The read cache is
> being used heavily, and the write cache is not.

Given that information, you can take the below (which I was just about to
send before the above update came in) as something to think about and test,
but perhaps not your primary line of attack. Even if my theory about the
exact mechanism involved isn't correct, the background writer is still
problematic in terms of its impact on the system when run as aggressively
as you're doing it; I'm not sure, but I think that's even more true on 8.1
than it is on 8.2, where I did most of my testing in this area.

> bgwriter_delay = 50
> bgwriter_lru_percent = 20.0
> bgwriter_lru_maxpages = 300
> bgwriter_all_percent = 20
> bgwriter_all_maxpages = 600

That was what I was expecting. Your all scan has the potential to be
writing 600*8K*(1/50 msec)=98MB/sec worth of data to your disk array. Since
some of this data has a random access component to it, your array cannot be
expected to keep up with a real peak load; the only thing saving you if
something starts dirtying buffers as fast as possible is that the array
cache is buffering things. And that 4GB worth of cache could be filling in
very little time.

Every time the all scan writes a buffer that is frequently used, there is a
good chance that write was wasted, because the block will be modified again
before checkpoint time. Your settings are beyond regular aggressive and
into the hyperactive territory where I'd expect such redundant writes are
happening often. I'd suggest you try to move toward dropping
bgwriter_all_percent dramatically from its current setting and see how far
down you can go before it starts to introduce blocks at checkpoint time.
With bgwriter_delay set to 1/4 the default, I would expect that even 5%
would be a high setting for you. That may be a more dramatic change than
you want to make at once though, so lowering it in that direction more
slowly (perhaps drop 5% each day) and seeing whether things improve as that
happens may make more sense.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
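As a concrete illustration of that direction, a first incremental step might look like the following; these numbers are only a hypothetical starting point based on the advice above, not values tested on this system:

# Step the all scan down gradually (e.g. one notch per day), watching the logs
# for slow commits that start lining up with checkpoint times instead:
bgwriter_all_percent = 15        # was 20; keep heading toward ~5 if commits still stall
bgwriter_all_maxpages = 400      # was 600; lower this alongside the percentage
# bgwriter_delay stays at 50 for now; raising it back toward the default of 200
# would be another way to make the background writer less aggressive.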
On 13/09/2007, Greg Smith <gsmith@gregsmith.com> wrote:
> Every time the all scan writes a buffer that is frequently used, there is a
> good chance that write was wasted, because the block will be modified again
> before checkpoint time. Your settings are beyond regular aggressive and
> into the hyperactive territory where I'd expect such redundant writes are
> happening often. I'd suggest you try to move toward dropping
> bgwriter_all_percent dramatically from its current setting and see how far
> down you can go before it starts to introduce blocks at checkpoint time.
> With bgwriter_delay set to 1/4 the default, I would expect that even 5%
> would be a high setting for you. That may be a more dramatic change than
> you want to make at once though, so lowering it in that direction more
> slowly (perhaps drop 5% each day) and seeing whether things improve as that
> happens may make more sense.
Are you suggesting that reducing bgwriter_delay and bg_writer_percent would reduce the time spent doing commits?
I get quite a few commits that take over 500ms (the point when I start logging queries). I always thought it was just one of those things, but if they can be reduced by changing a few config variables, that would be great. I'm just trying to work out what figures are worth trying to see if I can reduce them.
From time to time I get commits that take 6 or 7 seconds but not all the time.
I'm currently working with the defaults.
Peter Childs
On 14/09/2007, Peter Childs <peterachilds@gmail.com> wrote:
> On 13/09/2007, Greg Smith <gsmith@gregsmith.com> wrote:
> > Every time the all scan writes a buffer that is frequently used, there is a
> > good chance that write was wasted, because the block will be modified again
> > before checkpoint time. Your settings are beyond regular aggressive and
> > into the hyperactive territory where I'd expect such redundant writes are
> > happening often. I'd suggest you try to move toward dropping
> > bgwriter_all_percent dramatically from its current setting and see how far
> > down you can go before it starts to introduce blocks at checkpoint time.
> > With bgwriter_delay set to 1/4 the default, I would expect that even 5%
> > would be a high setting for you. That may be a more dramatic change than
> > you want to make at once though, so lowering it in that direction more
> > slowly (perhaps drop 5% each day) and seeing whether things improve as that
> > happens may make more sense.
> Are you suggesting that reducing bgwriter_delay and bg_writer_percent would
> reduce the time spent doing commits?
>
> I get quite a few commits that take over 500ms (the point when I start
> logging queries). I always thought it was just one of those things, but if
> they can be reduced by changing a few config variables, that would be great.
> I'm just trying to work out what figures are worth trying to see if I can
> reduce them.
>
> From time to time I get commits that take 6 or 7 seconds but not all the
> time.
>
> I'm currently working with the defaults.
>
> Peter Childs
Hmm, always read the manual. Increase them from the defaults...
Peter.
> I'm having a problem with long running commits appearing in my database
> logs. It may be hardware related, as the problem appeared when we moved
> the database to a new server connected to a different disk array. The
> disk array is a lower class array, but still more than powerful enough
> to handle the IO requirements. One big difference though is that the
> old array had 16 GB of cache, the new one has 4 GB.

Maybe the old disk array had battery-backed RAM that was used as a write
cache, and the new one only has read cache? Without battery-backed RAM (or
flash or whatever), a commit needs to flush data down onto the physical
disk.

/Dennis
On Fri, 14 Sep 2007, Peter Childs wrote:

> Are you suggesting that reducing bgwriter_delay and bg_writer_percent
> would reduce the time spent doing commits? I get quite a few commits
> that take over 500ms (the point when I start logging queries).

One very common cause for transactions blocking for as much as several
seconds is hitting a checkpoint, which in current versions causes a large
amount of data to be written and synchronized to the physical disk. If
you're already tracking long transactions and trying to figure out what's
causing them, you should set checkpoint_warning to a high value (the
maximum of 3600 is generally fine). That will give you a harmless warning
every time a checkpoint occurs. If the slow transactions line up with
checkpoints, then you might consider tuning or re-tuning the background
writer to eliminate the delay.

In this particular case, their system had already been tuned so
aggressively that I wondered if the background writer was being a problem
rather than a solution. Reducing the percentage would turn it down a bit;
so would *increasing* the delay--they had already decreased it
considerably, making it 4X as active as the default.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
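A minimal sketch of that logging change, assuming the 8.1-era settings discussed in this thread:

checkpoint_warning = 3600            # seconds; logs a warning when checkpoints happen
                                     # closer together than this, i.e. effectively always
log_min_duration_statement = 500     # ms; the slow-statement logging already in place

With both in place, a slow commit that repeatedly lands next to a checkpoint warning in the log points at checkpoint I/O rather than the background writer.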