Thread: Occasional Slow Commit
Hi, I've got an OLTP application which occasionally suffers from slow commit time. The process in question does something like this: 1. Do work 2. begin transaction 3. insert record 4. commit transaction 5. Do more work 6. begin transaction 7. update record 8. commit transaction 9. Do more work The vast majority of the time, everything runs very quickly. The median processing time for the whole thing is 7ms. However, occasionally, processing time will jump up significantly - the average processing time is around 20ms with the maximum processing time taking 2-4 seconds for a small percentage of transactions. Ouch! Turning on statement logging and analyzing the logs of the application itself shows that step #4 is the culprit of the vast majority of the slow transactions. Software: CentOS 4.7, PostgreSQL 8.3.4, Slony-I 1.2.15 (the database in question is replicated using slony) Hardware: 2x Xeon 5130, 4GB RAM, 6-disk RAID10 15k RPM, BBU on the controller Notable configuration changes: shared_buffers = 800MB temp_buffers = 200MB work_mem = 16M maintenance_work_mem = 800MB vacuum_cost_delay = 10 checkpoint_segments = 10 effective_cache_size = 2500MB I found this post[1] from a while back which was informative: Both situations affect me in that I have Slony which I believe executes triggers upon commit, and looking at the disk IO stats, there is an elevated level of IO activity during this time, but it doesn't appear to be heavy enough to cause the type of delays I am seeing. Reading this page[2] indicates that I may want to increase my checkpoint_segments, checkpoint_timeout and bgwriter settings, but looking at pg_stat_bgwriter seems to indicate that my current settings are probably OK? # select * from pg_stat_bgwriter; checkpoints_timed | checkpoints_req | buffers_checkpoint | buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc -------------------+-----------------+--------------------+---------------+------------------+-----------------+--------------- 3834 | 105 | 3091905 | 25876 | 110 | 2247576 | 2889873 Any suggestions on how to proceed and debug the problem from here? My only other guess is that there is some sort of locking issues going on which is slowing things down and that it may also be slony related, as I also see a high number of slony related queries taking longer than 1 second... Thanks -Dave [1] http://archives.postgresql.org/pgsql-performance/2008-01/msg00005.php [2] http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm
On Mon, Oct 27, 2008 at 8:23 PM, David Rees <drees76@gmail.com> wrote: > Hi, > > I've got an OLTP application which occasionally suffers from slow > commit time. The process in question does something like this: > > 1. Do work > 2. begin transaction > 3. insert record > 4. commit transaction > 5. Do more work > 6. begin transaction > 7. update record > 8. commit transaction > 9. Do more work > > The vast majority of the time, everything runs very quickly. The > median processing time for the whole thing is 7ms. > > However, occasionally, processing time will jump up significantly - > the average processing time is around 20ms with the maximum processing > time taking 2-4 seconds for a small percentage of transactions. Ouch! > > Turning on statement logging and analyzing the logs of the application > itself shows that step #4 is the culprit of the vast majority of the > slow transactions. > > Software: CentOS 4.7, PostgreSQL 8.3.4, Slony-I 1.2.15 (the database > in question is replicated using slony) > > Hardware: 2x Xeon 5130, 4GB RAM, 6-disk RAID10 15k RPM, BBU on the controller > > Notable configuration changes: > shared_buffers = 800MB > temp_buffers = 200MB > work_mem = 16M > maintenance_work_mem = 800MB > vacuum_cost_delay = 10 > checkpoint_segments = 10 > effective_cache_size = 2500MB > > I found this post[1] from a while back which was informative: > > Both situations affect me in that I have Slony which I believe > executes triggers upon commit, and looking at the disk IO stats, there > is an elevated level of IO activity during this time, but it doesn't > appear to be heavy enough to cause the type of delays I am seeing. > > Reading this page[2] indicates that I may want to increase my > checkpoint_segments, checkpoint_timeout and bgwriter settings, but > looking at pg_stat_bgwriter seems to indicate that my current settings > are probably OK? > > # select * from pg_stat_bgwriter; > checkpoints_timed | checkpoints_req | buffers_checkpoint | > buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc > -------------------+-----------------+--------------------+---------------+------------------+-----------------+--------------- > 3834 | 105 | 3091905 | > 25876 | 110 | 2247576 | 2889873 > > Any suggestions on how to proceed and debug the problem from here? > > My only other guess is that there is some sort of locking issues going > on which is slowing things down and that it may also be slony related, > as I also see a high number of slony related queries taking longer > than 1 second... I bet your problem is disk syncing. Your xlogs are on the data volume so any type of burst activity can push back commit times. If this is the case, you have basically three solutions to this problem: *) buy more disks (i's start with pushing the xlogs out to dedicated volume) *) disable fsync (very unsafe) or synchronous commit (somewhat less unsafe) *) checkpoint/bgwriter tuning: can provide incremental gains. This is not magic...at best you can smooth out bursty checkpoints. If your problems are really serious (yours don't seem to be), you have to look at the previous options. Have you temporarily disabling slony to see if the problem goes away? (My guess is it's not slony). merlin
On Mon, Oct 27, 2008 at 05:23:37PM -0700, David Rees wrote: > However, occasionally, processing time will jump up significantly - > the average processing time is around 20ms with the maximum processing > time taking 2-4 seconds for a small percentage of transactions. Ouch! > > Turning on statement logging and analyzing the logs of the application > itself shows that step #4 is the culprit of the vast majority of the > slow transactions. My bet is that you're waiting on checkpoints. Given that you're on 8.3, start fiddling with the checkpoint_completion_target parameter. 0.7 might help. A -- Andrew Sullivan ajs@commandprompt.com +1 503 667 4564 x104 http://www.commandprompt.com/
On Mon, 27 Oct 2008, David Rees wrote: > Software: CentOS 4.7, PostgreSQL 8.3.4, Slony-I 1.2.15 (the database > in question is replicated using slony) > Hardware: 2x Xeon 5130, 4GB RAM, 6-disk RAID10 15k RPM, BBU on the controller The CentOS 4.7 kernel will happily buffer about 1.6GB of writes with that much RAM, and the whole thing can get slammed onto disk during the final fsync portion of the checkpoint. What you should do first is confirm whether or not the slow commits line up with the end of the checkpoint, which is easy to see if you turn on log_checkpoints. That gives you timings for the write and fsync phases of the checkpoint which can also be informative. > Reading this page[2] indicates that I may want to increase my > checkpoint_segments, checkpoint_timeout and bgwriter settings, but > looking at pg_stat_bgwriter seems to indicate that my current settings > are probably OK? > > # select * from pg_stat_bgwriter; > checkpoints_timed | checkpoints_req | buffers_checkpoint | > 3834 | 105 | 3,091,905 | > buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc > 25876 | 110 | 2,247,576 | 2,889,873 I reformatted the above to show what's happening a bit better. Most of your checkpoints are the timed ones, which are unlikely to cause interference from a slow commit period (the writes are spread out over 5 minutes in those cases). It's quite possible the slow periods are coming only from the occasional requested checkpoints, which normally show up because checkpoint_segments is too low and you chew through segments too fast. If you problems line up with checkpoint time, you would likely gain some benefit from increasing checkpoint_segments to spread out the checkpoint writes further; the 10 you're using now is on the low side for your hardware. If the problems don't go away after that, you may be suffering from excessive Linux kernel buffering instead. I've got a blog entry showing how I tracked down a similar long pause on a Linux server at http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html you may find helpful for determining if your issue is this one (which is pretty common on RHEL systems having relatively large amounts of RAM) or if it's something else, like the locking you mentioned. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
On Wed, Oct 29, 2008 at 6:26 AM, Greg Smith <gsmith@gregsmith.com> wrote: > The CentOS 4.7 kernel will happily buffer about 1.6GB of writes with that > much RAM, and the whole thing can get slammed onto disk during the final > fsync portion of the checkpoint. What you should do first is confirm > whether or not the slow commits line up with the end of the checkpoint, > which is easy to see if you turn on log_checkpoints. That gives you timings > for the write and fsync phases of the checkpoint which can also be > informative. OK, log_checkpoints is turned on to see if any delays correspond to checkpoint activity... >> Reading this page[2] indicates that I may want to increase my >> checkpoint_segments, checkpoint_timeout and bgwriter settings, but >> looking at pg_stat_bgwriter seems to indicate that my current settings >> are probably OK? >> >> # select * from pg_stat_bgwriter; >> checkpoints_timed | checkpoints_req | buffers_checkpoint | >> 3834 | 105 | 3,091,905 | >> buffers_clean | maxwritten_clean | buffers_backend | buffers_alloc >> 25876 | 110 | 2,247,576 | 2,889,873 > > I reformatted the above to show what's happening a bit better. Sorry, gmail killed the formatting. > Most of your > checkpoints are the timed ones, which are unlikely to cause interference > from a slow commit period (the writes are spread out over 5 minutes in those > cases). It's quite possible the slow periods are coming only from the > occasional requested checkpoints, which normally show up because > checkpoint_segments is too low and you chew through segments too fast. If > you problems line up with checkpoint time, you would likely gain some > benefit from increasing checkpoint_segments to spread out the checkpoint > writes further; the 10 you're using now is on the low side for your > hardware. OK, I've also bumped up checkpoint_segments to 20 and checkpoint_completion_target to 0.7 in an effort to reduce the effect of checkpoints. > If the problems don't go away after that, you may be suffering from > excessive Linux kernel buffering instead. I've got a blog entry showing how > I tracked down a similar long pause on a Linux server at > http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html you > may find helpful for determining if your issue is this one (which is pretty > common on RHEL systems having relatively large amounts of RAM) or if it's > something else, like the locking you mentioned. Ah, interesting. I've also turned down the dirty_ratio and dirty_background_ratio as suggested, but I don't think this would be affecting things here. The rate of IO on this server is very low compared to what it's capable of. Thanks for the suggestions, I'll report back with results. -Dave
(Resending this, the first one got bounced by mail.postgresql.org) On Wed, Oct 29, 2008 at 3:30 PM, David Rees <drees76@gmail.com> wrote: > On Wed, Oct 29, 2008 at 6:26 AM, Greg Smith <gsmith@gregsmith.com> wrote: >> What you should do first is confirm >> whether or not the slow commits line up with the end of the checkpoint, >> which is easy to see if you turn on log_checkpoints. That gives you timings >> for the write and fsync phases of the checkpoint which can also be >> informative. > > OK, log_checkpoints is turned on to see if any delays correspond to > checkpoint activity... Well, I'm pretty sure the delays are not checkpoint related. None of the slow commits line up at all with the end of checkpoints. The period of high delays occur during the same period of time each week, but it's not during a particularly high load period on the systems. It really seems like there must be something running in the background that is not showing up on the system activity logs, like a background RAID scrub or something. Here are a couple representative checkpoint complete messages from the logs: 2008-10-31 20:12:03 UTC: : : LOG: checkpoint complete: wrote 285 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=57.933 s, sync=0.053 s, total=57.990 s 2008-10-31 20:17:33 UTC: : : LOG: checkpoint complete: wrote 437 buffers (0.4%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=87.891 s, sync=0.528 s, total=88.444 s 2008-10-31 20:22:05 UTC: : : LOG: checkpoint complete: wrote 301 buffers (0.3%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=60.774 s, sync=0.033 s, total=60.827 s 2008-10-31 20:27:46 UTC: : : LOG: checkpoint complete: wrote 504 buffers (0.5%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=101.037 s, sync=0.049 s, total=101.122 s During this period of time there was probably 100 different queries writing/commiting data that took longer than a second (3-4 seconds seems to be the worst). The RAID controller on this machine is some sort of MegaRAID controller. I'll have to see if there is some sort of scheduled scan running during this period of time. One of the replicate nodes is an identical machine which I just noticed has the same slony commit slow downs logged even though it's only receiving data from slony from the primary node. There are two other nodes listening in on the same subscription, but these two nodes don't show the same slony commit slow downs, but these machines use a slightly different raid controller and are about 9 months newer than the other two. I'm still hoping that the checkpoint tuning has reduced commit latency during busy periods, I should have more data after the weekend. -Dave
On Fri, Oct 31, 2008 at 4:14 PM, David Rees <drees76@gmail.com> wrote: > Well, I'm pretty sure the delays are not checkpoint related. None of > the slow commits line up at all with the end of checkpoints. > > The period of high delays occur during the same period of time each > week, but it's not during a particularly high load period on the > systems. > > It really seems like there must be something running in the background > that is not showing up on the system activity logs, like a background > RAID scrub or something. OK, I finally had a chance to dig at this problem some more, and after futzing around with the MegaCli tools (major PITA, btw), I was able to confirm that there is a feature called "Patrol Read" on this LSI Megaraid SAS card which runs a weekly background read scan of the disks looking for errors. It is during this time period that I get lots of slow commits and transactions. FWIW, the card identifies itself from lspci as this: LSI Logic / Symbios Logic MegaRAID SAS Subsystem: Intel Corporation SROMBSAS18E RAID Controller I also found that my write cache was set to WriteThrough instead of WriteBack, defeating the purpose of having a BBU and that my secondary server apparently doesn't have a BBU on it. :-( Anyway, has anyone done any benchmarking of MegaRAID SAS controllers? I am configuring my arrays to use these settings: Read Policy: Normal (Normal, Read ahead & Adaptive read head) Write Policy: Writeback (Writeback, Writethrough) Disable Writeback if bad BBU IO Policy: Direct (Direct, Cached) Disk Cache: Enable (Enable, Disable, Unchanged) The only setting I'm really concerned about is the Disk Cache setting - is it safe to assume that the controller will do the right thing with regards to flushing the disk cache when appropriate to avoid data loss? LSI RAID cards seem to be pretty well respected, so I'd have to guess yes. Thanks Dave
> I also found that my write cache was set to WriteThrough instead of > WriteBack, defeating the purpose of having a BBU and that my secondary > server apparently doesn't have a BBU on it. :-( Note also that several RAID controllers will periodically drop the write-back mode during battery capacity tests. If you care about consistently/deterministically having full performance (with white-backed battery protected caching), you probably want to confirm your controller behavior here. (I've seen this on at least LSI based controllers in Dell 2950:s, and also on some 3wares.) -- / Peter Schuller PGP userID: 0xE9758B7D or 'Peter Schuller <peter.schuller@infidyne.com>' Key retrieval: Send an E-Mail to getpgpkey@scode.org E-Mail: peter.schuller@infidyne.com Web: http://www.scode.org
Attachment
On Thu, Nov 6, 2008 at 2:21 AM, Peter Schuller <peter.schuller@infidyne.com> wrote: >> I also found that my write cache was set to WriteThrough instead of >> WriteBack, defeating the purpose of having a BBU and that my secondary >> server apparently doesn't have a BBU on it. :-( > > Note also that several RAID controllers will periodically drop the > write-back mode during battery capacity tests. If you care about > consistently/deterministically having full performance (with > white-backed battery protected caching), you probably want to confirm > your controller behavior here. > > (I've seen this on at least LSI based controllers in Dell 2950:s, and > also on some 3wares.) I can confirm that this is the case by reviewing the logs stored on the MegaRAID controller that have a BBU and had WriteBack configured. The controller also lets you know (using the MegaCli utility) what setting is configured, and what setting is in effect. In the case of the machines without a BBU on them, they are configured to be in WriteBack, but are actually running in WriteThrough. -Dave
On Thu, Nov 6, 2008 at 8:47 AM, David Rees <drees76@gmail.com> wrote: > > In the case of the machines without a BBU on them, they are configured > to be in WriteBack, but are actually running in WriteThrough. I'm pretty sure the LSIs will refuse to actually run in writeback without a BBU.
On Thu, Nov 6, 2008 at 8:07 AM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > On Thu, Nov 6, 2008 at 8:47 AM, David Rees <drees76@gmail.com> wrote: >> >> In the case of the machines without a BBU on them, they are configured >> to be in WriteBack, but are actually running in WriteThrough. > > I'm pretty sure the LSIs will refuse to actually run in writeback without a BBU. It's configurable (at least on the MegaRAID cards I'm using). There's an option you can turn on that lets you run in writeback when the BBU is offline. -Dave