Thread: Checkpoint execution overrun impact?
Hi all I'm trying to track down the causes of an application crash and reviewing PG logs I'm seeing this: 2011-04-22 06:00:16 CEST LOG: checkpoint complete: wrote 140 buffers (3.4%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=27.937 s, sync=1.860 s, total=29.906 s 2011-04-22 06:04:47 CEST LOG: checkpoint starting: time 2011-04-22 06:05:20 CEST LOG: checkpoint complete: wrote 161 buffers (3.9%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=32.188 s, sync=0.890 s, total=33.157 s 2011-04-22 06:09:20 CEST ERROR: update or delete on table "automezzo" violates foreign key constraint "fk77e102615cffe609" on table "carico" 2011-04-22 06:09:20 CEST DETAIL: Key (id)=(13237) is still referenced from table "carico". 2011-04-22 06:09:20 CEST STATEMENT: delete from Automezzo where id=$1 2011-04-22 06:09:20 CEST ERROR: current transaction is aborted, commands ignored until end of transaction block 2011-04-22 06:09:20 CEST STATEMENT: SELECT key from CONFIGURAZIONE LIMIT 0 2011-04-22 06:09:47 CEST LOG: checkpoint starting: time 2011-04-22 06:10:15 CEST LOG: checkpoint complete: wrote 137 buffers (3.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=27.406 s, sync=0.781 s, total=28.360 s 2011-04-22 06:14:47 CEST LOG: checkpoint starting: time 2011-04-22 06:15:14 CEST LOG: checkpoint complete: wrote 136 buffers (3.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=27.078 s, sync=0.797 s, total=27.984 s 2011-04-22 06:19:47 CEST LOG: checkpoint starting: time 2011-04-22 06:20:20 CEST LOG: checkpoint complete: wrote 164 buffers (4.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=32.938 s, sync=0.594 s, total=33.625 s 2011-04-22 06:24:47 CEST LOG: checkpoint starting: time 2011-04-22 06:25:20 CEST LOG: checkpoint complete: wrote 160 buffers (3.9%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=31.875 s, sync=0.953 s, total=32.922 s 2011-04-22 06:29:47 CEST LOG: checkpoint starting: time 2011-04-22 06:30:14 CEST LOG: checkpoint complete: wrote 130 buffers (3.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=25.922 s, sync=0.937 s, total=26.937 s 2011-04-22 06:34:47 CEST LOG: checkpoint starting: time 2011-04-22 06:35:15 CEST LOG: checkpoint complete: wrote 132 buffers (3.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=26.578 s, sync=1.250 s, total=27.984 s 2011-04-22 06:39:47 CEST LOG: checkpoint starting: time 2011-04-22 06:40:15 CEST LOG: checkpoint complete: wrote 136 buffers (3.3%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=27.250 s, sync=0.984 s, total=28.453 s 2011-04-22 06:44:47 CEST LOG: checkpoint starting: time 2011-04-22 06:51:41 CEST LOG: checkpoint complete: wrote 108 buffers (2.6%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=409.007 s, sync=4.672 s, total=414.070 s 2011-04-22 06:55:42 CEST LOG: could not receive data from client: No connection could be made because the target machine actively refused it. <goes on like the above until application restart> Now, I'm fairly new with PG and checkpoint internals but even after reading around (PG docs, http://www.westnet.com/~gsmith/content/postgresql/chkp-bgw-83.htm and various other posts from this ml) I haven't been able to figure out the following: -given the configuration attached (which is basically a vanilla one) and the number of buffers written at each execution, are these execution times normal or above average? -in the case of the execution that overruns past the timeout, what are the implications wrt the client application? -AFAIU client connections are basically stalled during checkpoints. Is it reasonable to infer that the fact that the application blocking on a getConnection() might be related to checkpoints being executed? -considering some tuning on the PG side, should I try increasing checkpoint_timeout and rising checkpoint_completion_target to lessen the impact of IO on the client or should I shorten the period so there's less stuff to write? from the number of buffers written on average I'd assume the first option is the one to go for but I might miss some bit of reasoning here... The server both PG and the application server (Tomcat) are running on is a Stratus ft2300 machine, which I think is setup to do RAID1. I've read about RAID5 not being a wise setup for disks hosting PG, what about RAID1? Postgres version is 8.3 http://postgresql.1045698.n5.nabble.com/file/n4332601/postgresql.conf postgresql.conf Any help will be greatly appreciated:) Thanks, and sorry for the long post. cheers Francesco -- View this message in context: http://postgresql.1045698.n5.nabble.com/Checkpoint-execution-overrun-impact-tp4332601p4332601.html Sent from the PostgreSQL - performance mailing list archive at Nabble.com.
On Fri, Apr 22, 2011 at 5:21 AM, drvillo <f.vivoli@gmail.com> wrote: > -given the configuration attached (which is basically a vanilla one) and the > number of buffers written at each execution, are these execution times > normal or above average? They seem fine. Remember that the write is deliberately spread out; it's not as if the system couldn't write out 130-160 8k blocks in less than 30 s. > -in the case of the execution that overruns past the timeout, what are the > implications wrt the client application? Not sure what you are referring to here. > -AFAIU client connections are basically stalled during checkpoints. Is it > reasonable to infer that the fact that the application blocking on a > getConnection() might be related to checkpoints being executed? > -considering some tuning on the PG side, should I try increasing > checkpoint_timeout and rising checkpoint_completion_target to lessen the > impact of IO on the client or should I shorten the period so there's less > stuff to write? from the number of buffers written on average I'd assume the > first option is the one to go for but I might miss some bit of reasoning > here... I'm a bit puzzled by all of this because the logs you posted seem to reflect a system under very light load. Each checkpoint is writing no more than 4% of shared_buffers and the sync phases are generally completing in less than one second. I don't see why that would be causing stalls. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
drvillo wrote: > -given the configuration attached (which is basically a vanilla one) and the > number of buffers written at each execution, are these execution times > normal or above average? > Given the configuration attached, most of them are normal. One problem may be that your vanilla configuration has checkpoint_segments set to 3. There is some logic in the checkpoint code to try and spread checkpoint writes out over a longer period of time. The intention is for a slower write spread to disrupt concurrent client activity less. It doesn't work all that well unless you give it some more segments to work with. Also, with the default setting for shared_buffers, you are doing a lot more redundant writes than you should be. The following postgresql.conf changes should improve things for you: shared_buffers=256MB checkpoint_segments=10 wal_buffers=16MB You may have to adjust your kernel shared memory memory settings for that to work. See http://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server for an intro to these and the other common parameters you should consider adjusting. > -in the case of the execution that overruns past the timeout, what are the > implications wrt the client application? > There really aren't any in the database. The server will immediately begin another checkpoint. Some additional disk space is used. So long as the server doesn't run out of disk space from that, clients shouldn't care. > -AFAIU client connections are basically stalled during checkpoints. Is it > reasonable to infer that the fact that the application blocking on a > getConnection() might be related to checkpoints being executed? > It can be. What I suspect is happening during the bad one: 2011-04-22 06:51:41 CEST LOG: checkpoint complete: wrote 108 buffers (2.6%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=409.007 s, sync=4.672 s, total=414.070 s 2011-04-22 06:55:42 CEST LOG: could not receive data from client: No connection could be made because the target machine actively refused it. Is that something is happening on the disks of the server that keeps the database from being able to write efficiently during this checkpoint. It then slows the checkpoint so much that clients are timing out. The tuning changes I suggested will lower the total amount of I/O the server does between checkpoints, which will mean there is less information in the OS cache to write out when the checkpoint comes. That may help, if the problem is really in the database. > -considering some tuning on the PG side, should I try increasing > checkpoint_timeout and rising checkpoint_completion_target to lessen the > impact of IO on the client or should I shorten the period so there's less > stuff to write? from the number of buffers written on average I'd assume the > first option is the one to go for but I might miss some bit of reasoning > here... > Your problems are likely because the operating system cache is getting filled with something that is slowing checkpoints down. Maybe it's the regular database writes during the five minutes between checkpoints; maybe it's something else running on the server. Whatever is happening, you're unlikely to make it better by adjusting how often they happen. Either get the database to write less between checkpoints (like the changes I suggested), or figure out what else is doing the writes. I suspect they are coming from outside the database, only because if you really had high write activity on this server you'd also be having checkpoints more frequently, too. > I've read about > RAID5 not being a wise setup for disks hosting PG, what about RAID1? > The problem with RAID5 is that it lowers write performance of a larger number of disks so it's potentially no better than a single drive. RAID1 is essentially a single drive, too. You may discover you're just running over what one drive can do. Something odd does seem to be doing on though. Normally in your situation I would try to find some system downtime and test the read/write speed of the drives, look for issues there. As Robert said already, you shouldn't be running this slowly unless there's something going wrong. -- Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us "PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books