Thread: Why is checkpoint so costly?
Folks,

Going over some performance test results at OSDL, our single greatest performance issue seems to be checkpointing. No matter how I fiddle with it, checkpoints seem to cost us 1/2 of our throughput while they're taking place. Overall, checkpointing costs us about 25% of our performance on OLTP workloads.

Example: http://khack.osdl.org/stp/302671/results/0/

Can we break down everything that happens during a checkpoint so that we can see where this huge cost is coming from? Checkpointing should be limited to fsyncing to disk and marking WAL files as recyclable, but there seems to be something more.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
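As a rough check on how those two figures fit together, here is a minimal sketch of the arithmetic. It assumes a simple two-state model that is not in the original post: throughput is either at its full rate or reduced by a fixed share. The function name is made up for illustration.

```python
def fraction_of_time_degraded(overall_loss, loss_while_degraded):
    """Share of the run spent at reduced throughput.

    Assumed model: overall_loss = fraction * loss_while_degraded,
    i.e. throughput is either full or reduced by a fixed share.
    """
    if not 0 < loss_while_degraded <= 1:
        raise ValueError("loss_while_degraded must be in (0, 1]")
    return overall_loss / loss_while_degraded


# Figures from the post: 1/2 of throughput lost while degraded, 25% lost overall.
frac = fraction_of_time_degraded(overall_loss=0.25, loss_while_degraded=0.5)
print(f"{frac:.0%} of the run is spent at reduced throughput")
```

Under that model, a 50% drop that costs 25% overall implies the system runs at reduced throughput for about half of the test. Either checkpoints cover half of the run, or the slowdown extends beyond the checkpoints themselves.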
Josh Berkus <josh@agliodbs.com> writes:
> Can we break down everything that happens during a checkpoint so that we
> can see where this huge cost is coming from? Checkpointing should be
> limited to fsyncing to disk and marking WAL files as recyclable, but there
> seems to be something more.

I already asked you to measure the thing I think is the likely candidate (to wit, dumping full page images into WAL).

regards, tom lane
On Tue, Jun 21, 2005 at 12:00:56PM -0700, Josh Berkus wrote:
> Folks,
>
> Going over some performance test results at OSDL, our single greatest
> performance issue seems to be checkpointing. No matter how I fiddle
> with it, checkpoints seem to cost us 1/2 of our throughput while they're
> taking place. Overall, checkpointing costs us about 25% of our
> performance on OLTP workloads.
>
> Example: http://khack.osdl.org/stp/302671/results/0/
>
> Can we break down everything that happens during a checkpoint so that we
> can see where this huge cost is coming from? Checkpointing should be
> limited to fsyncing to disk and marking WAL files as recyclable, but there
> seems to be something more.

Not only do you have to fsync the files; you have to write them first as well. If the bgwriter is not able to keep up, then at checkpoint time there is a lot of writing to do. One idea is to fiddle with the bgwriter settings, or did you do that already? I see this for the URL above:

bgwriter_delay    | 200
bgwriter_maxpages | 100
bgwriter_percent  | 1

Maybe it should be more aggressive.

Another thing to blame is the dump-whole-pages-after-checkpoint business. Maybe the load you are seeing is not completely during checkpoint, but right after it as well. How do you tell from the results that the checkpoint is complete?

--
Alvaro Herrera (<alvherre[a]surnet.cl>)
"Early and provident fear is the mother of safety" (E. Burke)
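To get a feel for how much writing the settings quoted above allow, here is a back-of-the-envelope sketch. It assumes bgwriter_delay is the sleep between rounds in milliseconds, bgwriter_maxpages caps the pages written per round, and pages are 8 kB (the default block size). The function name is hypothetical. bgwriter_percent and the time spent actually writing are ignored, so the result is only an upper bound.

```python
PAGE_SIZE_BYTES = 8192  # assumption: default 8 kB block size


def bgwriter_ceiling(delay_ms, maxpages, page_size=PAGE_SIZE_BYTES):
    """Upper bound on bgwriter output as (pages per second, bytes per second).

    Ignores bgwriter_percent and the duration of each write round,
    both of which can only lower the real rate.
    """
    rounds_per_sec = 1000 / delay_ms
    pages_per_sec = rounds_per_sec * maxpages
    return pages_per_sec, pages_per_sec * page_size


# Settings reported for the test run: delay 200 ms, maxpages 100.
pages, nbytes = bgwriter_ceiling(delay_ms=200, maxpages=100)
print(f"at most {pages:.0f} pages/s, about {nbytes / 2**20:.1f} MiB/s")
```

If the workload dirties pages faster than this ceiling, the remainder is left for the checkpoint to write, which is one way to read "maybe it should be more aggressive".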
Alvaro, Tom,

> bgwriter_delay | 200
> bgwriter_maxpages | 100
> bgwriter_percent | 1
>
> Maybe it should be more aggressive.

Yeah, a bgwriter progression is running now. I don't expect it to make much difference. Most of the sync impact is syncing the FS cache, which the bgwriter doesn't touch.

> Another thing to blame is the dump-whole-pages-after-checkpoint
> business. Maybe the load you are seeing is not completely during
> checkpoint, but right after it as well. How do you tell from the
> results that the checkpoint is complete?

I can't relate that to the performance numbers, unfortunately. I think that the paging is probably the cause, but I don't know what to do about it.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
On Tue, Jun 21, 2005 at 02:45:32PM -0700, Josh Berkus wrote:
> > Another thing to blame is the dump-whole-pages-after-checkpoint
> > business. Maybe the load you are seeing is not completely during
> > checkpoint, but right after it as well. How do you tell from the
> > results that the checkpoint is complete?
>
> I can't relate that to the performance numbers, unfortunately. I think
> that the paging is probably the cause, but I don't know what to do about
> it.

Tom gave instructions in a mail (to you I think) to patch the xlog.c file so page dumps stop happening. I'm too lazy to search for that mail (I deleted my local copy) but if you find it in your mailbox, resend it to me and I'll produce a patch for you to test. (I'd produce the patch myself but I don't know the xlog code well enough to find the right spot quickly.)

--
Alvaro Herrera (<alvherre[a]surnet.cl>)
Jason Tesser: You might not have understood me or I am not understanding you.
Paul Thomas: It feels like we're 2 people divided by a common language...
Alvaro,

> Tom gave instructions in a mail (to you I think) to patch the xlog.c
> file so page dumps stop happening. I'm too lazy to search for that mail
> (I deleted my local copy) but if you find it in your mailbox, resend it
> to me and I'll produce a patch for you to test. (I'd produce the patch
> myself but I don't know the xlog code well enough to find the right spot
> quickly.)

Found it. Testing now.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Josh Berkus <josh@agliodbs.com> writes:
> Folks,
>
> Going over some performance test results at OSDL, our single greatest
> performance issue seems to be checkpointing. No matter how I fiddle
> with it, checkpoints seem to cost us 1/2 of our throughput while they're
> taking place. Overall, checkpointing costs us about 25% of our
> performance on OLTP workloads.

I think this is a silly statement. *Of course* checkpointing is a big performance "issue". Checkpointing basically *is* what the database's job is. It stores data; checkpointing is the name for the process of storing the data. Looking at the performance without counting the checkpoint time is cheating; the database hasn't actually completed processing the data, which is still sitting in the pipeline of the WAL log.

The question should be why is there any time when a checkpoint *isn't* happening? For maximum performance the combination of bgwriter (basically preemptive checkpoint i/o) and the actual checkpoint i/o should be executing at a more or less even pace throughout the time interval between checkpoints.

I do have one suggestion. Is the WAL log on a separate set of drives from the data files? If not, then the checkpoint (and bgwriter i/o) will hurt WAL log performance by forcing the drive heads to move away from their sequential writing of WAL logs.

That said, does checkpointing (and bgwriter i/o) require rereading the WAL logs? If so, and the buffers aren't found in cache, it'll cause some increase in seek latency just from that, even with a dedicated set of drives.

--
greg
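The even-pace idea can be sketched as simple arithmetic: spread the pages dirtied in one checkpoint interval across the bgwriter rounds that fit in that interval. The dirty-page count and interval below are made-up illustrative inputs, not measurements from the OSDL run, and the helper name is hypothetical.

```python
import math


def pages_per_round_for_even_pace(dirty_pages, interval_s, delay_ms):
    """Pages each bgwriter round must write to finish within one interval.

    dirty_pages: pages dirtied between two checkpoints (illustrative input)
    interval_s:  time between checkpoints, in seconds
    delay_ms:    sleep between bgwriter rounds, in milliseconds
    """
    rounds = interval_s * 1000 / delay_ms
    return math.ceil(dirty_pages / rounds)


# Hypothetical load: 150,000 dirty pages per 300 s interval, 200 ms delay.
quota = pages_per_round_for_even_pace(dirty_pages=150_000, interval_s=300, delay_ms=200)
print(f"{quota} pages per round keeps the writes evenly spread")
```

Any per-round limit below this quota leaves a backlog that has to be flushed in a burst at checkpoint time.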
Greg Stark <gsstark@mit.edu> writes:
> The question should be why is there any time when a checkpoint *isn't*
> happening? For maximum performance the combination of bgwriter (basically
> preemptive checkpoint i/o) and the actual checkpoint i/o should be executing
> at a more or less even pace throughout the time interval between checkpoints.

I think Josh's complaint has to do with the fact that performance remains visibly affected after the checkpoint is over. (It'd be nice if those TPM graphs could be marked with the actual checkpoint begin and end instants, so we could confirm or deny that we are looking at a post-checkpoint recovery curve and not some very weird behavior inside the checkpoint.) It's certainly true that tuning the bgwriter ought to help in reducing the amount of I/O done by a checkpoint, but why is there a persistent effect?

> That said, does checkpointing (and bgwriter i/o) require rereading the WAL
> logs?

No. In fact, the WAL is never read at all, except during actual post-crash recovery.

regards, tom lane
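One way to act on the suggestion of marking checkpoint begin and end instants is to bucket the throughput samples by where they fall relative to each checkpoint. This is only a sketch: the sample format, the function name, the length of the "after" window, and all numbers are assumptions for illustration, and the checkpoint times would have to be recorded separately during the run.

```python
from statistics import mean


def split_by_checkpoint(samples, checkpoints, tail_s):
    """Average TPM during, shortly after, and away from checkpoints.

    samples:     list of (time_s, tpm) pairs
    checkpoints: list of (begin_s, end_s) pairs, end exclusive
    tail_s:      how long after each checkpoint still counts as "after"
    """
    buckets = {"during": [], "after": [], "steady": []}
    for t, tpm in samples:
        if any(begin <= t < end for begin, end in checkpoints):
            buckets["during"].append(tpm)
        elif any(end <= t < end + tail_s for _, end in checkpoints):
            buckets["after"].append(tpm)
        else:
            buckets["steady"].append(tpm)
    return {name: mean(vals) if vals else None for name, vals in buckets.items()}


# Made-up numbers for illustration only; not taken from the OSDL results.
samples = [(0, 1000), (60, 1000), (120, 500), (180, 500),
           (240, 700), (300, 900), (360, 1000)]
result = split_by_checkpoint(samples, checkpoints=[(120, 240)], tail_s=120)
print(result)
```

If the "after" average sits well below the "steady" one, that points at a post-checkpoint recovery curve rather than at work done inside the checkpoint.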