Thread: Why is checkpoint so costly?

Why is checkpoint so costly?

From

Josh Berkus

Date:

21 June 2005, 18:59:12

Folks,

Going over some performance test results at OSDL, our single greatest 
performance issue seems to be checkpointing.  No matter how I fiddle
with it, checkpoints seem to cost us 1/2 of our throughput while they're
taking place.  Overall, checkpointing costs us about 25% of our
performance on OLTP workloads.

Example: http://khack.osdl.org/stp/302671/results/0/

Can we break down everything that happens during a checkpoint so that we 
can see where this huge cost is coming from?     Checkpointing should be 
limited to fsyncing to disk and marking WAL files as recyclable, but there 
seems to be something more.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Why is checkpoint so costly?

From

Tom Lane

Date:

21 June 2005, 21:16:46

Josh Berkus <josh@agliodbs.com> writes:
> Can we break down everything that happens during a checkpoint so that we 
> can see where this huge cost is coming from?     Checkpointing should be 
> limited to fsyncing to disk and marking WAL files as recyclable, but there 
> seems to be something more.

I already asked you to measure the thing I think is the likely candidate
(to wit, dumping full page images into WAL).
        regards, tom lane

Re: Why is checkpoint so costly?

From

Alvaro Herrera

Date:

21 June 2005, 21:37:24

On Tue, Jun 21, 2005 at 12:00:56PM -0700, Josh Berkus wrote:
> Folks,
> 
> Going over some performance test results at OSDL, our single greatest 
> performance issue seems to be checkpointing.  No matter how I fiddle
> with it, checkpoints seem to cost us 1/2 of our throughput while they're
> taking place.  Overall, checkpointing costs us about 25% of our
> performance on OLTP workloads.
> 
> Example: http://khack.osdl.org/stp/302671/results/0/
> 
> Can we break down everything that happens during a checkpoint so that we 
> can see where this huge cost is coming from?     Checkpointing should be 
> limited to fsyncing to disk and marking WAL files as recyclable, but there 
> seems to be something more.

Not only do you have to fsync the files; you have to write them first as
well.  If the bgwriter is not able to keep up then at checkpoint time
there is a lot of writing to do.  One idea is to fiddle with bgwriter
settings, or did you do that already?  I see this for the URL above:
 bgwriter_delay                 | 200
 bgwriter_maxpages              | 100
 bgwriter_percent               | 1

Maybe it should be more aggressive.
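
For a rough sense of scale, the settings above put a ceiling on how fast
the bgwriter can write.  A minimal sketch, assuming the 8.0-era meaning of
these parameters (bgwriter_delay is the milliseconds between rounds,
bgwriter_maxpages caps the pages written per round, bgwriter_percent is the
share of currently dirty buffers written per round) and the default 8 kB
block size.  The function name and the dirty-buffer count are illustrative
and not taken from the test run:

```python
import math

BLOCK_SIZE_BYTES = 8192  # assumed default block size


def bgwriter_pages_per_second(delay_ms, maxpages, percent, dirty_buffers):
    """Upper bound on pages per second written by the background writer.

    Ignores the time the writes themselves take, so the real rate is lower.
    """
    rounds_per_second = 1000.0 / delay_ms
    # Each round writes at most `percent` of the dirty buffers (rounded up)
    # and never more than `maxpages`.
    per_round = min(maxpages, math.ceil(dirty_buffers * percent / 100.0))
    return rounds_per_second * per_round


# Settings from the run above; 5000 dirty buffers is a made-up figure.
rate = bgwriter_pages_per_second(
    delay_ms=200, maxpages=100, percent=1, dirty_buffers=5000
)
print(f"{rate:.0f} pages/s, {rate * BLOCK_SIZE_BYTES / 1e6:.2f} MB/s")
# prints "250 pages/s, 2.05 MB/s"
```

Under these assumptions even the maxpages cap allows only 500 pages/s
(about 4 MB/s), so a busy OLTP load could plausibly dirty buffers faster
than that.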

Another thing to blame is the dump-whole-pages-after-checkpoint
business.  Maybe the load you are seeing is not completely during
checkpoint, but right after it as well.  How do you tell from the
results that the checkpoint is complete?
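
To illustrate why the cost could extend past the checkpoint itself: the
first change to each page after a checkpoint puts the whole page into WAL,
while later changes to the same page log only a small record.  A toy
model, assuming an 8 kB page and a hypothetical 100-byte ordinary record
(both sizes, the uniform access pattern, and all names are illustrative,
not measured):

```python
import random

PAGE_SIZE = 8192    # assumed block size in bytes
RECORD_SIZE = 100   # hypothetical size of an ordinary WAL record


def wal_bytes_per_interval(num_pages, updates_per_interval, intervals, seed=0):
    """Return the WAL bytes generated in each interval after a checkpoint."""
    rng = random.Random(seed)
    imaged = set()  # pages already dumped in full since the checkpoint
    volumes = []
    for _interval in range(intervals):
        total = 0
        for _update in range(updates_per_interval):
            page = rng.randrange(num_pages)
            if page not in imaged:
                imaged.add(page)
                total += PAGE_SIZE  # first touch: whole page goes into WAL
            total += RECORD_SIZE    # every update logs its own record
        volumes.append(total)
    return volumes


for i, volume in enumerate(wal_bytes_per_interval(1000, 500, 6)):
    print(f"interval {i}: {volume / 1024:.0f} kB of WAL")
```

In this model the WAL volume is highest right after the checkpoint and
decays as more pages have had their image written, which is the shape a
post-checkpoint recovery curve would have.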

-- 
Alvaro Herrera (<alvherre[a]surnet.cl>)
"El miedo atento y previsor es la madre de la seguridad" (E. Burke)

Re: Why is checkpoint so costly?

From

Josh Berkus

Date:

21 June 2005, 21:43:33

Alvaro, Tom,

>  bgwriter_delay                 | 200
>  bgwriter_maxpages              | 100
>  bgwriter_percent               | 1
>
> Maybe it should be more aggressive.

Yeah, a bgwriter progression is running now.  I don't expect it to make 
much difference.  Most of the sync impact is syncing the FS cache, which the
bgwriter doesn't touch.

> Another thing to blame is the dump-whole-pages-after-checkpoint
> business.  Maybe the load you are seeing is not completely during
> checkpoint, but right after it as well.  How do you tell from the
> results that the checkpoint is complete?

I can't relate that to the performance numbers, unfortunately.  I think 
that the paging is probably the cause, but I don't know what to do about 
it.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Why is checkpoint so costly?

From

Alvaro Herrera

Date:

21 June 2005, 23:18:15

On Tue, Jun 21, 2005 at 02:45:32PM -0700, Josh Berkus wrote:

> > Another thing to blame is the dump-whole-pages-after-checkpoint
> > business.  Maybe the load you are seeing is not completely during
> > checkpoint, but right after it as well.  How do you tell from the
> > results that the checkpoint is complete?
> 
> I can't relate that to the performance numbers, unfortunately.  I think 
> that the paging is probably the cause, but I don't know what to do about 
> it.

Tom gave instructions in a mail (to you I think) to patch the xlog.c
file so page dumps stop happening.  I'm too lazy to search for that mail
(I deleted my local copy) but if you find it in your mailbox, resend it
to me and I'll produce a patch for you to test.  (I'd produce the patch
myself but I don't know the xlog code well enough to find the right spot
quickly.)

-- 
Alvaro Herrera (<alvherre[a]surnet.cl>)
Jason Tesser: You might not have understood me or I am not understanding you.
Paul Thomas: It feels like we're 2 people divided by a common language...

Re: Why is checkpoint so costly?

From

Josh Berkus

Date:

21 June 2005, 23:29:01

Alvaro,

> Tom gave instructions in a mail (to you I think) to patch the xlog.c
> file so page dumps stop happening.  I'm too lazy to search for that mail
> (I deleted my local copy) but if you find it in your mailbox, resend it
> to me and I'll produce a patch for you to test.  (I'd produce the patch
> myself but I don't know the xlog code well enough to find the right spot
> quickly.)

Found it.  Testing now.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Why is checkpoint so costly?

From

Greg Stark

Date:

22 June 2005, 19:39:19

Josh Berkus <josh@agliodbs.com> writes:

> Folks,
> 
> Going over some performance test results at OSDL, our single greatest 
> performance issue seems to be checkpointing.  No matter how I fiddle
> with it, checkpoints seem to cost us 1/2 of our throughput while they're
> taking place.  Overall, checkpointing costs us about 25% of our
> performance on OLTP workloads.

I think this is a silly statement. *Of course* checkpointing is a big
performance "issue". Checkpointing basically *is* what the database's job is.
It stores data; checkpointing is the name for the process of storing the data.

Looking at the performance without counting the checkpoint time is cheating;
the database hasn't actually completed processing the data; it's still sitting
in the pipeline of the WAL log.

The question should be why is there any time when a checkpoint *isn't*
happening? For maximum performance the combination of bgwriter (basically
preemptive checkpoint i/o) and the actual checkpoint i/o should be executing
at a more or less even pace throughout the time interval between checkpoints.
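
As a back-of-the-envelope sketch of that point: the same number of dirty
pages demands a far lower write rate when spread over the whole interval
than when flushed in a burst at checkpoint time.  The page count, the
interval, and the burst length below are made-up figures, and the model
ignores pages that get dirtied more than once:

```python
def write_rates(dirty_pages, interval_s, burst_s):
    """Return (burst_rate, paced_rate) in pages per second.

    burst_rate: all pages written during a checkpoint lasting burst_s.
    paced_rate: the same pages written evenly across interval_s.
    """
    burst_rate = dirty_pages / burst_s
    paced_rate = dirty_pages / interval_s
    return burst_rate, paced_rate


# Hypothetical: 60000 dirty pages, 5 minutes between checkpoints,
# and a checkpoint that takes 30 seconds when nothing was pre-written.
burst, paced = write_rates(dirty_pages=60000, interval_s=300, burst_s=30)
print(f"burst: {burst:.0f} pages/s, paced: {paced:.0f} pages/s")
# prints "burst: 2000 pages/s, paced: 200 pages/s"
```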

I do have one suggestion. Is the WAL log on a separate set of drives from the
data files? If not then the checkpoint (and bgwriter i/o) will hurt WAL log
performance by forcing the drive heads to move away from their sequential
writing of WAL logs.

That said, does checkpointing (and bgwriter i/o) require rereading the WAL
logs? If so, then whenever the buffers aren't found in cache it'll cause some
increase in seek latency from that alone, even if the WAL does have a dedicated
set of drives.

-- 
greg

Re: Why is checkpoint so costly?

From

Tom Lane

Date:

22 June 2005, 19:59:49

Greg Stark <gsstark@mit.edu> writes:
> The question should be why is there any time when a checkpoint *isn't*
> happening? For maximum performance the combination of bgwriter (basically
> preemptive checkpoint i/o) and the actual checkpoint i/o should be executing
> at a more or less even pace throughout the time interval between checkpoints.

I think Josh's complaint has to do with the fact that performance
remains visibly affected after the checkpoint is over.  (It'd be nice
if those TPM graphs could be marked with the actual checkpoint begin
and end instants, so we could confirm or deny that we are looking at a
post-checkpoint recovery curve and not some very weird behavior inside
the checkpoint.)  It's certainly true that tuning the bgwriter ought to
help in reducing the amount of I/O done by a checkpoint, but why is
there a persistent effect?

> That said, does checkpointing (and bgwriter i/o) require rereading the WAL
> logs?

No.  In fact, the WAL is never read at all, except during actual
post-crash recovery.
        regards, tom lane