Thread: Checkpoint question

Checkpoint question

From
Qingqing Zhou
Date:
I understand the checkpoint code does something like this:

    Get RedoRecPtr;
    Flush all dirty buffers, no matter what their LSNs are;
    Write down the checkpoint xlog record;

So I wonder: is it possible to flush only the dirty buffers with LSN <
RedoRecPtr, to reduce the checkpoint-caused delay? Because even if we flush
every dirty buffer, we still have to replay from the RedoRecPtr. Of course,
this would only apply to non-critical checkpoints (unlike critical ones
such as startup and shutdown).
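For concreteness, the sequence above could be modeled like this (a toy
Python sketch, not the actual PostgreSQL code; all names are invented):

```python
# Toy model of the checkpoint sequence described above.
class Buffer:
    def __init__(self, page_lsn, dirty):
        self.page_lsn = page_lsn
        self.dirty = dirty

    def flush(self):
        self.dirty = False

def checkpoint(buffer_pool, current_lsn):
    redo_rec_ptr = current_lsn        # 1. Get RedoRecPtr
    for buf in buffer_pool:           # 2. Flush every dirty buffer,
        if buf.dirty:                 #    no matter what its LSN is
            buf.flush()
    return redo_rec_ptr               # 3. stands in for the checkpoint record

pool = [Buffer(page_lsn=10, dirty=True), Buffer(page_lsn=150, dirty=True)]
redo = checkpoint(pool, current_lsn=100)
assert redo == 100 and not any(b.dirty for b in pool)
```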

Regards,
Qingqing



Re: Checkpoint question

From
Simon Riggs
Date:
On Wed, 2006-01-11 at 18:24 -0500, Qingqing Zhou wrote:
> I understand checkpoint code doing something like this:
> 
>     Get RedoRecPtr;
>     Flush all dirty buffers no matter what's its LSN;
>     Write down checkpoint xlog record;
> 
> So I wonder is it possible flush only dirty buffers with LSN < RedoRecPtr
> to improve checkpoint caused delay? Because even we flush every dirty
> buffers, we still have to replay from the RedoRecPtr. Of course, this only
> applies to non-critical checkpoints (critical ones like startup and
> shutdown).

Probably a good idea to read the Gray & Reuter or Weikum & Vossen books on
transactional systems theory before any such discussion.

Incidentally, it was suggested to me that we write odd/even numbered
blocks on alternate checkpoints as a way of reducing checkpoint impact.
Apparently this has been implemented on another RDBMS in a galaxy far,
far away. But I have enough to do right now.

Best Regards, Simon Riggs



Re: Checkpoint question

From
Qingqing Zhou
Date:

On Wed, 11 Jan 2006, Simon Riggs wrote:

>
> Probably a good idea to read the Gray & Reuter or Weikum & Vossen books on
> transactional systems theory before any such discussion.
>
So can you give me some hints as to why my thoughts are wrong?

Regards,
Qingqing


Re: Checkpoint question

From
Tom Lane
Date:
Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> So I wonder is it possible flush only dirty buffers with LSN < RedoRecPtr
> to improve checkpoint caused delay?

Certainly not.  If LSN > RedoRecPtr then you know the buffer contains
some changes more recent than the checkpoint, but you cannot tell
whether it also contains changes older than the checkpoint.  For
correctness you must flush it.
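The point can be seen in a toy model (illustrative Python, invented names):
a page records only its newest change LSN, so a large LSN says nothing
about older changes still sitting in the same buffer.

```python
# A page stores only its *latest* modification LSN, so
# page_lsn > RedoRecPtr cannot prove the page holds no older changes.
redo_rec_ptr = 100
page_changes = [90, 120]      # this page was modified before AND after
page_lsn = max(page_changes)  # ...but only LSN 120 is recorded on the page

# The proposed "flush only if LSN < RedoRecPtr" rule would skip this page:
assert page_lsn >= redo_rec_ptr

# Yet replay starts at redo_rec_ptr, so the LSN-90 change would never be
# regenerated; skipping the flush loses it. Hence the buffer must be written.
lost_if_skipped = [lsn for lsn in page_changes if lsn < redo_rec_ptr]
assert lost_if_skipped == [90]
```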

It'd be possible to do something like this: after establishing
RedoRecPtr, make one quick pass through the buffers and make a list of
what needs to be dumped at that instant.  Then go back and do the actual
I/O for only those buffers.  I'm dubious that this will really improve
matters though, as the net effect is just to postpone I/O that will
happen anyway soon after the checkpoint (as soon as the bgwriter goes
back to normal activity).
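A toy sketch of the two-pass idea (illustrative Python; names invented),
showing that buffers dirtied after the first pass are deferred rather than
flushed:

```python
# Two-pass checkpoint sketch: snapshot the dirty set, then do the I/O.
class Buffer:
    def __init__(self, dirty=False):
        self.dirty = dirty
        self.writes = 0

    def flush(self):
        self.dirty = False
        self.writes += 1

def two_pass_checkpoint(pool, dirty_during=()):
    # Pass 1: cheap scan -- snapshot which buffers are dirty right now.
    to_flush = [b for b in pool if b.dirty]
    # Writes arriving while the checkpoint runs dirty further buffers...
    for b in dirty_during:
        b.dirty = True
    # Pass 2: do the actual I/O only for the snapshotted buffers; anything
    # dirtied after pass 1 is left for the bgwriter / next checkpoint.
    for b in to_flush:
        b.flush()
    return len(to_flush)

pool = [Buffer(dirty=True), Buffer(), Buffer(dirty=True)]
n = two_pass_checkpoint(pool, dirty_during=[pool[1]])
assert n == 2           # only the two originally dirty buffers were written
assert pool[1].dirty    # the late-dirtied buffer's write is deferred
```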
        regards, tom lane


Re: Checkpoint question

From
Qingqing Zhou
Date:

On Wed, 11 Jan 2006, Tom Lane wrote:

> Qingqing Zhou <zhouqq@cs.toronto.edu> writes:
> > So I wonder is it possible flush only dirty buffers with LSN < RedoRecPtr
> > to improve checkpoint caused delay?
>
> Certainly not.  If LSN > RedoRecPtr then you know the buffer contains
> some changes more recent than the checkpoint, but you cannot tell
> whether it also contains changes older than the checkpoint.  For
> correctness you must flush it.
>
Right.

> It'd be possible to do something like this: after establishing
> RedoRecPtr, make one quick pass through the buffers and make a list of
> what needs to be dumped at that instant.  Then go back and do the actual
> I/O for only those buffers.  I'm dubious that this will really improve
> matters though, as the net effect is just to postpone I/O that will
> happen anyway soon after the checkpoint (as soon as the bgwriter goes
> back to normal activity).
>
Looks like a good idea. I don't worry too much about the problem you
mentioned. AFAIK, a checkpoint has two goals: (1) clean up the buffer
pool; (2) reduce recovery time.

For (2), it is clear that the above idea will work, since recovery will
always read the data page to check its LSN -- that is the source of the
cost. For (1), we have the bgwriter, and part of the reason it was
designed is to clean up the buffer pool.

On a separate matter, it would be interesting to add an I/O-status detector
to the bgwriter so that it knows when the disk is idle and a cleanup pass is
worthwhile.

Regards,
Qingqing


Re: Checkpoint question

From
Qingqing Zhou
Date:

On Wed, 11 Jan 2006, Tom Lane wrote:

>
> It'd be possible to do something like this: after establishing
> RedoRecPtr, make one quick pass through the buffers and make a list of
> what needs to be dumped at that instant.  Then go back and do the actual
> I/O for only those buffers.  I'm dubious that this will really improve
> matters though, as the net effect is just to postpone I/O that will
> happen anyway soon after the checkpoint (as soon as the bgwriter goes
> back to normal activity).
>

We could extend the grammar to CHECKPOINT [FULL/PARTIAL] to let the user
decide which to do. An important reason to choose PARTIAL would be fast
recovery.

Regards,
Qingqing


Re: Checkpoint question

From
Simon Riggs
Date:
On Wed, 2006-01-11 at 20:46 -0500, Qingqing Zhou wrote:
> 
> On Wed, 11 Jan 2006, Simon Riggs wrote:
> 
> >
> > Probably a good idea to read the Gray & Reuter or Weikum & Vossen books on
> > transactional systems theory before any such discussion.
> >
> So can you give me some hints why my thoughts are just wrong?

Your thoughts are very often good ones, ISTM. This area is a minefield
of misunderstanding, as are many others, but in this case some external
wisdom is readily available. Those books helped me understand things
better; it was wrong of me to assume you hadn't already read them.

Best Regards, Simon Riggs



Re: Checkpoint question

From
Simon Riggs
Date:
On Wed, 2006-01-11 at 22:33 -0500, Qingqing Zhou wrote: 
> On Wed, 11 Jan 2006, Tom Lane wrote:

> > It'd be possible to do something like this: after establishing
> > RedoRecPtr, make one quick pass through the buffers and make a list of
> > what needs to be dumped at that instant.  Then go back and do the actual
> > I/O for only those buffers.  I'm dubious that this will really improve
> > matters though, as the net effect is just to postpone I/O that will
> > happen anyway soon after the checkpoint (as soon as the bgwriter goes
> > back to normal activity).
> >
> Looks like a good idea. I don't worry too much about the problem you
> mentioned. AFAIK, checkpoint has two targets: (1) cleanup buffer pool; (2)
> reduce recovery time;

I think it's a good idea, but I agree it does not save much in practice.

The only buffers this will miss are ones that were clean throughout the
whole of the last checkpoint cycle, yet have been dirtied between the
start of the checkpoint pass and when the pass reaches it. Given the
relative durations of those two intervals, I would guess that this would
yield very few buffers.

Further, a buffer missed by one checkpoint cannot avoid being written at
the next. If the buffer is written to again in the next checkpoint cycle,
then we combine the two I/Os and save effort. If the buffer is not written
to in the next cycle -- and this seems likely, since it wasn't written to
in the last -- we do not avoid the I/O, we just defer it to the next
checkpoint.

So the only buffer I/O we would save is for buffers that
- are not written to in checkpoint cycle n (by definition)
- are written to *during* the checkpoint
- are written to again during the next checkpoint cycle, n+1

You could do the math, or measure it, though my guess is that this
wouldn't save more than a few percentage points on the checkpoint
process.
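The three conditions above can be written down as a predicate (purely
illustrative; the argument is the content, not the code):

```python
# A checkpoint write is avoided outright only when all three hold.
def checkpoint_write_saved(dirtied_in_cycle_n,
                           dirtied_during_checkpoint,
                           dirtied_in_cycle_n_plus_1):
    return (not dirtied_in_cycle_n
            and dirtied_during_checkpoint
            and dirtied_in_cycle_n_plus_1)

# The one combination that truly saves a write:
assert checkpoint_write_saved(False, True, True)
# A buffer dirtied during the checkpoint but untouched afterwards merely
# has its write deferred to the next checkpoint, not saved:
assert not checkpoint_write_saved(False, True, False)
```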

To compile the list, you'd need to stop all buffer write activity while
you compile it, which sounds a high price for the benefit.

> For (2), it is clear that the above idea will work, since recovery will
> always read the data page to check its LSN -- that is the source of the
> cost. For (1), we have the bgwriter, and part of the reason it was
> designed is to clean up the buffer pool.

Deferring I/O gains us nothing in the long run, though it would speed up
recovery time by a fraction -- but then, crash recovery time is not much
of an issue, is it? If it is, there are other optimizations.

Best Regards, Simon Riggs



Re: Checkpoint question

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
>> On Wed, 11 Jan 2006, Tom Lane wrote:
>>> It'd be possible to do something like this: after establishing
>>> RedoRecPtr, make one quick pass through the buffers and make a list of
>>> what needs to be dumped at that instant.  Then go back and do the actual
>>> I/O for only those buffers.

> To compile the list, you'd need to stop all buffer write activity while
> you compile it, which sounds a high price for the benefit.

Not really --- I was only thinking of narrowing the window for "extra"
writes to get in, not removing the window entirely.  Don't need any sort
of global lock for that.

But I agree with your analysis that the extra cycles won't save much in
practice.  The objection I see is that two lock cycles on each targeted
buffer are a nontrivial expense in SMP machines.
        regards, tom lane


Re: Checkpoint question

From
Qingqing Zhou
Date:

On Thu, 12 Jan 2006, Simon Riggs wrote:
>
> The only buffers this will miss are ones that were clean throughout the
> whole of the last checkpoint cycle, yet have been dirtied between the
> start of the checkpoint pass and when the pass reaches it.

I agree with the analysis, but I am not sure what the current checkpoint
interval is, so it depends. On an I/O-intensive machine, I would guess the
interval would not be small. Also, in that environment one more round of
lock cycles should be relatively cheap. But we don't currently have any
numbers on hand ...

Regards,
Qingqing




Re: Checkpoint question

From
"Jim C. Nasby"
Date:
On Thu, Jan 12, 2006 at 04:50:30AM -0500, Qingqing Zhou wrote:
> 
> 
> On Thu, 12 Jan 2006, Simon Riggs wrote:
> >
> > The only buffers this will miss are ones that were clean throughout the
> > whole of the last checkpoint cycle, yet have been dirtied between the
> > start of the checkpoint pass and when the pass reaches it.
> 
> I agree with the analysis, but I am not sure what the current checkpoint
> interval is, so it depends. On an I/O-intensive machine, I would guess the
> interval would not be small. Also, in that environment one more round of
> lock cycles should be relatively cheap. But we don't currently have any
> numbers on hand ...

It sounds like worrying about this would be much more interesting on a
machine that is seeing both a fairly heavy IO load (meaning checkpoint
will both take longer and affect other workloads more) and is seeing a
pretty high rate of buffer updates (meaning that we'd likely do a bunch
of extra work as part of the checkpoint if we didn't take note of
exactly what buffers needed to be flushed). Unfortunately I don't think
there's any way for the backend to know much about either condition
right now, so it couldn't decide when it made sense to make a list of
buffers to flush. Maybe in the future...

As for the questionable benefit of delaying work for the bgwriter or the
next checkpoint, I think there are a number of scenarios where it would
make sense. A simple example is doing some kind of I/O-intensive
processing once a minute, with default checkpoint timing. Sometimes a
checkpoint will occur at the same time as the once-a-minute process, and
in those cases reducing the amount of work the checkpoint does will
definitely help even out the load on the machine.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461


Re: Checkpoint question

From
Qingqing Zhou
Date:

On Thu, 12 Jan 2006, Jim C. Nasby wrote:

>
> It sounds like worrying about this would be much more interesting on a
> machine that is seeing both a fairly heavy IO load (meaning checkpoint
> will both take longer and affect other workloads more) and is seeing a
> pretty high rate of buffer updates (meaning that we'd likely do a bunch
> of extra work as part of the checkpoint if we didn't take note of
> exactly what buffers needed to be flushed). Unfortunately I don't think
> there's any way for the backend to know much about either condition
> right now, so it couldn't decide when it made sense to make a list of
> buffers to flush. Maybe in the future...
>

The scenario you mentioned happens in many OLTP applications. There is no
need for the backend to know this -- we can leave the decision to the DBA:
CHECKPOINT FULL or CHECKPOINT PARTIAL. If you have some machines on which
you can observe CHECKPOINT duration, that would be sweet.

Regards,
Qingqing



Re: Checkpoint question

From
"Jim C. Nasby"
Date:
On Thu, Jan 12, 2006 at 05:00:49PM -0500, Qingqing Zhou wrote:
> 
> 
> On Thu, 12 Jan 2006, Jim C. Nasby wrote:
> 
> >
> > It sounds like worrying about this would be much more interesting on a
> > machine that is seeing both a fairly heavy IO load (meaning checkpoint
> > will both take longer and affect other workloads more) and is seeing a
> > pretty high rate of buffer updates (meaning that we'd likely do a bunch
> > of extra work as part of the checkpoint if we didn't take note of
> > exactly what buffers needed to be flushed). Unfortunately I don't think
> > there's any way for the backend to know much about either condition
> > right now, so it couldn't decide when it made sense to make a list of
> > buffers to flush. Maybe in the future...
> >
> 
> The scenario you mentioned happens in many OLTP applications. There is no
> need for the backend to know this -- we can leave the decision to the DBA:
> CHECKPOINT FULL or CHECKPOINT PARTIAL. If you have some machines on which
> you can observe CHECKPOINT duration, that would be sweet.

Maybe I'm missing something here, but wouldn't that only help if you
were manually issuing checkpoints?
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461