Thread: Re: CHECKPOINT unlogged data

Re: CHECKPOINT unlogged data

From

Andres Freund

Date:

30 May, 19:39:02

Hi,

On 2025-05-30 18:17:45 +0200, Christoph Berg wrote:
> A customer reported to use CHECKPOINT before shutdowns to make the
> shutdowns themselves faster and asked if it was possible to make
> CHECKPOINT optionally also write out unlogged table data for that
> purpose. I think the idea makes sense, so there's the patch.

I've seen the need for this quite a few times, fwiw. So +1 to the idea.

Greetings,

Andres Freund

Re: CHECKPOINT unlogged data

From

Andres Freund

Date:

30 May, 20:02:41

Hi,

On May 30, 2025 12:55:28 PM EDT, Christoph Berg <myon@debian.org> wrote:
>Re: Andres Freund
>> Hi,
>>
>> On 2025-05-30 18:17:45 +0200, Christoph Berg wrote:
>> > A customer reported to use CHECKPOINT before shutdowns to make the
>> > shutdowns themselves faster and asked if it was possible to make
>> > CHECKPOINT optionally also write out unlogged table data for that
>> > purpose. I think the idea makes sense, so there's the patch.
>>
>> I've seen the need for this quite a few times, fwiw. So +1 to the idea.
>
>Do we want to officially mention this use case in checkpoint.sgml? I've
>already replaced the "is not intended for use during normal operation"
>wording by "not required during normal operation", but we might go one
>step further and endorse it.

Yes, I think it's good to mention what it is useful for.

Greetings,

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

Re: CHECKPOINT unlogged data

From

Nathan Bossart

Date:

30 May, 20:07:48

On Fri, May 30, 2025 at 12:39:02PM -0400, Andres Freund wrote:
> On 2025-05-30 18:17:45 +0200, Christoph Berg wrote:
>> A customer reported to use CHECKPOINT before shutdowns to make the
>> shutdowns themselves faster and asked if it was possible to make
>> CHECKPOINT optionally also write out unlogged table data for that
>> purpose. I think the idea makes sense, so there's the patch.
> 
> I've seen the need for this quite a few times, fwiw. So +1 to the idea.

This patch also adds an IMMEDIATE option, which I proposed some time ago
[0].  I ended up withdrawing it due to general skepticism about its
usefulness.  FWIW I have no concerns about adding a few retail options to
CHECKPOINT, but others might balk at options without solid use-cases.  The
unlogged table one seems reasonable enough.

[0] https://postgr.es/m/17A03557-CF5C-4D4B-B719-A1D98DD75E75%40amazon.com

-- 
nathan

Re: CHECKPOINT unlogged data

From

Christoph Berg

Date:

30 May, 20:23:04

Re: Nathan Bossart
> This patch also adds an IMMEDIATE option, which I proposed some time ago
> [0].  I ended up withdrawing it due to general skepticism about its

Thanks for the pointer, I did not go that far back when looking for
older threads.

When writing the patch, I was also thinking about naming the option
"fast" or "spread" but ultimately went with "immediate" because that's
what the log message is using:

=# checkpoint;
2025-05-30 18:23:17.433 CEST [579834] LOG:  Checkpoint beginnt: immediate force wait

SQL command "(options)" tend to be booleans, hence "immediate {on|off}".
Introducing two separate keywords "fast" and "spread" seemed
confusing, and there is no precedent for "fast=on" in other tools or
the replication protocol.

> usefulness.  FWIW I have no concerns about adding a few retail options to
> CHECKPOINT, but others might balk at options without solid use-cases.  The
> unlogged table one seems reasonable enough.

I think the two options immediate and flush_all are actually useful in
combination for the shutdown case. If operation is to continue
normally until just before the shutdown, it might make sense to run
these 3 commands (or just #1 and #3):

checkpoint (flush_all, immediate false);
checkpoint (flush_all);
pg_ctl stop

(I also thought about a VERBOSE option, but since the checkpoint
messages are generated by a different process, it's probably harder
than I initially thought.)

Christoph

Re: CHECKPOINT unlogged data

From

Andres Freund

Date:

30 May, 20:29:08

On 2025-05-30 19:23:04 +0200, Christoph Berg wrote:
> Re: Nathan Bossart
> > This patch also adds an IMMEDIATE option, which I proposed some time ago
> > [0].  I ended up withdrawing it due to general skepticism about its
> 
> Thanks for the pointer, I did not go that far back when looking for
> older threads.
> 
> When writing the patch, I was also thinking about naming the option
> "fast" or "spread" but ultimately went with "immediate" because that's
> what the log message is using:
> 
> =# checkpoint;
> 2025-05-30 18:23:17.433 CEST [579834] LOG:  Checkpoint beginnt: immediate force wait
> 
> SQL command "(options)" tend to be booleans, hence "immediate {on|off}".
> Introducing two separate keywords "fast" and "spread" seemed
> confusing, and there is no precedent for "fast=on" in other tools or
> the replication protocol.

I'd add a 'mode' that can be set to an arbitrary string, which then can be
validated in C code. That seems more future proof.

Re: CHECKPOINT unlogged data

From

Christoph Berg

Date:

06 June, 13:03:28

Re: Andres Freund
> I'd add a 'mode' that can be set to an arbitrary string, which then can be
> validated in C code. That seems more future proof.

Changed in the attached v2, thanks.

Christoph

Attachment

v2-0001-Add-mode-and-flush_all-options-to-checkpoint.patch

Re: CHECKPOINT unlogged data

From

Fujii Masao

Date:

06 June, 13:31:25


On 2025/06/06 19:03, Christoph Berg wrote:
> Re: Andres Freund
>> I'd add a 'mode' that can be set to an arbitrary string, which then can be
>> validated in C code. That seems more future proof.
> 
> Changed in the attached v2, thanks.

When I applied the patch and compiled it, I got the following warnings:

utility.c:946:4: warning: label followed by a declaration is a C23 extension [-Wc23-extensions]
   946 |                         CheckPointStmt   *stmt = (CheckPointStmt *) parsetree;
       |                         ^
utility.c:947:16: warning: mixing declarations and code is incompatible with standards before C99
[-Wdeclaration-after-statement]
   947 |                         ListCell   *lc;
       |                                     ^
2 warnings generated.


            RequestCheckpoint(CHECKPOINT_WAIT |
+                              (immediate ? CHECKPOINT_IMMEDIATE : 0) |
+                              (flush_all ? CHECKPOINT_FLUSH_ALL : 0) |

Some users might want to trigger a spread checkpoint but not wait for
it to finish, since it could take a long time? If that's a valid use case,
maybe we should add a WAIT option to let users choose whether to wait for
the checkpoint to complete or not?

Regards,

-- 
Fujii Masao
NTT DATA Japan Corporation

Re: CHECKPOINT unlogged data

From

Christoph Berg

Date:

06 June, 17:26:56

Re: Fujii Masao
> utility.c:946:4: warning: label followed by a declaration is a C23 extension [-Wc23-extensions]

Thanks, my compiler didn't throw that. { } block added in v3.

>             RequestCheckpoint(CHECKPOINT_WAIT |
> +                              (immediate ? CHECKPOINT_IMMEDIATE : 0) |
> +                              (flush_all ? CHECKPOINT_FLUSH_ALL : 0) |
> 
> Some users might want to trigger a spread checkpoint but not wait for
> it to finish, since it could take a long time? If that's a valid use case,
> maybe we should add a WAIT option to let users choose whether to wait for
> the checkpoint to complete or not?

Do we want that? The checkpoint is only effective when it's finished,
and running `psql -c "checkpoint (wait false)"` might make people
shoot themselves into the foot.

Christoph

Attachment

v3-0001-Add-mode-and-flush_all-options-to-checkpoint.patch

Re: CHECKPOINT unlogged data

From

Nathan Bossart

Date:

06 June, 18:26:36

On Fri, Jun 06, 2025 at 04:26:56PM +0200, Christoph Berg wrote:
> Re: Fujii Masao
>> Some users might want to trigger a spread checkpoint but not wait for
>> it to finish, since it could take a long time? If that's a valid use case,
>> maybe we should add a WAIT option to let users choose whether to wait for
>> the checkpoint to complete or not?
> 
> Do we want that? The checkpoint is only effective when it's finished,
> and running `psql -c "checkpoint (wait false)"` might make people
> shoot themselves into the foot.

*shrug*

I imagine the documentation will pretty clearly indicate that setting WAIT
to "false" will cause CHECKPOINT to not wait for it to finish.

> +    <term><literal>MODE</literal></term>
> +    <listitem>
> +     <para>
> +      The <command>CHECKPOINT</command> command forces an immediate
> +      checkpoint by default when the command is issued, without waiting for a
> +      regular checkpoint scheduled by the system. <literal>FAST</literal> is a
> +      synonym for <literal>IMMEDIATE</literal>.
> +     </para>
> +     <para>
> +      A <literal>SPREAD</literal> checkpoint will instead spread out the write load
> +      as determined by the <xref linkend="guc-checkpoint-completion-target"/>
> +      setting, like the system-scheduled checkpoints.
> +     </para>
> +    </listitem>
> +   </varlistentry>

I don't understand why we need to add both FAST and IMMEDIATE.

> +   <varlistentry>
> +    <term><literal>FLUSH_ALL</literal></term>
> +    <listitem>
> +     <para>
> +      Requests the checkpoint to also flush data of <literal>UNLOGGED</literal>
> +      relations. Defaults to off.
> +     </para>
> +    </listitem>
> +   </varlistentry>

Could we rename this to something like FLUSH_UNLOGGED or INCLUDE_UNLOGGED?
IMHO that's more descriptive.

My attempt at this patch back in 2020 included the following note, which
seems relevant here:

+   Note that the server may consolidate concurrently requested checkpoints or
+   restartpoints.  Such consolidated requests will contain a combined set of
+   options.  For example, if one session requested an immediate checkpoint and
+   another session requested a non-immediate checkpoint, the server may combine
+   these requests and perform one immediate checkpoint.

We might also want to make sure it's clear that CHECKPOINT does nothing if
there's been no database activity since the last one (or, in the case of a
restartpoint, if there hasn't been a checkpoint record).

-- 
nathan

Re: CHECKPOINT unlogged data

From

Christoph Berg

Date:

06 June, 19:20:21

Re: Nathan Bossart
> I imagine the documentation will pretty clearly indicate that setting WAIT
> to "false" will cause CHECKPOINT to not wait for it to finish.

I can add it, it's easy enough...

> I don't understand why we need to add both FAST and IMMEDIATE.

We have both:

=# checkpoint ;
2025-06-06 18:09:25.743 CEST [872379] LOG:  checkpoint starting: immediate force wait

pg_basebackup --checkpoint=fast

Could we settle for one official name for that? Then we could use that
name in all contexts.

> > +    <term><literal>FLUSH_ALL</literal></term>
> 
> Could we rename this to something like FLUSH_UNLOGGED or INCLUDE_UNLOGGED?
> IMHO that's more descriptive.

That's again coming from what the log message is saying:

=# checkpoint (flush_all);
2025-06-06 18:12:46.298 CEST [873436] LOG:  checkpoint starting: immediate force wait flush-all

I think we should be consistent there.

#define CHECKPOINT_FLUSH_ALL    0x0010  /* Flush all pages, including those
                                         * belonging to unlogged tables */

Maybe CHECKPOINT_FLUSH_UNLOGGED would be more explicit?

> My attempt at this patch back in 2020 included the following note, which
> seems relevant here:
> 
> +   Note that the server may consolidate concurrently requested checkpoints or
> +   restartpoints.  Such consolidated requests will contain a combined set of
> +   options.  For example, if one session requested an immediate checkpoint and
> +   another session requested a non-immediate checkpoint, the server may combine
> +   these requests and perform one immediate checkpoint.

The CHECKPOINT documentation links to `28.5. WAL Configuration`,
should this be mentioned there instead?

> We might also want to make sure it's clear that CHECKPOINT does nothing if
> there's been no database activity since the last one (or, in the case of a
> restartpoint, if there hasn't been a checkpoint record).

That's taken care of by "force":

#define CHECKPOINT_FORCE        0x0008  /* Force even if no activity */

Christoph

Re: CHECKPOINT unlogged data

From

Nathan Bossart

Date:

06 June, 19:33:36

On Fri, Jun 06, 2025 at 06:20:21PM +0200, Christoph Berg wrote:
> Re: Nathan Bossart
>> I don't understand why we need to add both FAST and IMMEDIATE.
> 
> We have both:
> 
> =# checkpoint ;
> 2025-06-06 18:09:25.743 CEST [872379] LOG:  checkpoint starting: immediate force wait
> 
> pg_basebackup --checkpoint=fast
> 
> Could we settle for one official name for that? Then we could use that
> name in all contexts.

That seems like a good idea to me.  I'm tempted to say that "fast" more
accurately describes what's happening than "immediate."  "Immediate" sounds
like it happens instantaneously, but it's actually just happening "fast,"
i.e., as fast as possible.

>> > +    <term><literal>FLUSH_ALL</literal></term>
>> 
>> Could we rename this to something like FLUSH_UNLOGGED or INCLUDE_UNLOGGED?
>> IMHO that's more descriptive.
> 
> That's again coming from what the log message is saying:
> 
> =# checkpoint (flush_all);
> 2025-06-06 18:12:46.298 CEST [873436] LOG:  checkpoint starting: immediate force wait flush-all
> 
> I think we should be consistent there.
> 
> #define CHECKPOINT_FLUSH_ALL    0x0010  /* Flush all pages, including those
>                                          * belonging to unlogged tables */
> 
> Maybe CHECKPOINT_FLUSH_UNLOGGED would be more explicit?

WFM.

>> My attempt at this patch back in 2020 included the following note, which
>> seems relevant here:
>> 
>> +   Note that the server may consolidate concurrently requested checkpoints or
>> +   restartpoints.  Such consolidated requests will contain a combined set of
>> +   options.  For example, if one session requested an immediate checkpoint and
>> +   another session requested a non-immediate checkpoint, the server may combine
>> +   these requests and perform one immediate checkpoint.
> 
> The CHECKPOINT documentation links to `28.5. WAL Configuration`,
> should this be mentioned there instead?

I thought it would make sense to put it closer to where these options are
described, since it'll be most evident for manually-initiated checkpoints.

>> We might also want to make sure it's clear that CHECKPOINT does nothing if
>> there's been no database activity since the last one (or, in the case of a
>> restartpoint, if there hasn't been a checkpoint record).
> 
> That's taken care of by "force":
> 
> #define CHECKPOINT_FORCE        0x0008  /* Force even if no activity */

Oh, I see that we always specify that for CHECKPOINT commands, except for
restartpoints.  IIRC even if you do specify CHECKPOINT_FORCE for a
restartpoint, it'll have no effect.  It's proably worth mentioning that
case, at least.

-- 
nathan

Re: CHECKPOINT unlogged data

From

Christoph Berg

Date:

11 June, 16:45:46

Re: Nathan Bossart
> That seems like a good idea to me.  I'm tempted to say that "fast" more
> accurately describes what's happening than "immediate."  "Immediate" sounds
> like it happens instantaneously, but it's actually just happening "fast,"
> i.e., as fast as possible.

Ack.

> > #define CHECKPOINT_FLUSH_ALL    0x0010  /* Flush all pages, including those
> >                                          * belonging to unlogged tables */
> > 
> > Maybe CHECKPOINT_FLUSH_UNLOGGED would be more explicit?
> 
> WFM.

Do we want to change the checkpoint log message (and the new options)
only, or include the CHECKPOINT_* flags? (I would guess there aren't
many external users of these flags, but mmmv.)

> I thought it would make sense to put it closer to where these options are
> described, since it'll be most evident for manually-initiated checkpoints.

Ack, I'll add that.

> >> We might also want to make sure it's clear that CHECKPOINT does nothing if
> >> there's been no database activity since the last one (or, in the case of a
> >> restartpoint, if there hasn't been a checkpoint record).
> > 
> > That's taken care of by "force":
> > 
> > #define CHECKPOINT_FORCE        0x0008  /* Force even if no activity */
> 
> Oh, I see that we always specify that for CHECKPOINT commands, except for
> restartpoints.  IIRC even if you do specify CHECKPOINT_FORCE for a
> restartpoint, it'll have no effect.  It's proably worth mentioning that
> case, at least.

Right, will do.

Christoph

Re: CHECKPOINT unlogged data

From

Nathan Bossart

Date:

11 June, 16:59:42

On Wed, Jun 11, 2025 at 03:45:46PM +0200, Christoph Berg wrote:
> Do we want to change the checkpoint log message (and the new options)
> only, or include the CHECKPOINT_* flags? (I would guess there aren't
> many external users of these flags, but mmmv.)

IMO we should try to make the terminology consistent everywhere.  I'd
suggest putting the renaming stuff in separate prerequisite patches for
your new CHECKPOINT option.

-- 
nathan

Re: CHECKPOINT unlogged data

From

Nathan Bossart

Date:

16 June, 18:18:41

On Mon, Jun 16, 2025 at 04:36:59PM +0200, Christoph Berg wrote:
> I spent some time digging through the code, but I'm still not entirely
> sure what's happening. There are several parts to it:
> 
> 1) the list of buffers to flush is determined at the beginning of the
> checkpoint, so running a 2nd FLUSH_UNLOGGED checkpoint will not make
> the running checkpoint write these
> 
> 2) running CHECKPOINT updates the checkpoint flags in shared memory so
> I think the currently running checkpoint picks "MODE FAST" up and
> speeds up. (But I'm not entirely sure, the call stack is quite deep
> there.)
> 
> 3) running CHECKPOINT (at least when waiting for it) seems to actually
> start a new checkpoint, so FLUSH_UNLOGGED should still be effective.
> (See the code arount "start_cv" in checkpointer.c)
> 
> Admittedly, adding these points together raises some question marks
> about the flag handling, so I would welcome clarification by someone
> more knowledgeable in this area.

I think you've got it right.  With CHECKPOINT_WAIT set, RequestCheckpoint()
will wait for a new checkpoint to start, at which point we know that the
new flags have been seen by the checkpointer.  If an immediate checkpoint
is pending, CheckpointWriteDelay() will skip sleeping in the
currently-running one, so the current checkpoint will be "upgraded" to
immediate in some sense, but IIUC there will still be another immediate
checkpoint after it completes.  But AFAICT it doesn't pick up
FLUSH_UNLOGGED until the next checkpoint begins.

Another thing to note is what I mentioned earlier:

+   Note that the server may consolidate concurrently requested checkpoints or
+   restartpoints.  Such consolidated requests will contain a combined set of
+   options.  For example, if one session requested an immediate checkpoint and
+   another session requested a non-immediate checkpoint, the server may combine
+   these requests and perform one immediate checkpoint.

-- 
nathan