Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs) - Mailing list pgsql-hackers

From Matthias van de Meent
Subject Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
Date
Msg-id CAEze2WhfmEWDKcFx8jhcSFhbSoP7ettdYiikNSsVWCNpe5ME0g@mail.gmail.com
Whole thread Raw
In response to Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)  (Nitin Jadhav <nitinjadhavpostgres@gmail.com>)
Responses Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
Re: Report checkpoint progress with pg_stat_progress_checkpoint (was: Report checkpoint progress in server logs)
List pgsql-hackers
On Thu, 10 Feb 2022 at 07:53, Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:
>
> > > We need at least a trace of the number of buffers to sync (num_to_scan) before the checkpoint start, instead of
justemitting the stats at the end.
 
> > >
> > > Bharat, it would be good to show the buffers synced counter and the total buffers to sync, checkpointer pid,
substepit is running, whether it is on target for completion, checkpoint_Reason
 
> > > (manual/times/forced). BufferSync has several variables tracking the sync progress locally, and we may need some
refactoringhere.
 
> >
> > I agree to provide above mentioned information as part of showing the
> > progress of current checkpoint operation. I am currently looking into
> > the code to know if any other information can be added.
>
> Here is the initial patch to show the progress of checkpoint through
> pg_stat_progress_checkpoint view. Please find the attachment.
>
> The information added to this view are pid - process ID of a
> CHECKPOINTER process, kind - kind of checkpoint indicates the reason
> for checkpoint (values can be wal, time or force), phase - indicates
> the current phase of checkpoint operation, total_buffer_writes - total
> number of buffers to be written, buffers_processed - number of buffers
> processed, buffers_written - number of buffers written,
> total_file_syncs - total number of files to be synced, files_synced -
> number of files synced.
>
> There are many operations happen as part of checkpoint. For each of
> the operation I am updating the phase field of
> pg_stat_progress_checkpoint view. The values supported for this field
> are initializing, checkpointing replication slots, checkpointing
> snapshots, checkpointing logical rewrite mappings, checkpointing CLOG
> pages, checkpointing CommitTs pages, checkpointing SUBTRANS pages,
> checkpointing MULTIXACT pages, checkpointing SLRU pages, checkpointing
> buffers, performing sync requests, performing two phase checkpoint,
> recycling old XLOG files and Finalizing. In case of checkpointing
> buffers phase, the fields total_buffer_writes, buffers_processed and
> buffers_written shows the detailed progress of writing buffers. In
> case of performing sync requests phase, the fields total_file_syncs
> and files_synced shows the detailed progress of syncing files. In
> other phases, only the phase field is getting updated and it is
> difficult to show the progress because we do not get the total number
> of files count without traversing the directory. It is not worth to
> calculate that as it affects the performance of the checkpoint. I also
> gave a thought to just mention the number of files processed, but this
> wont give a meaningful progress information (It can be treated as
> statistics). Hence just updating the phase field in those scenarios.
>
> Apart from above fields, I am planning to add few more fields to the
> view in the next patch. That is, process ID of the backend process
> which triggered a CHECKPOINT command, checkpoint start location, filed
> to indicate whether it is a checkpoint or restartpoint and elapsed
> time of the checkpoint operation. Please share your thoughts. I would
> be happy to add any other information that contributes to showing the
> progress of checkpoint.
>
> As per the discussion in this thread, there should be some mechanism
> to show the progress of checkpoint during shutdown and end-of-recovery
> cases as we cannot access pg_stat_progress_checkpoint in those cases.
> I am working on this to use log_startup_progress_interval mechanism to
> log the progress in the server logs.
>
> Kindly review the patch and share your thoughts.

Interesting idea, and overall a nice addition to the
pg_stat_progress_* reporting infrastructure.

Could you add your patch to the current commitfest at
https://commitfest.postgresql.org/37/?

See below for some comments on the patch:

> xlog.c @ checkpoint_progress_start, checkpoint_progress_update_param, checkpoint_progress_end
> +    /* In bootstrap mode, we don't actually record anything. */
> +    if (IsBootstrapProcessingMode())
> +        return;

Why do you check against the state of the system?
pgstat_progress_update_* already provides protections against updating
the progress tables if the progress infrastructure is not loaded; and
otherwise (in the happy path) the cost of updating the progress fields
will be quite a bit higher than normal. Updating stat_progress isn't
very expensive (quite cheap, really), so I don't quite get why you
guard against reporting stats when you expect no other client to be
listening.

I think you can simplify this a lot by directly using
pgstat_progress_update_param() instead.

> xlog.c @ checkpoint_progress_start
> +        pgstat_progress_start_command(PROGRESS_COMMAND_CHECKPOINT, InvalidOid);
> +        checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_PHASE,
> +                                         PROGRESS_CHECKPOINT_PHASE_INIT);
> +        if (flags & CHECKPOINT_CAUSE_XLOG)
> +            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> +                                             PROGRESS_CHECKPOINT_KIND_WAL);
> +        else if (flags & CHECKPOINT_CAUSE_TIME)
> +            checkpoint_progress_update_param(flags, PROGRESS_CHECKPOINT_KIND,
> +                                             PROGRESS_CHECKPOINT_KIND_TIME);
> + [...]

Could you assign the kind of checkpoint to a local variable, and then
update the "phase" and "kind" parameters at the same time through
pgstat_progress_update_multi_param(2, ...)? See
BuildRelationExtStatistics in extended_stats.c for an example usage.
Note that regardless of whether checkpoint_progress_update* will
remain, the checks done in that function already have been checked in
this function as well, so you can use the pgstat_* functions directly.

> monitoring.sgml
> +   <structname>pg_stat_progress_checkpoint</structname> view will contain a
> +   single row indicating the progress of checkpoint operation.

... add "if a checkpoint is currently active".

> +       <structfield>total_buffer_writes</structfield> <type>bigint</type>
> +       <structfield>total_file_syncs</structfield> <type>bigint</type>

The other progress tables use [type]_total as column names for counter
targets (e.g. backup_total for backup_streamed, heap_blks_total for
heap_blks_scanned, etc.). I think that `buffers_total` and
`files_total` would be better column names.

> +       The checkpoint operation is requested due to XLOG filling.

+ The checkpoint was started because >max_wal_size< of WAL was written.

> +       The checkpoint operation is requested due to timeout.

+ The checkpoint was started due to the expiration of a
>checkpoint_timeout< interval

> +       The checkpoint operation is forced even if no XLOG activity has occurred
> +       since the last one.

+ Some operation forced a checkpoint.

> +      <entry><literal>checkpointing CommitTs pages</literal></entry>

CommitTs -> Commit time stamp

Thanks for working on this.

Kind regards,

Matthias van de Meent



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Time to drop plpython2?
Next
From: Chapman Flack
Date:
Subject: Re: Time to drop plpython2?