Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display) - Mailing list pgsql-hackers

From Bharath Rupireddy
Subject Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
Date
Msg-id CALj2ACVc5TZ87asuYDc2Lsk4_W=_Pb2vOpP8ogQAxWhBEfpYKg@mail.gmail.com
Whole thread Raw
In response to Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-hackers
On Thu, May 5, 2022 at 2:07 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2022-May-05, Bharath Rupireddy wrote:
>
> > On Fri, Apr 29, 2022 at 4:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > >
> > > Did we ever consider the idea of using a new pg_stat_wal_activity_progress
> > > view or something like that, using the backend_progress.c functionality?
> > > I don't see it mentioned in the thread.
> >
> > IMO, progress reporting works well on a running server and at the
> > moment. The WAL recovery/replay can happen even before the server
> > opens up for connections
>
> It's definitely true that you wouldn't be able to use it for when the
> server is not accepting connections.
>
> > and the progress report view can't be used
> > for later analysis like how much time the restoring WAL files from
> > archive location took
>
> This is true too -- progress report doesn't store historical data, only
> current status.
>
> > and also the WAL file names can't be reported in progress reporting
> > mechanism
>
> Also true.
>
> > (only integers columns, of course if required we can add text columns
> > to pg_stat_get_progress_info).
>
> Yeah, I don't think adding text columns is terribly easy, because the
> whole point of the progress reporting infrastructure is that it can be
> updated very cheaply as atomic operations, and if you want to transmit
> text columns, that's no longer possible.
>
> > Having the recovery info in server logs might help.
>
> I suppose it might.
>
> > I think reporting a long-running file processing operation (removing
> > or syncing) within postgres is a generic problem (for snapshot,
> > mapping, temporary (pgsql_tmp), temp relation files, old WAL file
> > processing, WAL file processing during recovery etc.) and needs to be
> > solved
>
> I agree up to here.
>
> > in two ways: 1) logging progress into server logs (which helps
> > for analysis and report when the server isn't available for
> > connections, crash recovery), a generic GUC
> > log_file_processing_traffic = {none, medium, high} might help here
> > (also proposed in [1]) and 2) pg_stat_file_processing_progress
> > (extending progress reporting pg_stat_get_progress_info to have few
> > text columns for current file name and directory path).
>
> I think using the server log to store telemetry data is not a great fit.
> It bloats the log files and can be so slow as to block other operations
> in the server.  Server logs should normally be considered critical info
> that's not okay to lose; telemetry tends to be of secondary importance
> and in a pinch you can drop a few messages without hurting too much.
>
> We've done moderately okay so far with having some system views where
> some telemetry readings can be obtained, but there several drawbacks to
> that approach that we should at some point solve.  My opinion on this is
> that we need to bite the bullet and develop separate infrastructure for
> reporting server metrics.
>
> I just
> think that we should look into a new mechanism going forward.

I completely agree that we must have new ways to report important
server metrics in an easily consumable fashion - it could be something
like server computing metrics (time taken, the file being processed
etc.) important/time-taking operations such as execution of archive
command, restore command, recovery, checkpoint, old WAL files
removal/recycling, processing of various files - snapshot, mapping,
pgsql_tmp files, temp relation files etc. and writing these metrics to
a file or a stat table on the database itself (this table can have an
automatic mechanism of clean up or limit on number of rows or size of
the table so that the clients can read those metrics and use it in
whichever way they want. To start with, a simple structure of the
metrics can be {OPERATION text, START_TIME timestamp, END_TIME
timestamp, few text, double and/or integer flexible columns that each
operation can use to store metrics}.

I can start a discussion in a separate thread to seek more thoughts on
the server metrics part.

> That said, I'm not opposed to having a patch somewhat as posted.

Thanks. I'm thinking if we should have a generic GUC
log_file_processing_traffic = {none, medium, high} that might help
reporting other time taking file processing operations such as [1].

[1] https://www.postgresql.org/message-id/CALj2ACW-ELOF5JT2zPavs95wbZ0BrLPrqvSZ7Ac%2BpjxCkmXtEQ%40mail.gmail.com

Regards,
Bharath Rupireddy.



pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: Comments on Custom RMGRs
Next
From: Robert Haas
Date:
Subject: Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)