Thread: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Bharath Rupireddy
Date:
Hi, As discussed in [1], isn't it a better idea to add some of activity messages [2] such as recovery, archive, backup, streaming etc. to server logs at LOG level? They are currently being set into ps display which is good if the postgres is being run on a standalone box/VM where users can see the ps display, but it doesn't help much in case the postgres is being run on a cloud environment where users don't have access to ps display output. Moreover, the ps display is transient and will not help to analyze after an issue occurs. Having the above messages in the server logs will be useful to understand how the system is/was doing/progressing with these (sometimes time-intensive) operations. Thoughts? [1] - https://www.postgresql.org/message-id/CALj2ACWE6D%2BbbKnA9tQ0d-pPN1n4xrsFPOmiURm2my3qt0w3OA%40mail.gmail.com [2] - snprintf(activitymsg, sizeof(activitymsg), "waiting for %s", snprintf(activitymsg, sizeof(activitymsg), "recovering %s", snprintf(activitymsg, sizeof(activitymsg), "performing %s%s%s", snprintf(activitymsg, sizeof(activitymsg), "archiving %s", xlog); snprintf(activitymsg, sizeof(activitymsg), "failed on %s", xlog); elog(DEBUG1, "archived write-ahead log file \"%s\"", xlog); snprintf(activitymsg, sizeof(activitymsg), "sending backup \"%s\"", snprintf(activitymsg, sizeof(activitymsg), "restarting at %X/%X", snprintf(activitymsg, sizeof(activitymsg), "streaming %X/%X", snprintf(activitymsg, sizeof(activitymsg), "streaming %X/%X", Regards, Bharath Rupireddy.
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
"Bossart, Nathan"
Date:
On 11/10/21, 9:43 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote: > As discussed in [1], isn't it a better idea to add some of activity > messages [2] such as recovery, archive, backup, streaming etc. to > server logs at LOG level? They are currently being set into ps display > which is good if the postgres is being run on a standalone box/VM > where users can see the ps display, but it doesn't help much in case > the postgres is being run on a cloud environment where users don't > have access to ps display output. Moreover, the ps display is > transient and will not help to analyze after an issue occurs. > > Having the above messages in the server logs will be useful to > understand how the system is/was doing/progressing with these > (sometimes time-intensive) operations. I think this would make the logs far too noisy for many servers. For archiving alone, this could cause tens of thousands more log messages per hour on a busy system. I think you can already see such information at a debug level, anyway. Nathan
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
"Euler Taveira"
Date:
On Wed, Nov 10, 2021, at 2:38 PM, Bharath Rupireddy wrote:
As discussed in [1], isn't it a better idea to add some of activitymessages [2] such as recovery, archive, backup, streaming etc. toserver logs at LOG level? They are currently being set into ps displaywhich is good if the postgres is being run on a standalone box/VMwhere users can see the ps display, but it doesn't help much in casethe postgres is being run on a cloud environment where users don'thave access to ps display output. Moreover, the ps display istransient and will not help to analyze after an issue occurs.
Besides recovery, the other activities already provide information through
views. I might be missing something but it seems the current views already
provide the information that ps displays.
Having the above messages in the server logs will be useful tounderstand how the system is/was doing/progressing with these(sometimes time-intensive) operations.
It could be useful for a root cause analysis, however, you are also increasing
the number of LOG messages in a system that doesn't/won't require such
information. It is fine to add additional DEBUG messages if there isn't a
similar one yet. If at least the message level were module-controlled, you
could modify a setting to gather more messages from a specific module.
Unfortunately, that's not possible so we should avoid superfluous messages.
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Bharath Rupireddy
Date:
On Thu, Nov 11, 2021 at 12:01 AM Bossart, Nathan <bossartn@amazon.com> wrote: > > On 11/10/21, 9:43 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote: > > As discussed in [1], isn't it a better idea to add some of activity > > messages [2] such as recovery, archive, backup, streaming etc. to > > server logs at LOG level? They are currently being set into ps display > > which is good if the postgres is being run on a standalone box/VM > > where users can see the ps display, but it doesn't help much in case > > the postgres is being run on a cloud environment where users don't > > have access to ps display output. Moreover, the ps display is > > transient and will not help to analyze after an issue occurs. > > > > Having the above messages in the server logs will be useful to > > understand how the system is/was doing/progressing with these > > (sometimes time-intensive) operations. > > I think this would make the logs far too noisy for many servers. For > archiving alone, this could cause tens of thousands more log messages > per hour on a busy system. I think you can already see such > information at a debug level, anyway. You are right, having them at LOG level may blow up the log files and disk storage, having them at DEBUG1 like other messages would at least help than having them as activitymsg in ps display. And not all of the activitymsgs are logged at DEBUGX level, at least I don't see them(on a quick search). snprintf(activitymsg, sizeof(activitymsg), "waiting for %s", snprintf(activitymsg, sizeof(activitymsg), "recovering %s", snprintf(activitymsg, sizeof(activitymsg), "performing %s%s%s", snprintf(activitymsg, sizeof(activitymsg), "sending backup \"%s\"", snprintf(activitymsg, sizeof(activitymsg), "restarting at %X/%X", snprintf(activitymsg, sizeof(activitymsg), "streaming %X/%X", But for the snprintf(activitymsg, sizeof(activitymsg), "archiving %s", xlog); we have elog(DEBUG1, "archived write-ahead log file \"%s\"", xlog); after the archiving command. It is also good to have a similar debug message before archive command execution, we can get to know whether archive command is executed or not, if yes how much time did it take etc.? I'm not sure whether it is okay to do the following, for every of the activitymsg, before the command execution, we have an elog(DEBUG1 message and after the command execution another elog(DEBUG1 message. Thoughts? Regards, Bharath Rupireddy.
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Bharath Rupireddy
Date:
On Thu, Nov 11, 2021 at 1:40 AM Euler Taveira <euler@eulerto.com> wrote: > > On Wed, Nov 10, 2021, at 2:38 PM, Bharath Rupireddy wrote: > > As discussed in [1], isn't it a better idea to add some of activity > messages [2] such as recovery, archive, backup, streaming etc. to > server logs at LOG level? They are currently being set into ps display > which is good if the postgres is being run on a standalone box/VM > where users can see the ps display, but it doesn't help much in case > the postgres is being run on a cloud environment where users don't > have access to ps display output. Moreover, the ps display is > transient and will not help to analyze after an issue occurs. > > Besides recovery, the other activities already provide information through > views. I might be missing something but it seems the current views already > provide the information that ps displays. > > Having the above messages in the server logs will be useful to > understand how the system is/was doing/progressing with these > (sometimes time-intensive) operations. > > It could be useful for a root cause analysis, however, you are also increasing > the number of LOG messages in a system that doesn't/won't require such > information. It is fine to add additional DEBUG messages if there isn't a > similar one yet. If at least the message level were module-controlled, you > could modify a setting to gather more messages from a specific module. > Unfortunately, that's not possible so we should avoid superfluous messages. Yes, at least having the messages at DEBUG1 level would help do the RCA after an issue occurs. Please see my response upthread at [1]. [1] - https://www.postgresql.org/message-id/CALj2ACXVwnHhek9bo9wgamtW9SBinj4ohF1f1Yf3Su2o5WarWg%40mail.gmail.com Regards, Bharath Rupireddy.
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Alvaro Herrera
Date:
On 2021-Nov-10, Bossart, Nathan wrote: > On 11/10/21, 9:43 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote: > > As discussed in [1], isn't it a better idea to add some of activity > > messages [2] such as recovery, archive, backup, streaming etc. to > > server logs at LOG level? > I think this would make the logs far too noisy for many servers. For > archiving alone, this could cause tens of thousands more log messages > per hour on a busy system. I think you can already see such > information at a debug level, anyway. Yeah. If we had some sort of ring buffer in which to store these messages, the user could examine them through a view; they would still be accessible in a running server, but they would not be written to the server log. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
"Bossart, Nathan"
Date:
On 11/10/21, 9:51 PM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote: > But for the snprintf(activitymsg, sizeof(activitymsg), "archiving %s", > xlog); we have elog(DEBUG1, "archived write-ahead log file \"%s\"", > xlog); after the archiving command. It is also good to have a similar > debug message before archive command execution, we can get to know > whether archive command is executed or not, if yes how much time did > it take etc.? There is a DEBUG3 that emits "executing archive command..." prior to executing the archive command. > I'm not sure whether it is okay to do the following, for every of the > activitymsg, before the command execution, we have an elog(DEBUG1 > message and after the command execution another elog(DEBUG1 message. The general idea seems reasonable to me. Nathan
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
"Bossart, Nathan"
Date:
On 11/11/21, 4:38 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote: > Yeah. If we had some sort of ring buffer in which to store these > messages, the user could examine them through a view; they would still > be accessible in a running server, but they would not be written to the > server log. That's an interesting idea. It'd be nice to have a way to see more debugging information without exploding the server log. Nathan
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
"Euler Taveira"
Date:
On Thu, Nov 11, 2021, at 2:50 AM, Bharath Rupireddy wrote:
But for the snprintf(activitymsg, sizeof(activitymsg), "archiving %s",xlog); we have elog(DEBUG1, "archived write-ahead log file \"%s\"",xlog); after the archiving command. It is also good to have a similardebug message before archive command execution, we can get to knowwhether archive command is executed or not, if yes how much time didit take etc.?
If you are only interested in how much time it took to archive a WAL file, you
don't need 2 messages (before and after); one message is sufficient. If you
calculate the difference and then print it as part of the existent message.
Calculate the difference iif the log level is at least DEBUG1 (same as the log
message).
Messages related to system() e.g. archive and recovery, you could measure the
execution time in your own script/program.
I like Alvaro's idea of implementing a ring buffer for this kind of activity.
This implementation could be applied to checkpoints, archiving, recovery,
replication, base backup, WAL usage, connections and others.
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Bruce Momjian
Date:
On Thu, Nov 11, 2021 at 03:18:16PM +0000, Bossart, Nathan wrote: > On 11/11/21, 4:38 AM, "Alvaro Herrera" <alvherre@alvh.no-ip.org> wrote: > > Yeah. If we had some sort of ring buffer in which to store these > > messages, the user could examine them through a view; they would still > > be accessible in a running server, but they would not be written to the > > server log. > > That's an interesting idea. It'd be nice to have a way to see more > debugging information without exploding the server log. Uh, that's what dmesg does, which I mentioned earlier. -- Bruce Momjian <bruce@momjian.us> https://momjian.us EDB https://enterprisedb.com If only the physical world exists, free will is an illusion.
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Ashutosh Sharma
Date:
On Thu, Nov 11, 2021 at 6:07 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
On 2021-Nov-10, Bossart, Nathan wrote:
> On 11/10/21, 9:43 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote:
> > As discussed in [1], isn't it a better idea to add some of activity
> > messages [2] such as recovery, archive, backup, streaming etc. to
> > server logs at LOG level?
> I think this would make the logs far too noisy for many servers. For
> archiving alone, this could cause tens of thousands more log messages
> per hour on a busy system. I think you can already see such
> information at a debug level, anyway.
Yeah. If we had some sort of ring buffer in which to store these
messages, the user could examine them through a view; they would still
be accessible in a running server, but they would not be written to the
server log.
That's a good idea. How about also adding some GUC(s) to the log archive, recovery related log messages just like we have for checkpoints, autovacuum etc? Maybe something like log_archive, log_recovery etc.
--
--
With Regards,
Ashutosh Sharma.
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Bharath Rupireddy
Date:
On Thu, Nov 11, 2021 at 8:41 PM Bossart, Nathan <bossartn@amazon.com> wrote: > > I'm not sure whether it is okay to do the following, for every of the > > activitymsg, before the command execution, we have an elog(DEBUG1 > > message and after the command execution another elog(DEBUG1 message. > > The general idea seems reasonable to me. Here's a simple patch that I could come up with. Please have a look. Regards, Bharath Rupireddy.
Attachment
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Alvaro Herrera
Date:
A bunch of these now execute snprintf()s unnecessarily. I think these should be made conditional on message_level_is_interesting(DEBUG1) to avoid that overhead. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "You're _really_ hosed if the person doing the hiring doesn't understand relational systems: you end up with a whole raft of programmers, none of whom has had a Date with the clue stick." (Andrew Sullivan)
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Bharath Rupireddy
Date:
On Sat, Nov 13, 2021 at 9:04 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> A bunch of these now execute snprintf()s unnecessarily. I think these
> should be made conditional on message_level_is_interesting(DEBUG1) to
> avoid that overhead.
Thanks. Attaching the v2 to avoid that by directly using the message in ereport instead of activitymsg.
Regards,
Bharath Rupireddy.
>
> A bunch of these now execute snprintf()s unnecessarily. I think these
> should be made conditional on message_level_is_interesting(DEBUG1) to
> avoid that overhead.
Thanks. Attaching the v2 to avoid that by directly using the message in ereport instead of activitymsg.
Regards,
Bharath Rupireddy.
Attachment
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
"Bossart, Nathan"
Date:
On 11/14/21, 4:18 AM, "Bharath Rupireddy" <bharath.rupireddyforpostgres@gmail.com> wrote: > Thanks. Attaching the v2 to avoid that by directly using the message > in ereport instead of activitymsg. + ereport(DEBUG1, + (errmsg_internal("waiting for %s", xlogfname))); As a general comment, I think we should include more detail than what goes in the ps title. While we probably want to keep the ps title, brief, I don't see any issue with including a bit more context in debug statements. + ereport(DEBUG1, + (errmsg_internal("executing archive command \"%s\" for archving %s", + xlogarchcmd, xlog))); I wonder if it's necessary to include the "for archiving" part in this one. Typically, the xlogarchcmd will include it somewhere. Nathan
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Michael Paquier
Date:
On Mon, Nov 15, 2021 at 10:36:09PM +0000, Bossart, Nathan wrote: > + ereport(DEBUG1, > + (errmsg_internal("executing archive command \"%s\" for archving %s", > + xlogarchcmd, xlog))); > > I wonder if it's necessary to include the "for archiving" part in this > one. Typically, the xlogarchcmd will include it somewhere. Don't see any need to add that either (s/archving/archiving/, btw). @@ -1000,6 +1004,10 @@ XLogWalRcvFlush(bool dying, TimeLineID tli) + ereport(DEBUG1, + (errmsg_internal("streaming %X/%X", + LSN_FORMAT_ARGS(LogstreamResult.Write)))); Having the same wording for the WAL receiver and the WAL sender could be confusing when it comes to a cascading standby node, particularly if log_line_prefix does not include the process PID. + ereport(DEBUG1, + (errmsg_internal("streaming %X/%X", + LSN_FORMAT_ARGS(sentPtr)))); Anyway, those two ones are going to make the logs much more noisy, no? The same could be said about XLogFileRead(), joining the point of Nathan's upthread. So I cannot get excited by this change. -- Michael
Attachment
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
"Bossart, Nathan"
Date:
On 11/15/21, 7:14 PM, "Michael Paquier" <michael@paquier.xyz> wrote: > + ereport(DEBUG1, > + (errmsg_internal("streaming %X/%X", > + LSN_FORMAT_ARGS(sentPtr)))); > Anyway, those two ones are going to make the logs much more noisy, no? > The same could be said about XLogFileRead(), joining the point of > Nathan's upthread. So I cannot get excited by this change. Yeah, this might even be too noisy for DEBUG5. Nathan
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Alvaro Herrera
Date:
On 2021-Nov-16, Bossart, Nathan wrote: > On 11/15/21, 7:14 PM, "Michael Paquier" <michael@paquier.xyz> wrote: > > + ereport(DEBUG1, > > + (errmsg_internal("streaming %X/%X", > > + LSN_FORMAT_ARGS(sentPtr)))); > > Anyway, those two ones are going to make the logs much more noisy, no? > > The same could be said about XLogFileRead(), joining the point of > > Nathan's upthread. So I cannot get excited by this change. > > Yeah, this might even be too noisy for DEBUG5. Nod. And if you're at DEBUG5, the log contains so much other crap that it is unusable for any purposes anyway. My opinion is that adding these things willy-nilly is not a solution to any actual problem. Adding a few additional log lines that are low-volume at DEBUG1 might be useful, but below that (DEBUG2 etc) it's not good for anything other than specific development, IMO. At least this particular one for streaming replication I think we should not include. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Michael Paquier
Date:
On Tue, Nov 16, 2021 at 01:26:49PM -0300, Alvaro Herrera wrote: > My opinion is that adding these things willy-nilly is not a solution to > any actual problem. Adding a few additional log lines that are > low-volume at DEBUG1 might be useful, but below that (DEBUG2 etc) it's > not good for anything other than specific development, IMO. At least > this particular one for streaming replication I think we should not > include. Looking at v2, I think that this leaves the additions of the DEBUG1 entries in SendBaseBackup() and WalRcvWaitForStartPosition(), then. The one in pgarch.c does not provide any additional information as the segment to-be-archived should be part of the command. -- Michael
Attachment
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Bharath Rupireddy
Date:
On Wed, Nov 17, 2021 at 8:01 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Nov 16, 2021 at 01:26:49PM -0300, Alvaro Herrera wrote: > > My opinion is that adding these things willy-nilly is not a solution to > > any actual problem. Adding a few additional log lines that are > > low-volume at DEBUG1 might be useful, but below that (DEBUG2 etc) it's > > not good for anything other than specific development, IMO. At least > > this particular one for streaming replication I think we should not > > include. > > Looking at v2, I think that this leaves the additions of the DEBUG1 > entries in SendBaseBackup() and WalRcvWaitForStartPosition(), then. > The one in pgarch.c does not provide any additional information as the > segment to-be-archived should be part of the command. Thank you all for the inputs. Here's the patch that I've come up with. Upon thinking further, having at least the messages at LOG level [1] would be helpful to know what's happening with the system while in recovery. Although these messages at LOG level seem to be filling up the server logs, having a good log consumption and rotation mechanism (I'm sure every major postgres vendor would have one) would be sufficient to allay that concern. These LOG messages would help us know how much time a restore command takes to fetch the WAL file and what is the current WAL file the server is recovering and where is it recovering from. The customer often asks questions like: when will my server come up? how much time does the recovery of my server take? As a developer or admin, one can monitor these logs and do bunch of things: 1) see how many WAL files left to be recovered by looking at the WAL files in the archive location or pg_wal directory or from primary 2) provide an approximate estimation of when the server will come up or how much more the recovery takes by looking at these previous LOG messages, one can know the number of WAL files that server recovered over a minute and with the help of left-over WAL files calculated from (1). [1] ereport(LOG, (errmsg("waiting for WAL segment \"%s\" from archive", xlogfname))); ereport(LOG, (errmsg("restored WAL segment \"%s\" from archive", xlogfname))); ereport(LOG, (errmsg("recovering WAL segment \"%s\" from source \"%s\"", xlogfname, srcname))); Regards, Bharath Rupireddy.
Attachment
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Michael Paquier
Date:
On Tue, Dec 07, 2021 at 06:28:24PM +0530, Bharath Rupireddy wrote: > Upon thinking further, having at least the messages at LOG level [1] > would be helpful to know what's happening with the system while in > recovery. Although these messages at LOG level seem to be filling up > the server logs, having a good log consumption and rotation mechanism > (I'm sure every major postgres vendor would have one) would be > sufficient to allay that concern. + ereport(LOG, + (errmsg("recovering WAL segment \"%s\" from source \"%s\"", + xlogfname, srcname))); Isn't this approach an issue for translations? It seems to me that terms like "stream" or "archive" had better be translated properly, meaning that this needs a fully-constructed sentence. -- Michael
Attachment
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Bharath Rupireddy
Date:
On Wed, Dec 8, 2021 at 10:05 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Dec 07, 2021 at 06:28:24PM +0530, Bharath Rupireddy wrote: > > Upon thinking further, having at least the messages at LOG level [1] > > would be helpful to know what's happening with the system while in > > recovery. Although these messages at LOG level seem to be filling up > > the server logs, having a good log consumption and rotation mechanism > > (I'm sure every major postgres vendor would have one) would be > > sufficient to allay that concern. > > + ereport(LOG, > + (errmsg("recovering WAL segment \"%s\" from source \"%s\"", > + xlogfname, srcname))); > Isn't this approach an issue for translations? It seems to me that > terms like "stream" or "archive" had better be translated properly, > meaning that this needs a fully-constructed sentence. Thanks for taking a look at the patch. How about the attached v4? I added a CF entry - https://commitfest.postgresql.org/36/3443/ Regards, Bharath Rupireddy.
Attachment
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Michael Paquier
Date:
On Wed, Dec 08, 2021 at 10:47:57AM +0530, Bharath Rupireddy wrote: > Thanks for taking a look at the patch. How about the attached v4? > > I added a CF entry - https://commitfest.postgresql.org/36/3443/ + else if (source == XLOG_FROM_STREAM) + ereport(LOG, + (errmsg("recovering WAL segment \"%s\" received from primary", + xlogfname))); This is incorrect when using a cascading standby. And perhaps this could use a switch/case? While quickly testing, I got reminded that the LOG for a segment retrieved from the local pg_wal would generate some noise when running for example the bootstrap process. Is this one really interesting compared to the two others? -- Michael
Attachment
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Bharath Rupireddy
Date:
On Wed, Dec 8, 2021 at 12:52 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Wed, Dec 08, 2021 at 10:47:57AM +0530, Bharath Rupireddy wrote: > > Thanks for taking a look at the patch. How about the attached v4? > > > > I added a CF entry - https://commitfest.postgresql.org/36/3443/ > > + else if (source == XLOG_FROM_STREAM) > + ereport(LOG, > + (errmsg("recovering WAL segment \"%s\" received from primary", > + xlogfname))); > This is incorrect when using a cascading standby. I visualize the setup like this: primary -> standby -> cascading standby For standby, "received from primary" makes sense. For cascading standby too it makes sense because the standby still acts as primary for cascading standby, no? And we don't distinguish any other existing messages for a standby acting as a cascading standby right? The comments around the enum XLOG_FROM_STREAM and usage of it still says from "primary". Do you want me to add "received from stream" in general? > And perhaps this could use a switch/case? Yes, I will change it to switch-case in the next version. > While quickly testing, I got reminded that the LOG for a segment > retrieved from the local pg_wal would generate some noise when running > for example the bootstrap process. Is this one really interesting > compared to the two others? Yes. I observed that for a simple primary-standby setup, inserting 10million rows generated around 40 WAL files and so 40 LOG messages in the standby server logs, in a typical production environment with heavy-write workload, the amount of LOG generation is going to be much more. And we don't want to have them as DEBUG1 messages and enable DEBUG1 just for recovery messages that one is interested in. How about we invent a new GUC log_recovery, with default set to false, similar to log_checkpoints? And, these messages can be like: ereport(log_recovery ? LOG : DEBUG1, .....)? This GUC can be useful to add some recovery stats in future as well. Thoughts? Regards, Bharath Rupireddy.
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Michael Paquier
Date:
On Wed, Dec 08, 2021 at 02:00:12PM +0530, Bharath Rupireddy wrote: > For standby, "received from primary" makes sense. For cascading > standby too it makes sense because the standby still acts as primary > for cascading standby, no? And we don't distinguish any other existing > messages for a standby acting as a cascading standby right? The > comments around the enum XLOG_FROM_STREAM and usage of it still says > from "primary". Quoting the docs: https://www.postgresql.org/docs/devel/glossary.html#GLOSSARY-PRIMARY-SERVER And a standby is no such thing. > Do you want me to add "received from stream" in general? "from stream" would be fine, IMHO. > How about we invent a new GUC log_recovery, with default set to false, > similar to log_checkpoints? And, these messages can be like: > ereport(log_recovery ? LOG : DEBUG1, .....)? This GUC can be useful to > add some recovery stats in future as well. Thoughts? This discussion does not justify a new GUC at this stage IMO. -- Michael
Attachment
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Bharath Rupireddy
Date:
On Wed, Dec 8, 2021 at 2:08 PM Michael Paquier <michael@paquier.xyz> wrote: > > Do you want me to add "received from stream" in general? > > "from stream" would be fine, IMHO. Done in the attached v5 patch. > > How about we invent a new GUC log_recovery, with default set to false, > > similar to log_checkpoints? And, these messages can be like: > > ereport(log_recovery ? LOG : DEBUG1, .....)? This GUC can be useful to > > add some recovery stats in future as well. Thoughts? > > This discussion does not justify a new GUC at this stage IMO. Having "log_recovery" GUC gives the flexibility to the users to enable the recovery logs added by this patch if they are concerned otherwise the logs just be at DEBUG1 level. Setting log_min_messages level to DEBUG1 would definitely flood the server logs. > While quickly testing, I got reminded that the LOG for a segment > retrieved from the local pg_wal would generate some noise when running > for example the bootstrap process. Is this one really interesting > compared to the two others? If we just have the archive related message at LOG level and others at DEBUG1 level, it is a kind of partial recovery log information that one has in the log files, missing recovery messages from the pg_wal directory in the logs. Another option is to just have all the messages (added in the patch) at DEBUG1 level, let's users deal with it. IMO, the new GUC "log_recovery" is way better as it can be extended to include some recovery stats info as well. Thoughts? Regards, Bharath Rupireddy.
Attachment
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Michael Paquier
Date:
On Wed, Dec 08, 2021 at 04:57:49PM +0530, Bharath Rupireddy wrote: > Done in the attached v5 patch. Thanks for the new version. FWIW, as the information logged when recovering a WAL segment from the local pg_wal could be rather confusing at bootstrap phase, I would suggest to simplify things as of the attached and call it a day. -- Michael
Attachment
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Bharath Rupireddy
Date:
On Thu, Dec 9, 2021 at 7:56 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Wed, Dec 08, 2021 at 04:57:49PM +0530, Bharath Rupireddy wrote: > > Done in the attached v5 patch. > > Thanks for the new version. FWIW, as the information logged when > recovering a WAL segment from the local pg_wal could be rather > confusing at bootstrap phase, I would suggest to simplify things as of > the attached and call it a day. Thanks for the patch. It looks good to me. I have a suggestion: It's better if we specify why we are not emitting any message if the WAL segment is found in pg_wal directory, something like "we are intentionally not emitting any messages in case if the required WAL segment is found in the pg_wal directory to avoid noise in the server logs". + case XLOG_FROM_PG_WAL: + case XLOG_FROM_ANY: + break; Regards, Bharath Rupireddy.
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Bharath Rupireddy
Date:
On Thu, Dec 9, 2021 at 11:28 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Thu, Dec 9, 2021 at 7:56 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Wed, Dec 08, 2021 at 04:57:49PM +0530, Bharath Rupireddy wrote: > > > Done in the attached v5 patch. > > > > Thanks for the new version. FWIW, as the information logged when > > recovering a WAL segment from the local pg_wal could be rather > > confusing at bootstrap phase, I would suggest to simplify things as of > > the attached and call it a day. > > Thanks for the patch. It looks good to me. I have a suggestion: It's > better if we specify why we are not emitting any message if the WAL > segment is found in pg_wal directory, something like "we are > intentionally not emitting any messages in case if the required WAL > segment is found in the pg_wal directory to avoid noise in the server > logs". > > + case XLOG_FROM_PG_WAL: > + case XLOG_FROM_ANY: > + break; I just want to call this out: an insertion of 10 million rows in the primary generates a bunch of messages in the standby [1] within 40 sec (size of the standby server log grows by 5K). [1] 2021-12-08 08:11:53.906 UTC [1219898] LOG: recovering WAL segment "000000010000000000000004" from stream 2021-12-08 08:11:55.864 UTC [1219898] LOG: recovering WAL segment "000000010000000000000005" from stream 2021-12-08 08:11:56.713 UTC [1219898] LOG: recovering WAL segment "000000010000000000000006" from stream 2021-12-08 08:11:57.495 UTC [1219898] LOG: recovering WAL segment "000000010000000000000007" from stream 2021-12-08 08:11:58.301 UTC [1219898] LOG: recovering WAL segment "000000010000000000000008" from stream 2021-12-08 08:11:59.062 UTC [1219898] LOG: recovering WAL segment "000000010000000000000009" from stream 2021-12-08 08:11:59.853 UTC [1219898] LOG: recovering WAL segment "00000001000000000000000A" from stream 2021-12-08 08:12:00.573 UTC [1219898] LOG: recovering WAL segment "00000001000000000000000B" from stream 2021-12-08 08:12:01.346 UTC [1219898] LOG: recovering WAL segment "00000001000000000000000C" from stream 2021-12-08 08:12:02.048 UTC [1219898] LOG: recovering WAL segment "00000001000000000000000D" from stream 2021-12-08 08:12:02.798 UTC [1219898] LOG: recovering WAL segment "00000001000000000000000E" from stream 2021-12-08 08:12:03.544 UTC [1219898] LOG: recovering WAL segment "00000001000000000000000F" from stream 2021-12-08 08:12:04.310 UTC [1219898] LOG: recovering WAL segment "000000010000000000000010" from stream 2021-12-08 08:12:05.096 UTC [1219898] LOG: recovering WAL segment "000000010000000000000011" from stream 2021-12-08 08:12:06.003 UTC [1219898] LOG: recovering WAL segment "000000010000000000000012" from stream 2021-12-08 08:12:06.762 UTC [1219898] LOG: recovering WAL segment "000000010000000000000013" from stream 2021-12-08 08:12:07.595 UTC [1219898] LOG: recovering WAL segment "000000010000000000000014" from stream 2021-12-08 08:12:08.431 UTC [1219898] LOG: recovering WAL segment "000000010000000000000015" from stream 2021-12-08 08:12:09.246 UTC [1219898] LOG: recovering WAL segment "000000010000000000000016" from stream 2021-12-08 08:12:10.144 UTC [1219898] LOG: recovering WAL segment "000000010000000000000017" from stream 2021-12-08 08:12:14.944 UTC [1219898] LOG: recovering WAL segment "000000010000000000000018" from stream 2021-12-08 08:12:15.784 UTC [1219898] LOG: recovering WAL segment "000000010000000000000019" from stream 2021-12-08 08:12:16.757 UTC [1219898] LOG: recovering WAL segment "00000001000000000000001A" from stream 2021-12-08 08:12:17.571 UTC [1219898] LOG: recovering WAL segment "00000001000000000000001B" from stream 2021-12-08 08:12:18.375 UTC [1219898] LOG: recovering WAL segment "00000001000000000000001C" from stream 2021-12-08 08:12:19.179 UTC [1219898] LOG: recovering WAL segment "00000001000000000000001D" from stream 2021-12-08 08:12:19.964 UTC [1219898] LOG: recovering WAL segment "00000001000000000000001E" from stream 2021-12-08 08:12:20.839 UTC [1219898] LOG: recovering WAL segment "00000001000000000000001F" from stream 2021-12-08 08:12:21.782 UTC [1219898] LOG: recovering WAL segment "000000010000000000000020" from stream 2021-12-08 08:12:24.422 UTC [1219898] LOG: recovering WAL segment "000000010000000000000021" from stream 2021-12-08 08:12:25.164 UTC [1219898] LOG: recovering WAL segment "000000010000000000000022" from stream 2021-12-08 08:12:25.989 UTC [1219898] LOG: recovering WAL segment "000000010000000000000023" from stream 2021-12-08 08:12:26.874 UTC [1219898] LOG: recovering WAL segment "000000010000000000000024" from stream 2021-12-08 08:12:27.592 UTC [1219898] LOG: recovering WAL segment "000000010000000000000025" from stream 2021-12-08 08:12:28.471 UTC [1219898] LOG: recovering WAL segment "000000010000000000000026" from stream 2021-12-08 08:12:29.158 UTC [1219898] LOG: recovering WAL segment "000000010000000000000027" from stream 2021-12-08 08:12:30.017 UTC [1219898] LOG: recovering WAL segment "000000010000000000000028" from stream 2021-12-08 08:12:30.778 UTC [1219898] LOG: recovering WAL segment "000000010000000000000029" from stream 2021-12-08 08:12:31.859 UTC [1219898] LOG: recovering WAL segment "00000001000000000000002A" from stream Regards, Bharath Rupireddy.
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Alvaro Herrera
Date:
On 2021-Dec-09, Bharath Rupireddy wrote: > I just want to call this out: an insertion of 10 million rows in the > primary generates a bunch of messages in the standby [1] within 40 sec > (size of the standby server log grows by 5K). Hmm, that does sound excessive to me in terms of log bloat increase. Remember the discussion about turning log_checkpoints on by default? -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/ "Las mujeres son como hondas: mientras más resistencia tienen, más lejos puedes llegar con ellas" (Jonas Nightingale, Leap of Faith)
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Bharath Rupireddy
Date:
On Thu, Dec 9, 2021 at 6:00 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2021-Dec-09, Bharath Rupireddy wrote: > > > I just want to call this out: an insertion of 10 million rows in the > > primary generates a bunch of messages in the standby [1] within 40 sec > > (size of the standby server log grows by 5K). > > Hmm, that does sound excessive to me in terms of log bloat increase. > Remember the discussion about turning log_checkpoints on by default? The amount of LOG messages generated when the log_checkpoints GUC is set to on, are quite less, hardly 4 messages per-checkpoint (5min). I think enabling log_checkpoints is still acceptable as most of the hackers agreed in another thread [1] that the advantages with it are more and it doesn't seem to be bloating the server logs (in a day at max 1152 messages). I'm not sure if it is worth having a GUC log_recovery if enabled the recovery messages can be emitted at LOG level otherwise DEBUG1 level. log_recovery GUC can also be used to collect and emit some recovery stats like log_checkpoints. [1] - https://www.postgresql.org/message-id/CA%2BTgmoaDFpFmNQuaWj6%2B78CPVBrF_WPT1wFHBTvio%3DtRmxzUcQ%40mail.gmail.com Regards, Bharath Rupireddy.
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Alvaro Herrera
Date:
On 2021-Dec-09, Bharath Rupireddy wrote: > On Thu, Dec 9, 2021 at 6:00 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > On 2021-Dec-09, Bharath Rupireddy wrote: > > > > > I just want to call this out: an insertion of 10 million rows in the > > > primary generates a bunch of messages in the standby [1] within 40 sec > > > (size of the standby server log grows by 5K). > > > > Hmm, that does sound excessive to me in terms of log bloat increase. > > Remember the discussion about turning log_checkpoints on by default? > > The amount of LOG messages generated when the log_checkpoints GUC is > set to on, are quite less, hardly 4 messages per-checkpoint (5min). I > think enabling log_checkpoints is still acceptable as most of the > hackers agreed in another thread [1] that the advantages with it are > more and it doesn't seem to be bloating the server logs (in a day at > max 1152 messages). My point is that it was argued, in that thread, that setting log_checkpoints to on by default would cause too much additional log volume. That argument was shot down, but in this thread we're arguing that we want five kilobytes of messages in forty seconds. That sounds much less acceptable to me, so making it default behavior or unconditional behavior is not going to fly very high. > I'm not sure if it is worth having a GUC log_recovery if enabled the > recovery messages can be emitted at LOG level otherwise DEBUG1 level. Maybe some tunable like log_wal_traffic = { none, medium, high } where "none" is current behavior of no noise, "medium" gets (say) once every 256 segments (e.g., when going from XFF to (X+1)00), "high" gets you one message per segment. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Bharath Rupireddy
Date:
On Thu, Dec 9, 2021 at 9:28 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2021-Dec-09, Bharath Rupireddy wrote: > > > On Thu, Dec 9, 2021 at 6:00 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > > > On 2021-Dec-09, Bharath Rupireddy wrote: > > > > > > > I just want to call this out: an insertion of 10 million rows in the > > > > primary generates a bunch of messages in the standby [1] within 40 sec > > > > (size of the standby server log grows by 5K). > > > > > > Hmm, that does sound excessive to me in terms of log bloat increase. > > > Remember the discussion about turning log_checkpoints on by default? > > > > The amount of LOG messages generated when the log_checkpoints GUC is > > set to on, are quite less, hardly 4 messages per-checkpoint (5min). I > > think enabling log_checkpoints is still acceptable as most of the > > hackers agreed in another thread [1] that the advantages with it are > > more and it doesn't seem to be bloating the server logs (in a day at > > max 1152 messages). > > My point is that it was argued, in that thread, that setting > log_checkpoints to on by default would cause too much additional log > volume. That argument was shot down, but in this thread we're arguing > that we want five kilobytes of messages in forty seconds. That sounds > much less acceptable to me, so making it default behavior or > unconditional behavior is not going to fly very high. > > > I'm not sure if it is worth having a GUC log_recovery if enabled the > > recovery messages can be emitted at LOG level otherwise DEBUG1 level. > > Maybe some tunable like > log_wal_traffic = { none, medium, high } > where "none" is current behavior of no noise, "medium" gets (say) once > every 256 segments (e.g., when going from XFF to (X+1)00), "high" gets > you one message per segment. Yeah, that can be an option. Another idea could be to use the infrastructure laid out by the commit 9ce346e [1]. With ereport_startup_progress, we can emit the LOGs(of current recovering WAL file) for every log_startup_progress_interval seconds/milliseconds. One problem is that ereport_startup_progress doesn't work on StandbyMode, maybe we can remove this restriction unless we have a major reason for not allowing it on the standby. /* Prepare to report progress of the redo phase. */ if (!StandbyMode) begin_startup_progress_phase(); Thoughts? [1] commit 9ce346eabf350a130bba46be3f8c50ba28506969 Author: Robert Haas <rhaas@postgresql.org> Date: Mon Oct 25 11:51:57 2021 -0400 Report progress of startup operations that take a long time. Discussion: https://postgr.es/m/CA+TgmoaHQrgDFOBwgY16XCoMtXxsrVGFB2jNCvb7-ubuEe1MGg@mail.gmail.com Discussion: https://postgr.es/m/CAMm1aWaHF7VE69572_OLQ+MgpT5RUiUDgF1x5RrtkJBLdpRj3Q@mail.gmail.com Regards, Bharath Rupireddy.
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Bharath Rupireddy
Date:
On Fri, Dec 24, 2021 at 7:19 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > On Sun, Dec 12, 2021 at 06:08:05PM +0530, Bharath Rupireddy wrote: > > Another idea could be to use the infrastructure laid out by the commit > > 9ce346e [1]. With ereport_startup_progress, we can emit the LOGs(of > > current recovering WAL file) for every log_startup_progress_interval > > seconds/milliseconds. > > > > One problem is that ereport_startup_progress doesn't work on > > StandbyMode, maybe we can remove this restriction unless we have a > > major reason for not allowing it on the standby. > > /* Prepare to report progress of the redo phase. */ > > if (!StandbyMode) > > begin_startup_progress_phase(); > > The relevant conversation starts here. > https://www.postgresql.org/message-id/20210814214700.GO10479%40telsasoft.com > > There was a lot of confusion in that thread, though. > > The understanding was that it didn't make sense for that feature to > continuously log messages on a standby (every 10sec by default). That seems > like too much - the issue of a checkpointed logged every 5min was enough of a > hurdle. > > If you're talking about a new feature that uses the infrastructre from 9ce3, > but is controlled by a separate GUC like log_wal_traffic, that could be okay. Do you see any problems using the same GUC log_startup_progress_interval and ereport_startup_progress API introduced by 9ce346e? I prefer this instead of a new GUC. Thoughts? Regards, Bharath Rupireddy.
Re: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display
From
Bharath Rupireddy
Date:
On Thu, Dec 9, 2021 at 9:28 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Maybe some tunable like > log_wal_traffic = { none, medium, high } > where "none" is current behavior of no noise, "medium" gets (say) once > every 256 segments (e.g., when going from XFF to (X+1)00), "high" gets > you one message per segment. On Fri, Dec 24, 2021 at 7:19 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > If you're talking about a new feature that uses the infrastructre from 9ce3, > but is controlled by a separate GUC like log_wal_traffic, that could be okay. Thanks for the suggestion. I've added a new GUC log_wal_traffic as suggested above. PSA v7 patch. Regards, Bharath Rupireddy.
Attachment
Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
From
Bharath Rupireddy
Date:
On Wed, Dec 29, 2021 at 6:50 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Thu, Dec 9, 2021 at 9:28 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > Maybe some tunable like > > log_wal_traffic = { none, medium, high } > > where "none" is current behavior of no noise, "medium" gets (say) once > > every 256 segments (e.g., when going from XFF to (X+1)00), "high" gets > > you one message per segment. > > On Fri, Dec 24, 2021 at 7:19 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > > > If you're talking about a new feature that uses the infrastructre from 9ce3, > > but is controlled by a separate GUC like log_wal_traffic, that could be okay. > > Thanks for the suggestion. I've added a new GUC log_wal_traffic as > suggested above. PSA v7 patch. Here's the rebased v8 patch, please review it. https://commitfest.postgresql.org/37/3443/ Regards, Bharath Rupireddy.
Attachment
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
From
Bharath Rupireddy
Date:
On Sun, Feb 27, 2022 at 12:50 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, Dec 29, 2021 at 6:50 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > On Thu, Dec 9, 2021 at 9:28 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > > > Maybe some tunable like > > > log_wal_traffic = { none, medium, high } > > > where "none" is current behavior of no noise, "medium" gets (say) once > > > every 256 segments (e.g., when going from XFF to (X+1)00), "high" gets > > > you one message per segment. > > > > On Fri, Dec 24, 2021 at 7:19 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > > > > > > If you're talking about a new feature that uses the infrastructre from 9ce3, > > > but is controlled by a separate GUC like log_wal_traffic, that could be okay. > > > > Thanks for the suggestion. I've added a new GUC log_wal_traffic as > > suggested above. PSA v7 patch. > > Here's the rebased v8 patch, please review it. > > https://commitfest.postgresql.org/37/3443/ Here's the rebased v9 patch. Regards, Bharath Rupireddy.
Attachment
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
From
Alvaro Herrera
Date:
Did we ever consider the idea of using a new pg_stat_wal_activity_progress view or something like that, using the backend_progress.c functionality? I don't see it mentioned in the thread. It's not an *exact* fit, since this is not about one "command" being executed by a "backend", but it seems a near enough fit, and it would unobtrusive enough that we don't need to worry about the progress-update rate or have a GUC to determine how frequently to update (with a gauge that's heavily workload-dependant and is likely to make little sense to some users -- consider differently-sized WAL segments, for one thing). -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
From
Bharath Rupireddy
Date:
On Fri, Apr 29, 2022 at 4:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > Did we ever consider the idea of using a new pg_stat_wal_activity_progress > view or something like that, using the backend_progress.c functionality? > I don't see it mentioned in the thread. IMO, progress reporting works well on a running server and at the moment. The WAL recovery/replay can happen even before the server opens up for connections and the progress report view can't be used for later analysis like how much time the restoring WAL files from archive location took and also the WAL file names can't be reported in progress reporting mechanism (only integers columns, of course if required we can add text columns to pg_stat_get_progress_info). Having the recovery info in server logs might help. > It's not an *exact* fit, since this is not about one "command" being > executed by a "backend", but it seems a near enough fit, and it would > unobtrusive enough that we don't need to worry about the progress-update > rate or have a GUC to determine how frequently to update (with a gauge > that's heavily workload-dependant and is likely to make little sense to > some users -- consider differently-sized WAL segments, for one thing). I think reporting a long-running file processing operation (removing or syncing) within postgres is a generic problem (for snapshot, mapping, temporary (pgsql_tmp), temp relation files, old WAL file processing, WAL file processing during recovery etc.) and needs to be solved in two ways: 1) logging progress into server logs (which helps for analysis and report when the server isn't available for connections, crash recovery), a generic GUC log_file_processing_traffic = {none, medium, high} might help here (also proposed in [1]) and 2) pg_stat_file_processing_progress (extending progress reporting pg_stat_get_progress_info to have few text columns for current file name and directory path). Thoughts? Regards, Bharath Rupireddy.
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
From
Alvaro Herrera
Date:
On 2022-May-05, Bharath Rupireddy wrote: > On Fri, Apr 29, 2022 at 4:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > Did we ever consider the idea of using a new pg_stat_wal_activity_progress > > view or something like that, using the backend_progress.c functionality? > > I don't see it mentioned in the thread. > > IMO, progress reporting works well on a running server and at the > moment. The WAL recovery/replay can happen even before the server > opens up for connections It's definitely true that you wouldn't be able to use it for when the server is not accepting connections. > and the progress report view can't be used > for later analysis like how much time the restoring WAL files from > archive location took This is true too -- progress report doesn't store historical data, only current status. > and also the WAL file names can't be reported in progress reporting > mechanism Also true. > (only integers columns, of course if required we can add text columns > to pg_stat_get_progress_info). Yeah, I don't think adding text columns is terribly easy, because the whole point of the progress reporting infrastructure is that it can be updated very cheaply as atomic operations, and if you want to transmit text columns, that's no longer possible. > Having the recovery info in server logs might help. I suppose it might. > I think reporting a long-running file processing operation (removing > or syncing) within postgres is a generic problem (for snapshot, > mapping, temporary (pgsql_tmp), temp relation files, old WAL file > processing, WAL file processing during recovery etc.) and needs to be > solved I agree up to here. > in two ways: 1) logging progress into server logs (which helps > for analysis and report when the server isn't available for > connections, crash recovery), a generic GUC > log_file_processing_traffic = {none, medium, high} might help here > (also proposed in [1]) and 2) pg_stat_file_processing_progress > (extending progress reporting pg_stat_get_progress_info to have few > text columns for current file name and directory path). I think using the server log to store telemetry data is not a great fit. It bloats the log files and can be so slow as to block other operations in the server. Server logs should normally be considered critical info that's not okay to lose; telemetry tends to be of secondary importance and in a pinch you can drop a few messages without hurting too much. We've done moderately okay so far with having some system views where some telemetry readings can be obtained, but there several drawbacks to that approach that we should at some point solve. My opinion on this is that we need to bite the bullet and develop separate infrastructure for reporting server metrics. That said, I'm not opposed to having a patch somewhat as posted. I just think that we should look into a new mechanism going forward. -- Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
From
Bharath Rupireddy
Date:
On Thu, May 5, 2022 at 2:07 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2022-May-05, Bharath Rupireddy wrote: > > > On Fri, Apr 29, 2022 at 4:11 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > > > > > Did we ever consider the idea of using a new pg_stat_wal_activity_progress > > > view or something like that, using the backend_progress.c functionality? > > > I don't see it mentioned in the thread. > > > > IMO, progress reporting works well on a running server and at the > > moment. The WAL recovery/replay can happen even before the server > > opens up for connections > > It's definitely true that you wouldn't be able to use it for when the > server is not accepting connections. > > > and the progress report view can't be used > > for later analysis like how much time the restoring WAL files from > > archive location took > > This is true too -- progress report doesn't store historical data, only > current status. > > > and also the WAL file names can't be reported in progress reporting > > mechanism > > Also true. > > > (only integers columns, of course if required we can add text columns > > to pg_stat_get_progress_info). > > Yeah, I don't think adding text columns is terribly easy, because the > whole point of the progress reporting infrastructure is that it can be > updated very cheaply as atomic operations, and if you want to transmit > text columns, that's no longer possible. > > > Having the recovery info in server logs might help. > > I suppose it might. > > > I think reporting a long-running file processing operation (removing > > or syncing) within postgres is a generic problem (for snapshot, > > mapping, temporary (pgsql_tmp), temp relation files, old WAL file > > processing, WAL file processing during recovery etc.) and needs to be > > solved > > I agree up to here. > > > in two ways: 1) logging progress into server logs (which helps > > for analysis and report when the server isn't available for > > connections, crash recovery), a generic GUC > > log_file_processing_traffic = {none, medium, high} might help here > > (also proposed in [1]) and 2) pg_stat_file_processing_progress > > (extending progress reporting pg_stat_get_progress_info to have few > > text columns for current file name and directory path). > > I think using the server log to store telemetry data is not a great fit. > It bloats the log files and can be so slow as to block other operations > in the server. Server logs should normally be considered critical info > that's not okay to lose; telemetry tends to be of secondary importance > and in a pinch you can drop a few messages without hurting too much. > > We've done moderately okay so far with having some system views where > some telemetry readings can be obtained, but there several drawbacks to > that approach that we should at some point solve. My opinion on this is > that we need to bite the bullet and develop separate infrastructure for > reporting server metrics. > > I just > think that we should look into a new mechanism going forward. I completely agree that we must have new ways to report important server metrics in an easily consumable fashion - it could be something like server computing metrics (time taken, the file being processed etc.) important/time-taking operations such as execution of archive command, restore command, recovery, checkpoint, old WAL files removal/recycling, processing of various files - snapshot, mapping, pgsql_tmp files, temp relation files etc. and writing these metrics to a file or a stat table on the database itself (this table can have an automatic mechanism of clean up or limit on number of rows or size of the table so that the clients can read those metrics and use it in whichever way they want. To start with, a simple structure of the metrics can be {OPERATION text, START_TIME timestamp, END_TIME timestamp, few text, double and/or integer flexible columns that each operation can use to store metrics}. I can start a discussion in a separate thread to seek more thoughts on the server metrics part. > That said, I'm not opposed to having a patch somewhat as posted. Thanks. I'm thinking if we should have a generic GUC log_file_processing_traffic = {none, medium, high} that might help reporting other time taking file processing operations such as [1]. [1] https://www.postgresql.org/message-id/CALj2ACW-ELOF5JT2zPavs95wbZ0BrLPrqvSZ7Ac%2BpjxCkmXtEQ%40mail.gmail.com Regards, Bharath Rupireddy.
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
From
Robert Haas
Date:
On Fri, Apr 29, 2022 at 5:11 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > Here's the rebased v9 patch. This seems like it has enormous overlap with the existing functionality that we have from log_startup_progress_interval. I think that facility is also better-designed than this one. It prints out a message based on elapsed time, whereas this patch prints out a message based progress through the WAL. That means that if WAL replay isn't actually advancing for some reason, you just won't get any log messages and you don't know whether it's advancing slowly or not at all or the server is just hung. With that facility you can distinguish those cases. Also, if for some reason we do think that amount of WAL replayed is the right metric, rather than time, why would we only allow high=1 segment and low=128 segments, rather than say any number of MB or GB that the user would like to configure? I suggest that if log_startup_progress_interval doesn't meet your needs here, we should try to understand why not and maybe enhance it, instead of adding a separate facility. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
From
Bharath Rupireddy
Date:
On Fri, May 13, 2022 at 6:41 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Apr 29, 2022 at 5:11 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > Here's the rebased v9 patch. > > This seems like it has enormous overlap with the existing > functionality that we have from log_startup_progress_interval. > > I think that facility is also better-designed than this one. It prints > out a message based on elapsed time, whereas this patch prints out a > message based progress through the WAL. That means that if WAL replay > isn't actually advancing for some reason, you just won't get any log > messages and you don't know whether it's advancing slowly or not at > all or the server is just hung. With that facility you can distinguish > those cases. > > Also, if for some reason we do think that amount of WAL replayed is > the right metric, rather than time, why would we only allow high=1 > segment and low=128 segments, rather than say any number of MB or GB > that the user would like to configure? > > I suggest that if log_startup_progress_interval doesn't meet your > needs here, we should try to understand why not and maybe enhance it, > instead of adding a separate facility. Thanks Robert! In a production environment (of course with a better management of server logs) one can set log_wal_traffic to "high" and emit the required info to answer some of the customer questions like - "How far the server is in recovery? How much time recovery of each WAL file approximately took? How much time will it take to recover all the WAL files? What's the rate of recovery - time per WAL file? etc." Whereas ereport_startup_progress facility will help to emit log messages only if "some" operation takes longer than set log_startup_progress_interval time which may not serve the above purpose. Actually, IMO a generic GUC log_file_processing_traffic = {none, medium, high} to help server emit logs for all the critical file processing operations - WAL file recovery (as proposed here in this thread), temp file processing during server startup or restarts (log_startup_progress_interval can't be used here as postmaster doesn't register for any timeouts) [1], snapshot and mapping file processing during checkpoint, temp relation files, removing old WAL files and so on. Thoughts? [1] https://www.postgresql.org/message-id/CALj2ACW-ELOF5JT2zPavs95wbZ0BrLPrqvSZ7Ac%2BpjxCkmXtEQ%40mail.gmail.com Regards, Bharath Rupireddy.
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
From
Robert Haas
Date:
On Tue, May 17, 2022 at 1:32 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > In a production environment (of course with a better management of > server logs) one can set log_wal_traffic to "high" and emit the > required info to answer some of the customer questions like - "How far > the server is in recovery? How much time recovery of each WAL file > approximately took? How much time will it take to recover all the WAL > files? What's the rate of recovery - time per WAL file? etc." > > Whereas ereport_startup_progress facility will help to emit log > messages only if "some" operation takes longer than set > log_startup_progress_interval time which may not serve the above > purpose. I think you are misunderstanding what an "operation" is here. If you take the time to test out that facility, I think you will find that it is quite capable of answering all of those questions. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
From
Nathan Bossart
Date:
On Fri, May 13, 2022 at 09:10:52AM -0400, Robert Haas wrote: > I suggest that if log_startup_progress_interval doesn't meet your > needs here, we should try to understand why not and maybe enhance it, > instead of adding a separate facility. +1. AFAICT it should be possible to make the log_startup_progress_interval machinery extensible enough to be reused for several other tasks that can take a while but have little visibility. Since this thread has been inactive for over 2 months, I'm marking the commitfest entry as Waiting on Author. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
Re: Add WAL recovery messages with log_wal_traffic GUC (was: add recovery, backup, archive, streaming etc. activity messages to server logs along with ps display)
From
Bharath Rupireddy
Date:
On Fri, May 13, 2022 at 6:41 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Apr 29, 2022 at 5:11 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > Here's the rebased v9 patch. > > This seems like it has enormous overlap with the existing > functionality that we have from log_startup_progress_interval. > > I think that facility is also better-designed than this one. It prints > out a message based on elapsed time, whereas this patch prints out a > message based progress through the WAL. That means that if WAL replay > isn't actually advancing for some reason, you just won't get any log > messages and you don't know whether it's advancing slowly or not at > all or the server is just hung. With that facility you can distinguish > those cases. > > Also, if for some reason we do think that amount of WAL replayed is > the right metric, rather than time, why would we only allow high=1 > segment and low=128 segments, rather than say any number of MB or GB > that the user would like to configure? > > I suggest that if log_startup_progress_interval doesn't meet your > needs here, we should try to understand why not and maybe enhance it, > instead of adding a separate facility. After thinking for a while, I agree with Robert and others that we could leverage the existing log_startup_progress_interval mechanism for reporting which WAL file currently is being replayed. I added current TLI (helps to construct the WAL file name from current LSN) and current WAL file source (helps to know where the WAL files was fetched from) to the existing "redo in progress, elapsed time:..." message. This very well serves the purpose of identifying the issues such as the restore command taking a lot of time (> log_startup_progress_interval for instance), WAL replay rate on standby or primary for long recoveries and so on. However, ereport_startup_progress isn't enabled on standby to not let it bloat the server logs. I believe the "redo in progress, elapsed time:..." message can provide some important info/metric for standby too and there's no way for the users to enable it on standbys today. For instance, users can know how well the standby fares in replaying, they can figure this out, by looking at two or more such messages. If enabled, with default value of 10 sec for log_startup_progress_interval, the standby can emit 8640 messages per day which is too much. I'm not sure if we are okay to change the default value of log_startup_progress_interval to say 1min or 5min so that 1440 messages are emitted. In production environments, typically users may or may not be interested if recovery takes just 10sec, but they really are interested if it takes in the order of minutes. Basically, I would like to enable "redo in progress, elapsed time:..." message for standbys too. Thoughts? PSA v10 patch with enhanced "redo in progress, elapsed time:..." message. Note that it's not a final patch though. Regards, Bharath Rupireddy.