Thread: Generalize ereport_startup_progress infrastructure
Hi, ereport_startup_progress infrastructure added by commit 9ce346e [1] will be super-useful for reporting progress of any long-running server operations, not just the startup process operations. For instance, postmaster can use it for reporting progress of temp file and temp relation file removals [2], checkpointer can use it for reporting progress of snapshot or mapping file processing or even WAL file processing and so on. And I'm sure there can be many places in the code where we have while or for loops which can, at times, take a long time to finish and having a log message there would definitely help. Here's an attempt to generalize the ereport_startup_progress infrastructure. The attached v1 patch places the code in elog.c/.h, renames associated functions and variables, something like ereport_startup_progress to ereport_progress, log_startup_progress_interval to log_progress_report_interval and so on. Thoughts? Thanks Robert for an offlist chat. [1] commit 9ce346eabf350a130bba46be3f8c50ba28506969 Author: Robert Haas <rhaas@postgresql.org> Date: Mon Oct 25 11:51:57 2021 -0400 Report progress of startup operations that take a long time. [2] https://www.postgresql.org/message-id/CALj2ACWeUFhhnDJKm6R5YxCsF4K7aB2pmRMvqP0BVTxdyce3EA%40mail.gmail.com -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/
Attachment
On Tue, Aug 2, 2022 at 3:25 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > ereport_startup_progress infrastructure added by commit 9ce346e [1] > will be super-useful for reporting progress of any long-running server > operations, not just the startup process operations. For instance, > postmaster can use it for reporting progress of temp file and temp > relation file removals [2], checkpointer can use it for reporting > progress of snapshot or mapping file processing or even WAL file > processing and so on. And I'm sure there can be many places in the > code where we have while or for loops which can, at times, take a long > time to finish and having a log message there would definitely help. > > Here's an attempt to generalize the ereport_startup_progress > infrastructure. The attached v1 patch places the code in elog.c/.h, > renames associated functions and variables, something like > ereport_startup_progress to ereport_progress, > log_startup_progress_interval to log_progress_report_interval and so > on. I'm not averse to reusing this infrastructure in other places, but I doubt we'd want all of those places to be controlled by a single GUC, especially because that GUC is also the on/off switch for the feature. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 8/2/22 8:40 PM, Robert Haas wrote: > CAUTION: This email originated from outside of the organization. Do not click links or open attachments unless you canconfirm the sender and know the content is safe. > > > > On Tue, Aug 2, 2022 at 3:25 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: >> ereport_startup_progress infrastructure added by commit 9ce346e [1] >> will be super-useful for reporting progress of any long-running server >> operations, not just the startup process operations. For instance, >> postmaster can use it for reporting progress of temp file and temp >> relation file removals [2], checkpointer can use it for reporting >> progress of snapshot or mapping file processing or even WAL file >> processing and so on. And I'm sure there can be many places in the >> code where we have while or for loops which can, at times, take a long >> time to finish and having a log message there would definitely help. >> >> Here's an attempt to generalize the ereport_startup_progress >> infrastructure. The attached v1 patch places the code in elog.c/.h, >> renames associated functions and variables, something like >> ereport_startup_progress to ereport_progress, >> log_startup_progress_interval to log_progress_report_interval and so >> on. > I'm not averse to reusing this infrastructure in other places, but I > doubt we'd want all of those places to be controlled by a single GUC, > especially because that GUC is also the on/off switch for the feature. +1 on the idea to generalize this infrastructure in other places. I also doubt about having one single GUC to control all the places: What about adding in the patch the calls to the new API where you think it could be useful too? (and in the same time make use of dedicated GUC(s) where it makes sense?) Regards, -- Bertrand Drouvot Amazon Web Services: https://aws.amazon.com
On Wed, Aug 3, 2022 at 12:11 AM Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Aug 2, 2022 at 3:25 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > ereport_startup_progress infrastructure added by commit 9ce346e [1] > > will be super-useful for reporting progress of any long-running server > > operations, not just the startup process operations. For instance, > > postmaster can use it for reporting progress of temp file and temp > > relation file removals [2], checkpointer can use it for reporting > > progress of snapshot or mapping file processing or even WAL file > > processing and so on. And I'm sure there can be many places in the > > code where we have while or for loops which can, at times, take a long > > time to finish and having a log message there would definitely help. > > > > Here's an attempt to generalize the ereport_startup_progress > > infrastructure. The attached v1 patch places the code in elog.c/.h, > > renames associated functions and variables, something like > > ereport_startup_progress to ereport_progress, > > log_startup_progress_interval to log_progress_report_interval and so > > on. > > I'm not averse to reusing this infrastructure in other places, but I > doubt we'd want all of those places to be controlled by a single GUC, > especially because that GUC is also the on/off switch for the feature. Thanks Robert! How about we tweak the function a bit - begin_progress_report_phase(int timeout), so that each process can use their own timeout interval? In this case, do we want to retain log_startup_progress_interval as-is specific to the startup process? If yes, other processes might come up with their own GUCs (if they don't want to use hard-coded timeouts) similar to log_startup_progress_interval, which isn't the right way IMO. I think the notion of ereport_progress feature being disabled when the timeout is 0, makes sense to me at least. On the flip side, what if we just have a single GUC log_progress_report_interval (as proposed in the v1 patch)? Do we ever want different processes to emit progress report messages at different frequencies? Well, I can think of the startup process during standby recovery needing to emit recovery progress report messages at a much lower frequency than the startup process during the crash recovery. Again, controlling the frequencies with different GUCs isn't the way forward. But we can do something like: process 1 emits messages with a frequency of 2*log_progress_report_interval, process 2 with a frequency 4*log_progress_report_interval and so on without needing additional GUCs. Thoughts? -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
On Thu, Aug 4, 2022 at 9:57 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, Aug 3, 2022 at 12:11 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Tue, Aug 2, 2022 at 3:25 AM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > ereport_startup_progress infrastructure added by commit 9ce346e [1] > > > will be super-useful for reporting progress of any long-running server > > > operations, not just the startup process operations. For instance, > > > postmaster can use it for reporting progress of temp file and temp > > > relation file removals [2], checkpointer can use it for reporting > > > progress of snapshot or mapping file processing or even WAL file > > > processing and so on. And I'm sure there can be many places in the > > > code where we have while or for loops which can, at times, take a long > > > time to finish and having a log message there would definitely help. > > > > > > Here's an attempt to generalize the ereport_startup_progress > > > infrastructure. The attached v1 patch places the code in elog.c/.h, > > > renames associated functions and variables, something like > > > ereport_startup_progress to ereport_progress, > > > log_startup_progress_interval to log_progress_report_interval and so > > > on. > > > > I'm not averse to reusing this infrastructure in other places, but I > > doubt we'd want all of those places to be controlled by a single GUC, > > especially because that GUC is also the on/off switch for the feature. > > Thanks Robert! How about we tweak the function a bit - > begin_progress_report_phase(int timeout), so that each process can use > their own timeout interval? In this case, do we want to retain > log_startup_progress_interval as-is specific to the startup process? > If yes, other processes might come up with their own GUCs (if they > don't want to use hard-coded timeouts) similar to > log_startup_progress_interval, which isn't the right way IMO. > > I think the notion of ereport_progress feature being disabled when the > timeout is 0, makes sense to me at least. > > On the flip side, what if we just have a single GUC > log_progress_report_interval (as proposed in the v1 patch)? Do we ever > want different processes to emit progress report messages at different > frequencies? Well, I can think of the startup process during standby > recovery needing to emit recovery progress report messages at a much > lower frequency than the startup process during the crash recovery. > Again, controlling the frequencies with different GUCs isn't the way > forward. But we can do something like: process 1 emits messages with a > frequency of 2*log_progress_report_interval, process 2 with a > frequency 4*log_progress_report_interval and so on without needing > additional GUCs. > > Thoughts? Here's v2 patch, passing progress report interval as an input to begin_progress_report_phase() so that the processes can use their own intervals(hard-coded or GUC) if they wish to not use the generic GUC log_progress_report_interval. Thoughts? -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
Attachment
On Mon, Aug 8, 2022 at 12:29 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > Here's v2 patch, passing progress report interval as an input to > begin_progress_report_phase() so that the processes can use their own > intervals(hard-coded or GUC) if they wish to not use the generic GUC > log_progress_report_interval. I don't think we should rename the GUC to be more generic. I like it the way that it is. I also think you should extend this patch series with 1 or 2 additional patches showing where else you think we should be using this infrastructure. If no such places exist, this is pointless. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, Aug 9, 2022 at 6:05 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Aug 8, 2022 at 12:29 AM Bharath Rupireddy > <bharath.rupireddyforpostgres@gmail.com> wrote: > > Here's v2 patch, passing progress report interval as an input to > > begin_progress_report_phase() so that the processes can use their own > > intervals(hard-coded or GUC) if they wish to not use the generic GUC > > log_progress_report_interval. > > I don't think we should rename the GUC to be more generic. I like it > the way that it is. Done. > I also think you should extend this patch series with 1 or 2 > additional patches showing where else you think we should be using > this infrastructure. > > If no such places exist, this is pointless. I'm attaching 0002 for reporting removal of temp files and temp relation files by postmaster. If this looks okay, I can code 0003 for reporting processing of snapshot, mapping and old WAL files by checkpointer. Thoughts? -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
Attachment
> > > Here's an attempt to generalize the ereport_startup_progress > > > infrastructure. The attached v1 patch places the code in elog.c/.h, > > > renames associated functions and variables, something like > > > ereport_startup_progress to ereport_progress, > > > log_startup_progress_interval to log_progress_report_interval and so > > > on. > > > > I'm not averse to reusing this infrastructure in other places, but I > > doubt we'd want all of those places to be controlled by a single GUC, > > especially because that GUC is also the on/off switch for the feature. > > Thanks Robert! How about we tweak the function a bit - > begin_progress_report_phase(int timeout), so that each process can use > their own timeout interval? In this case, do we want to retain > log_startup_progress_interval as-is specific to the startup process? > If yes, other processes might come up with their own GUCs (if they > don't want to use hard-coded timeouts) similar to > log_startup_progress_interval, which isn't the right way IMO. > > I think the notion of ereport_progress feature being disabled when the > timeout is 0, makes sense to me at least. > > On the flip side, what if we just have a single GUC > log_progress_report_interval (as proposed in the v1 patch)? Do we ever > want different processes to emit progress report messages at different > frequencies? Well, I can think of the startup process during standby > recovery needing to emit recovery progress report messages at a much > lower frequency than the startup process during the crash recovery. > Again, controlling the frequencies with different GUCs isn't the way > forward. But we can do something like: process 1 emits messages with a > frequency of 2*log_progress_report_interval, process 2 with a > frequency 4*log_progress_report_interval and so on without needing > additional GUCs. > > Thoughts? +1 for the idea to generalize the infrastructure. Given two options, option-1 is to use a single GUC across all kind of log running operations and option-2 is to use multiple GUCs (one for each kind of long running operations), I go with option-1 because if a user is interested to see a log message after every 10s for startup operations (or any other long running operations) then it is likely that he is interested to see other long running operations after every 10s only. It does not make sense to use different intervals for each kind of long running operation here. It also increases the number of GUCs which makes things complex. So it is a good idea to use a single GUC here. But I am worried about the on/off switch as Robert mentioned. How about using a new GUC to indicate features on/off. Say "log_long_running_operations" which contains a comma separated string which indicates the features to be enabled. For example, "log_long_running_operations = startup, postmaster" will enable logging for startup and postmaster operations and disables logging of other long running operations. With this the number of GUCs will be limited to 2 and it is simple and easy for the user. Thanks & Regards, Nitin Jadhav On Thu, Aug 4, 2022 at 9:58 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > > On Wed, Aug 3, 2022 at 12:11 AM Robert Haas <robertmhaas@gmail.com> wrote: > > > > On Tue, Aug 2, 2022 at 3:25 AM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > ereport_startup_progress infrastructure added by commit 9ce346e [1] > > > will be super-useful for reporting progress of any long-running server > > > operations, not just the startup process operations. For instance, > > > postmaster can use it for reporting progress of temp file and temp > > > relation file removals [2], checkpointer can use it for reporting > > > progress of snapshot or mapping file processing or even WAL file > > > processing and so on. And I'm sure there can be many places in the > > > code where we have while or for loops which can, at times, take a long > > > time to finish and having a log message there would definitely help. > > > > > > Here's an attempt to generalize the ereport_startup_progress > > > infrastructure. The attached v1 patch places the code in elog.c/.h, > > > renames associated functions and variables, something like > > > ereport_startup_progress to ereport_progress, > > > log_startup_progress_interval to log_progress_report_interval and so > > > on. > > > > I'm not averse to reusing this infrastructure in other places, but I > > doubt we'd want all of those places to be controlled by a single GUC, > > especially because that GUC is also the on/off switch for the feature. > > Thanks Robert! How about we tweak the function a bit - > begin_progress_report_phase(int timeout), so that each process can use > their own timeout interval? In this case, do we want to retain > log_startup_progress_interval as-is specific to the startup process? > If yes, other processes might come up with their own GUCs (if they > don't want to use hard-coded timeouts) similar to > log_startup_progress_interval, which isn't the right way IMO. > > I think the notion of ereport_progress feature being disabled when the > timeout is 0, makes sense to me at least. > > On the flip side, what if we just have a single GUC > log_progress_report_interval (as proposed in the v1 patch)? Do we ever > want different processes to emit progress report messages at different > frequencies? Well, I can think of the startup process during standby > recovery needing to emit recovery progress report messages at a much > lower frequency than the startup process during the crash recovery. > Again, controlling the frequencies with different GUCs isn't the way > forward. But we can do something like: process 1 emits messages with a > frequency of 2*log_progress_report_interval, process 2 with a > frequency 4*log_progress_report_interval and so on without needing > additional GUCs. > > Thoughts? > > -- > Bharath Rupireddy > RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/ > >
On Tue, Aug 9, 2022 at 11:54 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > I'm attaching 0002 for reporting removal of temp files and temp > relation files by postmaster. > > If this looks okay, I can code 0003 for reporting processing of > snapshot, mapping and old WAL files by checkpointer. I think that someone is going to complain about the changes to timeout.c. Some trouble has been taken to allow things like SetLatch(MyLatch) to be unconditional. Aside from that, I am unsure how generally safe it is to use the timeout infrastructure in the postmaster. From a user-interface point of view, log_postmaster_progress_interval seems a bit awkward. It's really quite narrow, basically just checking for one thing. I'm not sure I like adding a GUC for something that specific, although I also don't have another idea at the moment either. Hmm. Maybe the checkpointer is a better candidate, but somehow I feel that we can't consider this sort of thing separate from the existing progress reporting that checkpointer already does. Perhaps we need to think of changing or improving that in some way rather than adding something wholly new alongside the existing system. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Aug 10, 2022 at 11:00:20AM -0400, Robert Haas wrote: > Maybe the checkpointer is a better candidate, but somehow I feel that > we can't consider this sort of thing separate from the existing > progress reporting that checkpointer already does. Perhaps we need to > think of changing or improving that in some way rather than adding > something wholly new alongside the existing system. I agree that the checkpointer has a good chance of being a better candidate. Are you thinking of integrating this into log_checkpoints somehow? Perhaps this parameter could optionally accept an interval for logging the progress of ongoing checkpoints. -- Nathan Bossart Amazon Web Services: https://aws.amazon.com
On Wed, Aug 10, 2022 at 6:21 PM Nitin Jadhav <nitinjadhavpostgres@gmail.com> wrote: > > Given two options, option-1 is to use a single GUC across all kind of > log running operations and option-2 is to use multiple GUCs (one for > each kind of long running operations), I go with option-1 because if a > user is interested to see a log message after every 10s for startup > operations (or any other long running operations) then it is likely > that he is interested to see other long running operations after every > 10s only. It does not make sense to use different intervals for each > kind of long running operation here. It also increases the number of > GUCs which makes things complex. So it is a good idea to use a single > GUC here. +1. > But I am worried about the on/off switch as Robert > mentioned. Are you worried that users might want to switch off the progress report messages at process level, for instance, they want to log the startup process' long running operations progress but not, say, checkpointer or postmaster? IMO, a long running operation, if it is happening in any of the processes, is a concern for the users and having progress report log messages for them would help users debug any issues or improve observability of the server as a whole. With single GUC, the server log might contain progress reports of all the long running (wherever we use this ereport_progress()) operations in the entire server's lifecycle, which isn't bad IMO. I'd still vote for a single GUC log_progress_report_interval without worrying much about process-level enable/disable capability. However, let's hear what other hackers think about this. -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
On Wed, Aug 17, 2022 at 2:45 AM Nathan Bossart <nathandbossart@gmail.com> wrote: > > On Wed, Aug 10, 2022 at 11:00:20AM -0400, Robert Haas wrote: > > Maybe the checkpointer is a better candidate, but somehow I feel that > > we can't consider this sort of thing separate from the existing > > progress reporting that checkpointer already does. Perhaps we need to > > think of changing or improving that in some way rather than adding > > something wholly new alongside the existing system. > > I agree that the checkpointer has a good chance of being a better > candidate. Are you thinking of integrating this into log_checkpoints > somehow? Perhaps this parameter could optionally accept an interval for > logging the progress of ongoing checkpoints. Certainly the checkpointer is an immediate candidate. For instance, I can think of using ereport_progress() in CheckPointSnapBuild() for snapshot files processing, CheckPointLogicalRewriteHeap() for mapping files processing, BufferSync() for checkpointing dirty buffers (?), ProcessSyncRequests() for processing fsync() requests, RemoveOldXlogFiles(), RemoveNonParentXlogFiles()(?). I personally have seen cases where some of these checkpoint operations take a lot of time in production environments and a better observability would help a lot. However, I'm not sure if turning log_checkpoints to an integer type to use for checkpoint progress reporting is a good idea here. As I explained upthread [1], I'd vote for a single GUC at the entire server level. If the users/customers request per-process or per-operation progress report GUCs, we can then consider it. Thoughts? [1] https://www.postgresql.org/message-id/CALj2ACUJA73nCK_Li7v4_OOkRqwQBX14Fx2ALb7GDRwUTNGK-Q%40mail.gmail.com -- Bharath Rupireddy RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/
On Wed, Aug 17, 2022 at 4:30 AM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote: > As I explained upthread [1], I'd vote for a single GUC at the entire > server level. If the users/customers request per-process or > per-operation progress report GUCs, we can then consider it. Well, I don't agree that either of the proposed new uses of this infrastructure are the right way to solve the problems in question, so worrying about how to name the GUCs when we have a bunch of uses of this infrastructure seems to me to be premature. The proposed use in the postmaster doesn't look very safe, so you either need to give up on that or figure out a way to make it safe. The proposed use in the checkpointer looks like it needs more design work, because it's not clear whether or how it should interact with log_checkpoints. While I agree that changing log_checkpoints into an integer value doesn't necessarily make sense, having some kind of new checkpoint logging that is completely unrelated to existing checkpoint logging doesn't necessarily make sense to me either. I do have some sympathy with the idea that if people care about operations that unexpectedly run for a long time, they probably care about all of them, and probably don't care about changing the timeout or even the enable switch for each one individually. Honestly, it's not very clear to me who would want to ever turn off the startup progress stuff, or why they'd want to change the interval. I added a GUC for it out of an abundance of caution, but I don't know why you'd really want a different setting. Maybe there's some reason, but it's not clear to me. At the same time, I don't think the overall picture here is too clear yet. I'm reluctant to commit to a specific UI for a feature whose scope we don't seem to know. -- Robert Haas EDB: http://www.enterprisedb.com
On Wed, Aug 17, 2022 at 8:44 PM Robert Haas <robertmhaas@gmail.com> wrote: > > Well, I don't agree that either of the proposed new uses of this > infrastructure are the right way to solve the problems in question, so > worrying about how to name the GUCs when we have a bunch of uses of > this infrastructure seems to me to be premature. Agreed. > The proposed use in > the postmaster doesn't look very safe, so you either need to give up > on that or figure out a way to make it safe. Is registering a SIGALRM handler in postmaster not a good idea? Is setting the MyLatch conditionally [1] a concern? I agree that the handle_sig_alarm() code for postmaster may not look good as it holds interrupts and does a bunch of other things. But is it a bigger issue? > The proposed use in the > checkpointer looks like it needs more design work, because it's not > clear whether or how it should interact with log_checkpoints. While I > agree that changing log_checkpoints into an integer value doesn't > necessarily make sense, having some kind of new checkpoint logging > that is completely unrelated to existing checkpoint logging doesn't > necessarily make sense to me either. Hm. Yes, we cannot forget about log_checkpoints while considering adding more logs and controls with other GUCs. We could say that one needs to enable both log_checkpoints and the progress report GUC, but that's not great from usability perspective. > I do have some sympathy with the idea that if people care about > operations that unexpectedly run for a long time, they probably care > about all of them, and probably don't care about changing the timeout > or even the enable switch for each one individually. I've seen the cases myself and asked by many about the server being unresponsive in the cases where it processes files, for instance, temp files in postmaster after a restart or snapshot or mapping or BufferSync() during checkpoint where this sort of progress reporting would've helped. Thinking of another approach for reporting file processing alone - a GUC log_file_processing_traffic = {none, medium, high} or {0, 1, 2, ..... limit} that users can set to emit a file processing log after a certain number of files. It doesn't require a timeout mechanism, so it can be used by any process. But, it is specific to just files. Similar to above but a bit generic, not specific to just file processing, a GUC log_processing_traffic = {none, medium, high} or {0, 1, 2, ..... limit}. Thoughts? [1] /* * SIGALRM is always cause for waking anything waiting on the process * latch. + * + * Postmaster has no latch associated with it. */ - SetLatch(MyLatch); + if (MyLatch) + SetLatch(MyLatch); -- Bharath Rupireddy PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com