Thread: Re: Log a warning in pg_createsubscriber for max_slot_wal_keep_size
On Mon, 30 Dec 2024 at 09:34, Shubham Khanna <khannashubham1197@gmail.com> wrote: > > Hi, > > Currently, there is a risk that pg_createsubscriber may fail to > complete successfully when the max_slot_wal_keep_size value is set too > low. This can occur if the WAL is removed before the standby using the > replication slot is able to complete replication, as the required WAL > files are no longer available. > > I was able to reproduce this issue using the following steps: > Set up a streaming replication environment. > Run pg_createsubscriber in a debugger. > Pause pg_createsubscriber at the setup_recovery stage. > Perform several operations on the primary node to generate a large > volume of WAL, causing older WAL segments to be removed due to the low > max_slot_wal_keep_size setting. > Once the necessary WAL segments are deleted, continue the execution of > pg_createsubscriber. > At this point, pg_createsubscriber fails with the following error: > 2024-12-29 01:21:37.590 IST [427353] FATAL: could not receive data > from WAL stream: ERROR: requested WAL segment > 000000010000000000000003 has already been removed > 2024-12-29 01:21:37.592 IST [427345] LOG: waiting for WAL to become > available at 0/3000110 > 2024-12-29 01:21:42.593 IST [427358] LOG: started streaming WAL from > primary at 0/3000000 on timeline 1 > 2024-12-29 01:21:42.593 IST [427358] FATAL: could not receive data > from WAL stream: ERROR: requested WAL segment > 000000010000000000000003 has already been removed > > This issue was previously reported in [1], with a suggestion to raise > a warning in [2]. I’ve implemented a patch that logs a warning in > dry-run mode. This will give users the opportunity to adjust the > max_slot_wal_keep_size value before running the command. > > Thoughts? +1 for throwing a warning in dry-run mode Few comments: 1) We can have this check only in dry-run mode, it is not required in non dry-run mode as there is nothing much user can do once the tool is running, we can change this: + if (max_slot_wal_keep_size != -1) + { + pg_log_warning("publisher requires 'max_slot_wal_keep_size = -1', but only %d remain", + max_slot_wal_keep_size); + pg_log_warning_detail("Change the 'max_slot_wal_keep_size' configuration on the publisher to -1."); + } to: + if (dry_run && max_slot_wal_keep_size != -1) + { + pg_log_warning("publisher requires 'max_slot_wal_keep_size = -1', but only %d remain", + max_slot_wal_keep_size); + pg_log_warning_detail("Change the 'max_slot_wal_keep_size' configuration on the publisher to -1."); + } 2) This error message is not quite right, can we change it to "publisher requires max_slot_wal_keep_size to be -1, but is set to %d" + if (max_slot_wal_keep_size != -1) + { + pg_log_warning("publisher requires 'max_slot_wal_keep_size = -1', but only %d remain", + max_slot_wal_keep_size); + pg_log_warning_detail("Change the 'max_slot_wal_keep_size' configuration on the publisher to -1."); + } 3) Also the configuration could be specified in format specifier like it is specified in the earlier case Regards, Vignesh
On Mon, 30 Dec 2024 at 12:04, Shubham Khanna <khannashubham1197@gmail.com> wrote: > > On Mon, Dec 30, 2024 at 10:10 AM vignesh C <vignesh21@gmail.com> wrote: > > > > On Mon, 30 Dec 2024 at 09:34, Shubham Khanna > > <khannashubham1197@gmail.com> wrote: > > > > > > Hi, > > > > > > Currently, there is a risk that pg_createsubscriber may fail to > > > complete successfully when the max_slot_wal_keep_size value is set too > > > low. This can occur if the WAL is removed before the standby using the > > > replication slot is able to complete replication, as the required WAL > > > files are no longer available. > > > > > > I was able to reproduce this issue using the following steps: > > > Set up a streaming replication environment. > > > Run pg_createsubscriber in a debugger. > > > Pause pg_createsubscriber at the setup_recovery stage. > > > Perform several operations on the primary node to generate a large > > > volume of WAL, causing older WAL segments to be removed due to the low > > > max_slot_wal_keep_size setting. > > > Once the necessary WAL segments are deleted, continue the execution of > > > pg_createsubscriber. > > > At this point, pg_createsubscriber fails with the following error: > > > 2024-12-29 01:21:37.590 IST [427353] FATAL: could not receive data > > > from WAL stream: ERROR: requested WAL segment > > > 000000010000000000000003 has already been removed > > > 2024-12-29 01:21:37.592 IST [427345] LOG: waiting for WAL to become > > > available at 0/3000110 > > > 2024-12-29 01:21:42.593 IST [427358] LOG: started streaming WAL from > > > primary at 0/3000000 on timeline 1 > > > 2024-12-29 01:21:42.593 IST [427358] FATAL: could not receive data > > > from WAL stream: ERROR: requested WAL segment > > > 000000010000000000000003 has already been removed > > > > > > This issue was previously reported in [1], with a suggestion to raise > > > a warning in [2]. I’ve implemented a patch that logs a warning in > > > dry-run mode. This will give users the opportunity to adjust the > > > max_slot_wal_keep_size value before running the command. > > > > > > Thoughts? > > > > +1 for throwing a warning in dry-run mode > > > > Few comments: > > 1) We can have this check only in dry-run mode, it is not required in > > non dry-run mode as there is nothing much user can do once the tool is > > running, we can change this: > > + if (max_slot_wal_keep_size != -1) > > + { > > + pg_log_warning("publisher requires > > 'max_slot_wal_keep_size = -1', but only %d remain", > > + max_slot_wal_keep_size); > > + pg_log_warning_detail("Change the > > 'max_slot_wal_keep_size' configuration on the publisher to -1."); > > + } > > > > to: > > + if (dry_run && max_slot_wal_keep_size != -1) > > + { > > + pg_log_warning("publisher requires > > 'max_slot_wal_keep_size = -1', but only %d remain", > > + max_slot_wal_keep_size); > > + pg_log_warning_detail("Change the > > 'max_slot_wal_keep_size' configuration on the publisher to -1."); > > + } > > > > 2) This error message is not quite right, can we change it to > > "publisher requires max_slot_wal_keep_size to be -1, but is set to %d" > > + if (max_slot_wal_keep_size != -1) > > + { > > + pg_log_warning("publisher requires > > 'max_slot_wal_keep_size = -1', but only %d remain", > > + max_slot_wal_keep_size); > > + pg_log_warning_detail("Change the > > 'max_slot_wal_keep_size' configuration on the publisher to -1."); > > + } > > > > 3) Also the configuration could be specified in format specifier like > > it is specified in the earlier case > > > > I have fixed the given comments. The attached patch contains the > suggested changes. Few comments: 1) Since this configuration will be updated after reload, you can reload instead of restarting: +# Configure 'max_slot_wal_keep_size = 1' on the publisher and verify the +# warning message +$node_p->append_conf('postgresql.conf', 'max_slot_wal_keep_size = 1'); +$node_p->restart; 2) You can reset max_slot_wal_keep_size configuration after this test: + 0, + [qr/./], + [ + qr/pg_createsubscriber: warning: publisher requires max_slot_wal_keep_size to be -1/, + qr/Change the configuration parameter "max_slot_wal_keep_size" on the publisher to -1./, + ], + 'Validate warning for misconfigured max_slot_wal_keep_size on the publisher' +); 3) We could update the comment to also mention the possibility of required wal files being deleted with non-default max_slot_wal_keep_size. + /* Validate max_slot_wal_keep_size */ + if (dry_run && max_slot_wal_keep_size != -1) + { + pg_log_warning("publisher requires max_slot_wal_keep_size to be -1, but is set to %d", + max_slot_wal_keep_size); + pg_log_warning_detail("Change the configuration parameter \"%s\" on the publisher to %d.", + "max_slot_wal_keep_size", -1); + } 4) You can update the commit message saying "the possibility of required wal files being deleted with non-default max_slot_wal_keep_size" and also mention that this warning will be raised in dry_run mode. Regards, Vignesh
Hi Shubham. Here are some review comments for the patch v3-0001. ====== Commit message. 1. I can't pinpoint anything specifically wrong about this commit message, but it seems to be repeating the same information over and over. ====== Missing pg_createsubscriber documentation? 2. I thought any prerequisite for 'max_slot_wal_keep_size' should be documented here [1] along with the other setting requirements. ====== src/bin/pg_basebackup/pg_createsubscriber.c 3. - " pg_catalog.current_setting('max_prepared_transactions')"); + " pg_catalog.current_setting('max_prepared_transactions')," + " pg_catalog.current_setting('max_slot_wal_keep_size')"); The code comment above this SQL looks like it should also mention the 'max_slot_wal_keep_size' value requirement. ~~~ 4. +/* + * Validate max_slot_wal_keep_size + * Logical replication requires max_slot_wal_keep_size to be set to -1 on the + * publisher to prevent the deletion of WAL files that are still needed by + * replication slots. If this parameter is set to a non-default value, it may + * cause replication failures due to required WAL files being + * prematurely removed. + */ + if (dry_run && max_slot_wal_keep_size != -1) + { + pg_log_warning("publisher requires max_slot_wal_keep_size to be -1, but is set to %d", + max_slot_wal_keep_size); + pg_log_warning_detail("Change the configuration parameter \"%s\" on the publisher to %d.", + "max_slot_wal_keep_size", -1); + } + 4a. The code comment is not indented correctly. ~ 4b. It seems strange that the 'max_slot_wal_keep_size' GUC name is not substituted in the pg_log_warning, but it is in the pg_log_warning_detail. Furthermore, GUC names appearing in messages should always be quoted. ~ 4c. Was there some reason to make this only a warning, instead of an error? The risk "may cause replication failures" seems quite serious, so in that case it might be a poor user experience if we are effectively saying: "Too bad, it's all broken now; we warned you (only if you tried a dry run), but you did not listen". In other words, why not make this an error condition up-front to entirely remove any chance of this failure? ====== [1] https://www.postgresql.org/docs/17/app-pgcreatesubscriber.html Kind Regards, Peter Smith. Fujitsu Australia
Hi Shubham. Here are some review comments for the patch v4-0001. ====== Commit message. 1. The 'pg_createsubscriber' utility is updated to fetch and validate the 'max_slot_wal_keep_size' setting from the publisher. A warning is raised during the '--dry-run' mode if the configuration is set to a non-default value. ~ This says a "warning is raised", but patch v4 changed the warning to an error, so this description is incompatible with the code. ====== doc/src/sgml/ref/pg_createsubscriber.sgml 2. + <para> + The 'max_slot_wal_keep_size' must be set to -1 to prevent the automatic + removal of WAL files needed by replication slots. Setting this parameter to + a specific size may lead to replication failures if required WAL files are + prematurely deleted. + </para> The 'max_slot_wal_keep_size' should not be single-quoted like that. It should use the same markup as the other nearby GUC names and also have a link like those others do. ====== src/bin/pg_basebackup/pg_createsubscriber.c 3. +/* + * Validate max_slot_wal_keep_size + * Logical replication requires max_slot_wal_keep_size to be set to -1 on the + * publisher to prevent the deletion of WAL files that are still needed by + * replication slots. If this parameter is set to a non-default value, it may + * cause replication failures due to required WAL files being prematurely + * removed. + */ 3a. Not fixed? For patch v3 I had already posted [1-#4a] that this entire comment is wrongly indented. ~ 3b. That first sentence looks like it is missing a period and a following blank line. OTOH, maybe the first sentence is not even necessary. ~~~ 4. + if (dry_run && max_slot_wal_keep_size != -1) + { + pg_log_error("publisher requires max_slot_wal_keep_size to be -1, but only %d remain", + max_slot_wal_keep_size); + pg_log_error_hint("Change the configuration parameter \"%s\" on the publisher to %d.", + "max_slot_wal_keep_size", -1); + failed = true; + } + 4a. Not fixed? For patch v3 I had already posted [1-#4b] that it seemed strange you did not use the \"%s\" substitution for the GUC name of the first message (unlike the 2nd message). I also think it is strange that the 1st message uses a hardwired -1, but the 2nd message uses a parameter for the -1. ~~ 4b. I don't think "but only %d remain" is the appropriate wording for this GUC. It looks like a cut/paste mistake from a previous message. Anyway, maybe this message should be something quite different so it is not just duplicating the same information as the error_hint. e.g. see below for one idea. This could also resolve the previous review comment. SUGGESTION "publisher requires WAL size must not be restricted" ~ 4c. I saw that this only emits the message in dry-mode, which is apparently based on the suggestion from Vignesh [2-#1]. Meanwhile, all the comments/docs say "it may cause replication failures", so I felt it might be better to always stop everything up-front if there is a wrong configuration, instead of waiting for potential failures to happen -- that's why I had suggested using error instead of a warning, but maybe I am in the minority. IIUC there are two ways to address this configuration problem: A. Give warnings, but only in dry-run mode. If the warnings are ignored (or if the dry-run was not even done), there may be replication failures later, but we HOPE it will not happen. B. Give errors in all modes. The early config error prevents any chance of later replication errors. So, I think the implementation needs to choose either A or B. Currently, it seems a mixture. ====== [1] my v3 review - https://www.postgresql.org/message-id/CAHut%2BPsgEphCa-P-3Q7cORA%3D1TRv15UUsaxN9ZkX56M1-J_QRg%40mail.gmail.com [2] https://www.postgresql.org/message-id/CALDaNm09cRzke52UN5zx33PT390whU92oXY4gfOSZEo17CLPjw%40mail.gmail.com Kind Regards, Peter Smith. Fujitsu Australia
Hi Shubham, Here are some review comments for patch v5-0001. ====== doc/src/sgml/ref/pg_createsubscriber.sgml 1. + <para> + The source server must have <xref linkend="guc-max-slot-wal-keep-size"/> to + be set to <literal>-1</literal> to prevent the automatic removal of WAL + replication slots. Setting this parameter to files needed by a specific size + may lead to replication failures if required WAL files are + prematurely deleted. + </para> IIUC, this problem is all about the removal of WAL *files*, not "WAL replication slots". Also, the "Setting this parameter to files needed by a specific size" part sounds strange. I think this can be simplified anyhow like below. SUGGESTION: Replication failures can occur if required WAL files are prematurely deleted. To prevent this, the source server must set <xref linkend="guc-max-slot-wal-keep-size"/> to <literal>-1</literal>, ensuring WAL files are not automatically removed. ====== src/bin/pg_basebackup/pg_createsubscriber.c check_publisher: 2. + if (dry_run && max_slot_wal_keep_size != -1) + { + pg_log_warning("publisher requires WAL size must not be restricted"); + pg_log_warning_detail("The max_slot_wal_keep_size parameter is currently set to a non-default value, which may lead to replication failures. " + "This parameter must be set to -1 to ensure that required WAL files are not prematurely removed."); + } 2a. Whenever you mention a GUC name in a PostgreSQL message then (by convention) it must be double-quoted. ~ 2b. The detailed message seems verbose and repetitive to me. ~ 2c. The other nearby messages use the terminology "configuration parameter", so this should be the same. ~ 2d. The other nearby messages use \"%s\" substitution for the GUC name, so this should be the same. ~ 2e. IMO advising the user to change a configuration parameter should be done using the pg_log_warning_hint function (e.g. _hint, not _details). ~~ Result: CURRENT (pg_log_warning_details) The max_slot_wal_keep_size parameter is currently set to a non-default value, which may lead to replication failures. This parameter must be set to -1 to ensure that required WAL files are not prematurely removed. SUGGESTION (pg_log_warning_hint) Set the configuration parameter \"%s\" to -1 to ensure that required WAL files are not prematurely removed. ====== Kind Regards, Peter Smith. Fujitsu Australia
RE: Log a warning in pg_createsubscriber for max_slot_wal_keep_size
From
"Hayato Kuroda (Fujitsu)"
Date:
Dear Shubham, Thanks for creating a patch. I have one comment about it. check_publisher() assumed that the SQL function `pg_catalog.current_setting('max_slot_wal_keep_size')` will return the numeric, but it just return the text representation. I.e., if the parameter is set to 10MB, the function returns like below: ``` postgres=# SELECT * FROM pg_catalog.current_setting('max_slot_wal_keep_size'); current_setting ----------------- 10MB (1 row) ``` Your patch can work well because atoi() ignores the latter part of the string, e.g., "10MB" is converted to "10", but this is not clean. I suggest either of 1) accepting the value as the string, or 2) using an SQL function pg_size_bytes() to get max_slot_wal_keep_size. Best regards, Hayato Kuroda FUJITSU LIMITED