Re: Log a warning in pg_createsubscriber for max_slot_wal_keep_size - Mailing list pgsql-hackers

From vignesh C
Subject Re: Log a warning in pg_createsubscriber for max_slot_wal_keep_size
Date
Msg-id CALDaNm1Hfp9R9Ej4tRY7dwiCZL=wN2nEOErco7VxMcq9BUQswg@mail.gmail.com
Whole thread Raw
In response to Re: Log a warning in pg_createsubscriber for max_slot_wal_keep_size  (vignesh C <vignesh21@gmail.com>)
List pgsql-hackers
On Mon, 30 Dec 2024 at 12:04, Shubham Khanna
<khannashubham1197@gmail.com> wrote:
>
> On Mon, Dec 30, 2024 at 10:10 AM vignesh C <vignesh21@gmail.com> wrote:
> >
> > On Mon, 30 Dec 2024 at 09:34, Shubham Khanna
> > <khannashubham1197@gmail.com> wrote:
> > >
> > > Hi,
> > >
> > > Currently, there is a risk that pg_createsubscriber may fail to
> > > complete successfully when the max_slot_wal_keep_size value is set too
> > > low. This can occur if the WAL is removed before the standby using the
> > > replication slot is able to complete replication, as the required WAL
> > > files are no longer available.
> > >
> > > I was able to reproduce this issue using the following steps:
> > > Set up a streaming replication environment.
> > > Run pg_createsubscriber in a debugger.
> > > Pause pg_createsubscriber at the setup_recovery stage.
> > > Perform several operations on the primary node to generate a large
> > > volume of WAL, causing older WAL segments to be removed due to the low
> > > max_slot_wal_keep_size setting.
> > > Once the necessary WAL segments are deleted, continue the execution of
> > > pg_createsubscriber.
> > > At this point, pg_createsubscriber fails with the following error:
> > > 2024-12-29 01:21:37.590 IST [427353] FATAL:  could not receive data
> > > from WAL stream: ERROR:  requested WAL segment
> > > 000000010000000000000003 has already been removed
> > > 2024-12-29 01:21:37.592 IST [427345] LOG:  waiting for WAL to become
> > > available at 0/3000110
> > > 2024-12-29 01:21:42.593 IST [427358] LOG:  started streaming WAL from
> > > primary at 0/3000000 on timeline 1
> > > 2024-12-29 01:21:42.593 IST [427358] FATAL:  could not receive data
> > > from WAL stream: ERROR:  requested WAL segment
> > > 000000010000000000000003 has already been removed
> > >
> > > This issue was previously reported in [1], with a suggestion to raise
> > > a warning in [2]. I’ve implemented a patch that logs a warning in
> > > dry-run mode. This will give users the opportunity to adjust the
> > > max_slot_wal_keep_size value before running the command.
> > >
> > > Thoughts?
> >
> > +1 for throwing a warning in dry-run mode
> >
> > Few comments:
> > 1) We can have this check only in dry-run mode, it is not required in
> > non dry-run mode as there is nothing much user can do once the tool is
> > running, we can change this:
> > +       if (max_slot_wal_keep_size != -1)
> > +       {
> > +               pg_log_warning("publisher requires
> > 'max_slot_wal_keep_size = -1', but only %d remain",
> > +                                          max_slot_wal_keep_size);
> > +               pg_log_warning_detail("Change the
> > 'max_slot_wal_keep_size' configuration on the publisher to -1.");
> > +       }
> >
> > to:
> > +       if (dry_run && max_slot_wal_keep_size != -1)
> > +       {
> > +               pg_log_warning("publisher requires
> > 'max_slot_wal_keep_size = -1', but only %d remain",
> > +                                          max_slot_wal_keep_size);
> > +               pg_log_warning_detail("Change the
> > 'max_slot_wal_keep_size' configuration on the publisher to -1.");
> > +       }
> >
> > 2) This error message is not quite right, can we change it to
> > "publisher requires max_slot_wal_keep_size to be -1, but is set to %d"
> > +       if (max_slot_wal_keep_size != -1)
> > +       {
> > +               pg_log_warning("publisher requires
> > 'max_slot_wal_keep_size = -1', but only %d remain",
> > +                                          max_slot_wal_keep_size);
> > +               pg_log_warning_detail("Change the
> > 'max_slot_wal_keep_size' configuration on the publisher to -1.");
> > +       }
> >
> > 3) Also the configuration could be specified in format specifier like
> > it is specified in the earlier case
> >
>
> I have fixed the given comments. The attached patch contains the
> suggested changes.

Few comments:
1) Since this configuration will be updated after reload, you can
reload instead of restarting:
+# Configure 'max_slot_wal_keep_size = 1' on the publisher and verify the
+# warning message
+$node_p->append_conf('postgresql.conf', 'max_slot_wal_keep_size = 1');
+$node_p->restart;

2) You can reset max_slot_wal_keep_size configuration after this test:
+       0,
+       [qr/./],
+       [
+               qr/pg_createsubscriber: warning: publisher requires
max_slot_wal_keep_size to be -1/,
+               qr/Change the configuration parameter
"max_slot_wal_keep_size" on the publisher to -1./,
+       ],
+       'Validate warning for misconfigured max_slot_wal_keep_size on
the publisher'
+);

3) We could update the comment to also mention the possibility of
required wal files being deleted with non-default
max_slot_wal_keep_size.
+       /* Validate max_slot_wal_keep_size */
+       if (dry_run && max_slot_wal_keep_size != -1)
+       {
+               pg_log_warning("publisher requires
max_slot_wal_keep_size to be -1, but is set to %d",
+                                          max_slot_wal_keep_size);
+               pg_log_warning_detail("Change the configuration
parameter \"%s\" on the publisher to %d.",
+
"max_slot_wal_keep_size", -1);
+       }

4) You can update the commit message saying "the possibility of
required wal files being deleted with non-default
max_slot_wal_keep_size" and also mention that this warning will be
raised in dry_run mode.

Regards,
Vignesh



pgsql-hackers by date:

Previous
From: Bertrand Drouvot
Date:
Subject: Re: Autovacuum giving up on tables after crash because of lack of stats
Next
From: Daniel Gustafsson
Date:
Subject: Re: Remove support for OpenSSL *eay32 libs on Windows