On Thu, Nov 30, 2023 at 8:40 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Nov 29, 2023 at 2:56 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > > > >
> > > > > Pushed!
> > > >
> > > > Hi all, the CF entry for this is marked RfC, and CI is trying to apply
> > > > the last patch committed. Is there further work that needs to be
> > > > re-attached and/or rebased?
> > > >
> > >
> > > No. I have marked it as committed.
> > >
> >
> > I found another failure related to the commit [1]. I think it is caused by
> > autovacuum. I want to propose a patch that disables autovacuum on the old publisher.
> >
> > For more details, please see below.
> >
> > # Analysis of the failure
> >
> > Summary: this failure occurs when autovacuum starts after the subscription
> > is disabled but before pg_upgrade is run.
> >
> > According to the regression output, pg_upgrade failed unexpectedly [2]. There is
> > no possibility that the slots were invalidated, so some WAL seems to have been
> > generated after the subscriber was disabled.
> >
> > Also, the server log of oldpub says that an autovacuum worker was terminated when
> > the server was stopped. This occurred after the walsender had released the logical
> > slots. The WAL records generated by the autovacuum workers could not be consumed by
> > the slots, so the function that checks whether the slots can be upgraded returned false.
> >
> > # How to reproduce
> >
> > I made a small file for reproducing the failure; please see reproduce.txt. It
> > contains changes that launch autovacuum workers very often and ensure that they
> > actually do some work. After applying it, I could reproduce the same failure every time.
> >
> > # How to fix
> >
> > I think it is sufficient to fix only the test code.
> > The easiest way is to disable autovacuum on the old publisher. PSA the patch file.
> >
>
> Agreed, for now, we should change the test as you proposed. I'll take
> care of that. However, I wonder if we should also ensure that
> autovacuum or any other worker is shut down before the walsender processes
> the last set of WAL before shutdown. We can analyze more on this and
> probably start a separate thread to discuss this point.
>
Sorry, my analysis was not complete. On looking closely, I think the
reason is that we are allowed to upgrade the slot iff there is no
pending WAL to be processed. The test first disables the subscription
to avoid unnecessary LOGs on the subscriber and then stops the
publisher node. It is quite possible that just before the shutdown of
the server, autovacuum generates some WAL records that need to be
processed, which is why you proposed disabling autovacuum for this test.
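To make the race concrete, here is a toy Python model of it. This is not PostgreSQL code: the class and function names are hypothetical, LSNs are plain integers, and treating the shutdown checkpoint as a record that needs no processing is an assumption of the model. It only encodes the rule stated above, i.e. the slot is upgradable iff no WAL after confirmed_flush still needs to be processed.

```python
from dataclasses import dataclass, field


@dataclass
class WalRecord:
    lsn: int                # toy LSN: a monotonically increasing integer
    source: str             # who wrote the record (for readability only)
    needs_processing: bool  # would logical decoding have to process it?


@dataclass
class OldPublisher:
    wal: list = field(default_factory=list)
    confirmed_flush: int = 0  # last LSN the subscriber has confirmed
    next_lsn: int = 1

    def write(self, source: str, needs_processing: bool = True) -> None:
        self.wal.append(WalRecord(self.next_lsn, source, needs_processing))
        self.next_lsn += 1

    def walsender_catch_up(self) -> None:
        # While the subscription is enabled, the walsender streams all WAL
        # and the subscriber confirms it, advancing confirmed_flush.
        if self.wal:
            self.confirmed_flush = self.wal[-1].lsn


def slot_can_be_upgraded(pub: OldPublisher) -> bool:
    # Hypothetical check: the slot is upgradable iff no WAL after
    # confirmed_flush still needs to be processed.
    return not any(
        r.lsn > pub.confirmed_flush and r.needs_processing for r in pub.wal
    )


def run_test(autovacuum_enabled: bool) -> bool:
    pub = OldPublisher()
    pub.write("INSERT on published table")
    pub.walsender_catch_up()  # subscriber is fully caught up
    # The subscription is disabled here, so confirmed_flush stops advancing.
    if autovacuum_enabled:
        pub.write("autovacuum")  # the race: WAL written just before shutdown
    # Assumption of this model: the shutdown checkpoint needs no processing.
    pub.write("shutdown checkpoint", needs_processing=False)
    return slot_can_be_upgraded(pub)


print(run_test(autovacuum_enabled=True))   # False: pending autovacuum WAL
print(run_test(autovacuum_enabled=False))  # True: nothing left to process
```

In this model, turning autovacuum off removes the only writer between disabling the subscription and stopping the server, so the check passes; it says nothing about the broader question of shutting down other workers before the walsender processes its final WAL.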
--
With Regards,
Amit Kapila.