On Thu, Sep 28, 2023 at 1:24 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
>
> On Thu, Sep 28, 2023 at 1:06 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Thu, Sep 28, 2023 at 10:44 AM Bharath Rupireddy
> > <bharath.rupireddyforpostgres@gmail.com> wrote:
> > >
> > > > No, without that commit, there is a very high possibility that even if
> > > > we have sent the WAL to the subscriber and got the acknowledgment of
> > > > the same, we would miss updating it before shutdown. This would lead
> > > > to upgrade failures because upgrades have no way to later identify
> > > > whether the remaining WAL records are sent to the subscriber.
> > >
> > > Thanks for clarifying. I'm trying understand what happens without
> > > commit e0b2eed0 with an illustration:
> > >
> > > step 1: publisher - confirmed_flush LSN in replication slot on disk
> > > structure is 80
> > > step 2: publisher - sends WAL at LSN 100
> > > step 3: subscriber - acknowledges the apply LSN or confirmed_flush LSN as 100
> > > step 4: publisher - shuts down without writing the new confirmed_flush
> > > LSN as 100 to disk, note that commit e0b2eed0 is not in place
> > > step 5: publisher - restarts
> > > step 6: subscriber - upon publisher restart, the subscriber requests
> > > WAL from publisher from LSN 100 as it tracks the last applied LSN in
> > > replication origin
> > >
> > > Now, if the pg_upgrade with the patch in this thread is run on
> > > publisher after step 4, it complains with "The slot \"%s\" has not
> > > consumed the WAL yet".
> > >
> > > Is my above understanding right?
> > >
> >
> > Yes.
>
> Thanks. Trying things with replication lag - when there's a lag, the
> pg_upgrade can't proceed further and it complains "The slot "mysub"
> has not consumed the WAL yet".
>
> I think the best way to upgrade a postgres instance with logical
> replication slots is: 1) ensure no replication lag for the logical
> slots; 2) perform pg_upgrade --check first; 3) perform pg_upgrade if
> there are no complaints.
>
> With the above understanding, it looks to me that the commit e0b2eed0
> isn't necessary for back branches. Because, without it the pg_upgrade
> complains "The slot "mysub" has not consumed the WAL yet", and then
> the user has to restart the instance to ensure the WAL is consumed
> (IOW, to get the correct confirmed_flush LSN to the disk).
>
The point is it will be difficult for users to ensure that all the WAL
is consumed because it may have already been sent even after restart
and shutdown but the check will still fail. I think the argument to
support upgrade from branches where we don't have commit e0b2eed0 has
some merits and we can change the checks if there is broader agreement
on it. Let's try to agree on whether the core patch is good as is
especially what we want to achieve via validate_wal_records. Once we
agree on the main patch and commit it, the other work including
considering having an option to upgrade slots can be done as top-up
patches.
--
With Regards,
Amit Kapila.