Re: [PoC] pg_upgrade: allow to upgrade publisher node - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: [PoC] pg_upgrade: allow to upgrade publisher node
Date
Msg-id CAD21AoCzY5MeP8VaqctGcMPeZra4R-SQA+-_Zy5N6XtvTpjPGw@mail.gmail.com
Whole thread Raw
In response to Re: [PoC] pg_upgrade: allow to upgrade publisher node  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: [PoC] pg_upgrade: allow to upgrade publisher node
Re: [PoC] pg_upgrade: allow to upgrade publisher node
List pgsql-hackers
On Thu, Aug 10, 2023 at 12:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Thu, Aug 10, 2023 at 6:46 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> >
> > On Wed, Aug 9, 2023 at 1:15 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Aug 9, 2023 at 8:01 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > I feel it would be a good idea to provide such a tool for users to
> > > avoid getting errors during upgrade but I think the upgrade code still
> > > needs to ensure that there are no WAL records between
> > > confirm_flush_lsn and SHUTDOWN_CHECKPOINT than required. Or, do you
> > > want to say that we don't do any verification check during the upgrade
> > > and let the data loss happens if the user didn't ensure that by
> > > running such a tool?
> >
> > I meant that if we can check the slot state file while the old cluster
> > stops, we can ensure there are no WAL records between slot's
> > confirmed_fluhs_lsn (in the state file) and the latest checkpoint (in
> > the control file).
> >
>
> Are you suggesting doing this before we start the old cluster or after
> we stop the old cluster? I was thinking about the pros and cons of
> doing this check when the server is 'on' (along with other upgrade
> checks something like the patch is doing now) versus when the server
> is 'off'. I think the advantage of doing it when the server is 'off'
> (after check_and_dump_old_cluster()) is that it will be ensured that
> there is no extra WAL that could be generated during the upgrade and
> has not been verified against confirmed_flush_lsn location. But OTOH,
> to retrieve slot information when the server is 'off', we need a
> separate utility or probably a functionality for the same in
> pg_upgrade and also some WAL reading stuff which sounds to me like a
> larger change that may not be warranted here. I think anyway the extra
> WAL (if any got generated during the upgrade) won't be required after
> the upgrade so not convinced to make such a check while the server is
> 'off'. Are there reasons which make it better to do this while the old
> cluster is 'off'?

What I imagined is that we do this check before
check_and_dump_old_cluster() while the server is 'off'. Reading the
slot state file would be simple and I guess we would not need a tool
or cli program for that. We need to expose RepliactionSlotOnDisk,
though. After reading the control file and the slots' state files we
check if slot's confirmed_flush_lsn matches the latest checkpoint LSN
in the control file (BTW maybe we can get slot name and plugin name
here instead of using pg_dump?). Extra WAL records could be generated
only after this check, so we wouldn't need to worry about that for
slots for logical replication. As for non-logical replication slots,
we would need some WAL reading stuff, but I'm not sure we need it for
the first commit. Or another idea would be to allow users to mark
replication slots "upgradable" so that pg_upgrade skips the
confirmed_flush_lsn check.

BTW this check would not be able to support live-check but I think
it's not a problem as this check with a running server will never be
able to pass.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Jimmy Angelakos
Date:
Subject: Re: Adding a pg_servername() function
Next
From: Jelte Fennema
Date:
Subject: Re: proposal: psql: show current user in prompt