Re: [PoC] pg_upgrade: allow to upgrade publisher node - Mailing list pgsql-hackers

From vignesh C
Subject Re: [PoC] pg_upgrade: allow to upgrade publisher node
Date
Msg-id CALDaNm1TLfZZ5aUnhP477KTdK7xgzwZJ2m0b=73nJNzQNM8SdA@mail.gmail.com
Whole thread Raw
In response to RE: [PoC] pg_upgrade: allow to upgrade publisher node  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Responses Re: [PoC] pg_upgrade: allow to upgrade publisher node
RE: [PoC] pg_upgrade: allow to upgrade publisher node
List pgsql-hackers
On Fri, 21 Jul 2023 at 13:00, Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear hackers,
>
> > Based on the above, we are considering that we delay the timing of shutdown for
> > logical walsenders. The preliminary workflow is:
> >
> > 1. When logical walsenders receives siginal from checkpointer, it consumes all
> >    of WAL records, change its state into WALSNDSTATE_STOPPING, and stop
> > doing
> >    anything.
> > 2. Then the checkpointer does the shutdown checkpoint
> > 3. After that postmaster sends signal to walsenders, same as current
> > implementation.
> > 4. Finally logical walsenders process the shutdown checkpoint record and update
> > the
> >   confirmed_lsn after the acknowledgement from subscriber.
> >   Note that logical walsenders don't have to send a shutdown checkpoint record
> >   to subscriber but following keep_alive will help us to increment the
> > confirmed_lsn.
> > 5. All tasks are done, they exit.
> >
> > This mechanism ensures that the confirmed_lsn of active slots is same as the
> > current
> > WAL location of old publisher, so that 0003 patch would become more simpler.
> > We would not have to calculate the acceptable difference anymore.
> >
> > One thing we must consider is that any WALs must not be generated while
> > decoding
> > the shutdown checkpoint record. It causes the PANIC. IIUC the record leads
> > SnapBuildSerializationPoint(), which just serializes snapbuild or restores from
> > it, so the change may be acceptable. Thought?
>
> I've implemented the ideas from my previous proposal, PSA another patch set.
> Patch 0001 introduces the state WALSNDSTATE_STOPPING to logical walsenders. The
> workflow remains largely the same as described in my previous post, with the
> following additions:
>
> * A flag has been added to track whether all the WALs have been flushed. The
>   logical walsender can only exit after the flag is set. This ensures that all
>   WALs are flushed before the termination of the walsender.
> * Cumulative statistics are now forcibly written before changing the state.
>   While the previous involved reporting stats upon process exit, the current approach
>   must report earlier due to the checkpointer's termination timing. See comments
>   in CheckpointerMain() and atop pgstat_before_server_shutdown().
> * At the end of processes, slots are now saved to disk.
>
>
> Patch 0002 adds --include-logical-replication-slots option to pg_upgrade,
> not changed from previous set.
>
> Patch 0003 adds a check function, which becomes simpler.
> The previous version calculated the "acceptable" difference between confirmed_lsn
> and the current WAL position. This was necessary because shutdown records could
> not be sent to subscribers, creating a disparity in these values. However, this
> approach had drawbacks, such as needing adjustments if record sizes changed.
>
> Now, the record can be sent to subscribers, so the hacking is not needed anymore,
> at least in the context of logical replication. The consistency is now maintained
> by the logical walsenders, so slots created by the backend could not be.
> We must consider what should be...
>
> How do you think?

Here is a patch which checks that there are no WAL records other than
CHECKPOINT_SHUTDOWN WAL record to be consumed based on the discussion
from [1].
Patch 0001 and 0002 is same as the patch posted by Kuroda-san, Patch
0003 exposes pg_get_wal_records_content to get the WAL records along
with the WAL record type between start and end lsn. pg_walinspect
contrib module already exposes a function for this requirement, I have
moved this functionality to be exposed from the backend. Patch 0004
has slight change in check function to check that there are no other
records other than CHECKPOINT_SHUTDOWN to be consumed. The attached
patch has the changes for the same.
Thoughts?

[1] - https://www.postgresql.org/message-id/CAA4eK1Kem-J5NM7GJCgyKP84pEN6RsG6JWo%3D6pSn1E%2BiexL1Fw%40mail.gmail.com

Regards,
Vignesh

Attachment

pgsql-hackers by date:

Previous
From: Anthonin Bonnefoy
Date:
Subject: Re: POC: Extension for adding distributed tracing - pg_tracing
Next
From: Ashutosh Bapat
Date:
Subject: Re: logical decoding and replication of sequences, take 2