Re: Fix slot synchronization with two_phase decoding enabled - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Fix slot synchronization with two_phase decoding enabled
Date
Msg-id CAD21AoA914uGuJScN8yYhrzpO5ePfWJsfQnSd+dB7Ev7HCOyBA@mail.gmail.com
Whole thread Raw
In response to Re: Fix slot synchronization with two_phase decoding enabled  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
On Sat, May 3, 2025 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> BF member mule just showed what seems an identical failure [1]:
>
> 2025-05-03 17:46:20.088 CEST [24308:3] LOG:  database system is ready to accept read-only connections
> 2025-05-03 17:46:20.091 CEST [24321:1] LOG:  slot sync worker started
> 2025-05-03 17:46:20.100 CEST [24322:1] LOG:  started streaming WAL from primary at 0/6000000 on timeline 1
> 2025-05-03 17:46:20.117 CEST [24321:2] LOG:  starting logical decoding for slot "lsub1_slot"
> 2025-05-03 17:46:20.117 CEST [24321:3] DETAIL:  Streaming transactions committing after 0/60049C8, reading WAL from
0/5000048.
> 2025-05-03 17:46:20.117 CEST [24321:4] LOG:  logical decoding found consistent point at 0/5000048
> 2025-05-03 17:46:20.117 CEST [24321:5] DETAIL:  There are no running transactions.
> TRAP: failed Assert("slot->data.two_phase_at <= slot->data.confirmed_flush"), File: "slotsync.c", Line: 311, PID:
24321
> postgres: standby1: slotsync worker (ExceptionalCondition+0x59)[0x556c3b8cb3e9]
> postgres: standby1: slotsync worker (+0x4cb100)[0x556c3b706100]
> postgres: standby1: slotsync worker (+0x4cba4c)[0x556c3b706a4c]
> postgres: standby1: slotsync worker (ReplSlotSyncWorkerMain+0x258)[0x556c3b707598]
> postgres: standby1: slotsync worker (postmaster_child_launch+0x102)[0x556c3b6d6962]
> postgres: standby1: slotsync worker (+0x49daea)[0x556c3b6d8aea]
> postgres: standby1: slotsync worker (+0x49fb3d)[0x556c3b6dab3d]
> postgres: standby1: slotsync worker (PostmasterMain+0xd3f)[0x556c3b6dc04f]
> postgres: standby1: slotsync worker (main+0x1ca)[0x556c3b3bee1a]
> /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7fe08da4824a]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7fe08da48305]
> postgres: standby1: slotsync worker (_start+0x21)[0x556c3b3bf421]
> 2025-05-03 17:46:20.337 CEST [24308:4] LOG:  slot sync worker process (PID 24321) was terminated by signal 6: Aborted
>

While I cannot be entirely certain of my analysis, I believe the root
cause might be related to the backward movement of the confirmed_flush
LSN. The following scenario seems possible:

1. The walsender enables the two_phase and sets two_phase_at (which
should be the same as confirmed_flush).
2. The slot's confirmed_flush regresses for some reason.
3. The slotsync worker retrieves the remote slot information and
enables two_phase for the local slot.

If I recall correctly, we previously discussed[1] how a slot's
confirmed_flush LSN could move backward depending on the feedback
messages received from the downstream system. Based on the following
primary server's logs, it appears possible that the subscriber sent
feedback LSNs older than the slot's confirmed_flush, consequently
causing the confirmed_flush to move backward. However, I should note
that this is currently a hypothesis, as I haven't yet been able to
reproduce and verify this scenario.

Regards,

[1] https://www.postgresql.org/message-id/85fff40e-148b-4e86-b921-b4b846289132%40vondra.me

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Tanin Na Nakorn
Date:
Subject: PATCH: pg_dump to support "on conflict do update"
Next
From: Jelte Fennema-Nio
Date:
Subject: Re: PG 18 release notes draft committed