Home > mailing lists

Re: Fix slot synchronization with two_phase decoding enabled - Mailing list pgsql-hackers

From	Masahiko Sawada
Subject	Re: Fix slot synchronization with two_phase decoding enabled
Date	May 4 12:03:07
Msg-id	CAD21AoA914uGuJScN8yYhrzpO5ePfWJsfQnSd+dB7Ev7HCOyBA@mail.gmail.com Whole thread Raw
In response to	Re: Fix slot synchronization with two_phase decoding enabled (Tom Lane <tgl@sss.pgh.pa.us>)
List	pgsql-hackers

Tree view

On Sat, May 3, 2025 at 10:23 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> BF member mule just showed what seems an identical failure [1]:
>
> 2025-05-03 17:46:20.088 CEST [24308:3] LOG:  database system is ready to accept read-only connections
> 2025-05-03 17:46:20.091 CEST [24321:1] LOG:  slot sync worker started
> 2025-05-03 17:46:20.100 CEST [24322:1] LOG:  started streaming WAL from primary at 0/6000000 on timeline 1
> 2025-05-03 17:46:20.117 CEST [24321:2] LOG:  starting logical decoding for slot "lsub1_slot"
> 2025-05-03 17:46:20.117 CEST [24321:3] DETAIL:  Streaming transactions committing after 0/60049C8, reading WAL from
0/5000048.
> 2025-05-03 17:46:20.117 CEST [24321:4] LOG:  logical decoding found consistent point at 0/5000048
> 2025-05-03 17:46:20.117 CEST [24321:5] DETAIL:  There are no running transactions.
> TRAP: failed Assert("slot->data.two_phase_at <= slot->data.confirmed_flush"), File: "slotsync.c", Line: 311, PID:
24321
> postgres: standby1: slotsync worker (ExceptionalCondition+0x59)[0x556c3b8cb3e9]
> postgres: standby1: slotsync worker (+0x4cb100)[0x556c3b706100]
> postgres: standby1: slotsync worker (+0x4cba4c)[0x556c3b706a4c]
> postgres: standby1: slotsync worker (ReplSlotSyncWorkerMain+0x258)[0x556c3b707598]
> postgres: standby1: slotsync worker (postmaster_child_launch+0x102)[0x556c3b6d6962]
> postgres: standby1: slotsync worker (+0x49daea)[0x556c3b6d8aea]
> postgres: standby1: slotsync worker (+0x49fb3d)[0x556c3b6dab3d]
> postgres: standby1: slotsync worker (PostmasterMain+0xd3f)[0x556c3b6dc04f]
> postgres: standby1: slotsync worker (main+0x1ca)[0x556c3b3bee1a]
> /lib/x86_64-linux-gnu/libc.so.6(+0x2724a)[0x7fe08da4824a]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x85)[0x7fe08da48305]
> postgres: standby1: slotsync worker (_start+0x21)[0x556c3b3bf421]
> 2025-05-03 17:46:20.337 CEST [24308:4] LOG:  slot sync worker process (PID 24321) was terminated by signal 6: Aborted
>

While I cannot be entirely certain of my analysis, I believe the root
cause might be related to the backward movement of the confirmed_flush
LSN. The following scenario seems possible:

1. The walsender enables the two_phase and sets two_phase_at (which
should be the same as confirmed_flush).
2. The slot's confirmed_flush regresses for some reason.
3. The slotsync worker retrieves the remote slot information and
enables two_phase for the local slot.

If I recall correctly, we previously discussed[1] how a slot's
confirmed_flush LSN could move backward depending on the feedback
messages received from the downstream system. Based on the following
primary server's logs, it appears possible that the subscriber sent
feedback LSNs older than the slot's confirmed_flush, consequently
causing the confirmed_flush to move backward. However, I should note
that this is currently a hypothesis, as I haven't yet been able to
reproduce and verify this scenario.

Regards,

[1] https://www.postgresql.org/message-id/85fff40e-148b-4e86-b921-b4b846289132%40vondra.me

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

pgsql-hackers by date:

From: Tanin Na Nakorn
Date: 04 May, 08:47:53
Subject: PATCH: pg_dump to support "on conflict do update"

From: Jelte Fennema-Nio
Date: 04 May, 12:19:36
Subject: Re: PG 18 release notes draft committed

Re: Fix slot synchronization with two_phase decoding enabled - Mailing list pgsql-hackers

Previous

Next