RE: Fix slot synchronization with two_phase decoding enabled - Mailing list pgsql-hackers
From | Zhijie Hou (Fujitsu) |
---|---|
Subject | RE: Fix slot synchronization with two_phase decoding enabled |
Date | |
Msg-id | OS3PR01MB5718BC899AEE1D04755C6E7594832@OS3PR01MB5718.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | Re: Fix slot synchronization with two_phase decoding enabled (Amit Kapila <amit.kapila16@gmail.com>) |
List | pgsql-hackers |
On Tue, Apr 29, 2025 at 6:57 PM Amit Kapila wrote: > > On Tue, Apr 29, 2025 at 3:01 AM Masahiko Sawada > <sawada.mshk@gmail.com> wrote: > > > > Your fix looks good to me. Is it worth considering putting an > > assertion to verify if new two_phase_at is equal to or greater than > > confirmed_lsn (or at least it doesn't go backward), when syncing > > two_phase_at? > > > > Yeah, it makes sense. But the condition should be reverse (two_phase_at > should be less than or equal to confirmed_flush). I have done that, changed a > few comments, and committed the patch. I noticed a BF failure[1] related to the fix, where the recently added assertion Assert(slot->data.two_phase_at <= slot->data.confirmed_flush) was broken. After examining the logs and code, I couldn't identify any functionality issues. Here is my analysis: AFAICS, the last slotsync cycle should have retrieved the latest confirmed_flush and two_phase_at from the source slot, both expected to be 0/6005150. Here are some details: The standby's log[1] shows the source slot's two_phase_at as 0/6005150. Meanwhile, the publisher's log reveals that the source slot's confirmed_flush was already 0/6005150 before the last slotsync. Therefore, the last slotsync cycle should have fetched confirmed_flush: 0/6005150, two_phase_at: 0/6005150. If we assume remote_slot->confirmed_lsn during the last sync cycle is 0/6005150, then either the local slot's confirmed_lsn had already been updated to this value in a prior sync, leading it to skip the update in the last cycle (in which case, the assertion shouldn't be broken), or it should enter the slot update branch to set the synced slot to 0/6005150 (see update_local_synced_slot() for details). Thus, theoretically, the assertion wouldn't fail. Beyond functionality problems, another possibility might be the CPU reordered memory access, causing the assertion to execute before updating confirmed_flush. However, we lack the information needed to verify this, so one idea is to add some DEBUG message in update_local_synced_slot() to facilitate tracking if the issue recurs. I'll continue to investigate, but any suggestions or insights would be greatly appreciated. [1] https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=calliphoridae&dt=2025-04-29%2009%3A11%3A31&stg=recovery-check [2] 2025-04-29 09:18:05.641 UTC [1023004][client backend][0/2:0] LOG: statement: SELECT two_phase AND '0/6005150' = two_phase_atfrom pg_replication_slots WHERE slot_name = 'lsub1_slot' AND synced AND NOT temporary; [3] 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] LOG: 0/6004FC8 has been already streamed, forwarding to 0/6005150 2025-04-29 09:18:05.487 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8(proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.489 UTC [1022911][client backend][:0] LOG: disconnection: session time: 0:00:00.005 user=bf database=postgreshost=[local] 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] LOG: starting logical decoding for slot "lsub1_slot" 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] DETAIL: Streaming transactions committing after 0/6005150, readingWAL from 0/6004A00. 2025-04-29 09:18:05.516 UTC [1022913][walsender][29/0:0] STATEMENT: START_REPLICATION SLOT "lsub1_slot" LOGICAL 0/6004FC8(proto_version '4', streaming 'parallel', two_phase 'on', origin 'any', publication_names '"regress_mypub"') 2025-04-29 09:18:05.518 UTC [1022687][client backend][9/19:0] LOG: statement: SELECT slot_name, plugin, confirmed_flush_lsn,restart_lsn, catalog_xmin, two_phase, two_phase_at, failover, database, invalidation_reason FROM pg_catalog.pg_replication_slotsWHERE failover and NOT temporary 2025-04-29 09:18:05.524 UTC [1022928][not initialized][:0] LOG: connection received: host=[local] Best Regards, Hou zj
pgsql-hackers by date: