Re: POC: enable logical decoding when wal_level = 'replica' without a server restart - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
Date | |
Msg-id | CAD21AoDgrgPzyApqTryFN0DqNUVADd8EtNB3Nw2SJaAOoq4KSQ@mail.gmail.com Whole thread Raw |
In response to | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
|
List | pgsql-hackers |
On Thu, Sep 4, 2025 at 11:15 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > od On Tue, Sep 2, 2025 at 8:11 PM Hayato Kuroda (Fujitsu) > <kuroda.hayato@fujitsu.com> wrote: > > > > Dear Sawada-san, > > > > Here are my comments. > > > > 01. > > ``` > > checkPoint.logicalDecodingEnabled = IsLogicalDecodingEnabled(); > > ``` > > > > Per my analysis, the value is always false here because StartupLogicalDecodingStatus > > is not called yet. Can we use "false" directly? > > I think that it's better to read the shared flag instead of directly > setting false since LogicalDecodingCtl is already initialized. > > > > > 02. > > ``` > > elog(DEBUG1, "waiting for %d transactions to complete", running->xcnt); > > ``` > > > > Here plural form is always used even if the running transaction is only one. > > How about something like: > > ``` > > Number of transactions to wait finishing: %d > > ``` > > Hmm, not sure it improves the message. I think we don't care much > about plurals in debug messages. And in our convention the main log > message doesn't start a capital character. > > > > > > 03. > > ``` > > while (RecoveryInProgress()) > > { > > pgstat_report_wait_start(WAIT_EVENT_LOGICAL_DECODING_STATUS_CHANGE_DELAY); > > pg_usleep(100000L); /* wait for 100 msec */ > > pgstat_report_wait_end(); > > } > > ``` > > > > I found a stuck case here: if a backend process within the loop and startup waits > > a signal is processed, both of them can stuck. The backend waits the recovery > > state to be DONE, and the startup waits all processes handle consume the signal. > > IIUC we must add CHECK_FOR_INTERRUPTS() or ProcessProcSignalBarrier(). > > > > Actual steps: > > > > 0. constructed a streaming replication system, which the only primary server had > > a logical slot. I.e., the effective_wal_level was logical. > > 1. connected to a standby node > > 2. attached to the backend process via gdb > > 3. added a breakpoint at create_logical_replication_slot > > 4. called pg_create_logical_replication_slot() on the backend. > > the backend will stop before ReplicationSlotCreate(). > > 5. from another terminal, attached to the startup process via gdb > > 6. added a breakpoint at UpdateLogicalDecodingStatusEndOfRecovery() > > 7. from another terminal, send a promote signal to the standby. > > The startup will stop at UpdateLogicalDecodingStatusEndOfRecovery() > > 8. executed steps on startup process, untill delay_status_change was updated > > and LogicalDecodingControlLock was released. > > 9. detached from the backend process. It would stop at the loop in > > start_logical_decoding_status_change(). > > 10. detached from the startup process. It would wait all processes handled the > > signal, but the backend won't do. > > Good find! I'll fix the problem by adding CHECK_FOR_INTERRUPTS() as > you suggested. I've attached the updated patch that incorporated all comments I got so far. FYI, I've been doing extensive long-running stability tests in my environment for several days. The testing setup involves a primary-standby replication configuration where logical decoding is repeatedly enabled and disabled on the primary server. I verify that there are no activation or deactivation processes failures and confirm that non-logical WAL records are written when logical decoding is enabled. Additionally, I perform repeated failovers too. While no issues have been identified so far, I plan to continue testing with varied workloads and configurations. One concern about this patch: although the patch's core concept is straightforward, I'm particularly concerned about the requirement to write a STATUS_CHANGE WAL record when dropping the last logical slot, which could occur during process shutdown (specifically via before_shmem callback). While the process forgets pending cancellations and holds interrupts during shutdown callback execution, the WAL record writing operation could involve waits and disk I/O etc. Although our testing hasn't revealed any related issues, I'm concerned this could become problematic in the future. I'd be happy to hear opinions on this matter and am also open to alternative approaches in terms of how to disable logical decoding. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
pgsql-hackers by date: