Re: POC: enable logical decoding when wal_level = 'replica' without a server restart - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Date
Msg-id CAD21AoB4txbyZm=8+pm9-o4XR8dCuzXdTP=3Ymr5k8T6YPjWMg@mail.gmail.com
Whole thread Raw
In response to RE: POC: enable logical decoding when wal_level = 'replica' without a server restart  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
Responses Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
List pgsql-hackers
od On Tue, Sep 2, 2025 at 8:11 PM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> Here are my comments.
>
> 01.
> ```
>         checkPoint.logicalDecodingEnabled = IsLogicalDecodingEnabled();
> ```
>
> Per my analysis, the value is always false here because StartupLogicalDecodingStatus
> is not called yet. Can we use "false" directly?

I think that it's better to read the shared flag instead of directly
setting false since LogicalDecodingCtl is already initialized.

>
> 02.
> ```
> elog(DEBUG1, "waiting for %d transactions to complete", running->xcnt);
> ```
>
> Here plural form is always used even if the running transaction is only one.
> How about something like:
> ```
> Number of transactions to wait finishing: %d
> ```

Hmm, not sure it improves the message. I think we don't care much
about plurals in debug messages. And in our convention the main log
message doesn't start a capital character.


>
> 03.
> ```
>                 while (RecoveryInProgress())
>                 {
>                         pgstat_report_wait_start(WAIT_EVENT_LOGICAL_DECODING_STATUS_CHANGE_DELAY);
>                         pg_usleep(100000L); /* wait for 100 msec */
>                         pgstat_report_wait_end();
>                 }
> ```
>
> I found a stuck case here: if a backend process within the loop and startup waits
> a signal is processed, both of them can stuck. The backend waits the recovery
> state to be DONE, and the startup waits all processes handle consume the signal.
> IIUC we must add CHECK_FOR_INTERRUPTS() or ProcessProcSignalBarrier().
>
> Actual steps:
>
> 0.  constructed a streaming replication system, which the only primary server had
>     a logical slot. I.e., the effective_wal_level was logical.
> 1.  connected to a standby node
> 2.  attached to the backend process via gdb
> 3.  added a breakpoint at create_logical_replication_slot
> 4.  called pg_create_logical_replication_slot() on the backend.
>     the backend will stop before ReplicationSlotCreate().
> 5.  from another terminal, attached to the startup process via gdb
> 6.  added a breakpoint at UpdateLogicalDecodingStatusEndOfRecovery()
> 7.  from another terminal, send a promote signal to the standby.
>     The startup will stop at UpdateLogicalDecodingStatusEndOfRecovery()
> 8.  executed steps on startup process, untill delay_status_change was updated
>     and LogicalDecodingControlLock was released.
> 9.  detached from the backend process. It would stop at the loop in
>     start_logical_decoding_status_change().
> 10. detached from the startup process. It would wait all processes handled the
>     signal, but the backend won't do.

Good find! I'll fix the problem by adding CHECK_FOR_INTERRUPTS() as
you suggested.



Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Emre Hasegeli
Date:
Subject: Allow using replication origins in SQL level parallel sessions
Next
From: Sami Imseih
Date:
Subject: Re: GetNamedLWLockTranche crashes on Windows in normal backend