Re: POC: enable logical decoding when wal_level = 'replica' without a server restart - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
Date | |
Msg-id | CAD21AoDtfZ0P_zMNauVM4FBXrQx-yU7ms-Rcem2b2RusKeWn8A@mail.gmail.com |
In response to | RE: POC: enable logical decoding when wal_level = 'replica' without a server restart ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>) |
List | pgsql-hackers |
On Wed, Aug 27, 2025 at 5:08 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> Thanks for updating the patch. Here are my comments.

Thank you for reviewing the patch!

>
> xlog_desc()
> ```
> else if (info == XLOG_LOGICAL_DECODING_STATUS_CHANGE)
> {
>     bool enabled;
>
>     memcpy(&enabled, rec, sizeof(bool));
>     appendStringInfo(buf, enabled ? "true" : "false");
> }
> ```
>
> Per 2075ba9, appendStringInfoString() can be used if we do not have other messages.

Agreed, will fix.

>
> logicalctl.h
> ```
> extern void UpdateNumberOfLogicalSlots(bool incr);
> ```
>
> This function is not implemented.

Removed.

>
> UpdateLogicalDecodingStatus()
> ```
> elog(DEBUG1, "update logical decoding status to %d", new_status);
> ```
>
> I prefer to use true/false instead of 1/0, thoughts?

I think we don't necessarily need it as it's a debug log.

> xlog_redo()
> ```
> /* Update the status on shared memory */
> memcpy(&logical_decoding, XLogRecGetData(record), sizeof(bool));
> UpdateLogicalDecodingStatus(logical_decoding, true);
>
> if (InRecovery && InHotStandby)
> {
>     if (!logical_decoding)
>     {
>         /*
>          * Invalidate logical slots if we are in hot standby and the
>          * primary disabled the logical decoding.
>          */
>         InvalidateObsoleteReplicationSlots(RS_INVAL_WAL_LEVEL,
>                                            0, InvalidOid,
>                                            InvalidTransactionId);
>
> ```
>
> Assuming that logical_decoding written in the WAL is false here, and a logical
> replication slot is created just after that. In my experiments below happened:
> Let me clarify each step:
> 1. startup process updated logical_decoding_enabled to false, at line 8652.

I assume that logical_decoding_enabled was enabled before step 1.

> 2. slotsync worker started to sync. Surprisingly, it created a (second) logical
> slot and started logical decoding with fast_forward mode.

I guess that the postmaster launched the slotsync worker before the
startup process changed the status, since logical decoding was enabled
as I mentioned above, which seems fine to me.

> 3. startup invalidated logical slots due to the wal_level. the slot created at
> step2 was automatically dropped, because it was not sync-ready yet.
> 4. startup process shut down the slotsync worker.
> 5. startup process read the STATUS_CHANGE record again, which has the value "true".
> it requested to restart the sync worker.
> 6. restarted sync worker synchronize the slot again...
>
> For me it works well but it is a bit strange because 1) logical decoding is
> started even when effective_wal_level is false,

I think it's a race condition between the postmaster and the startup
process. It could happen even between a backend and the startup
process: the startup process disables logical decoding right after the
backend passes the CheckLogicalDecodingRequirements() check. I think
it's technically okay since all WAL records before the STATUS_CHANGE
record should have the logical information. Even if it starts logical
decoding, it would end up decoding the STATUS_CHANGE record and fail
with an error (see xlog_decode()).

> and 2) the synced slot is
> dropped once with below message:
>
> ```
> LOG: terminating process 1474448 to release replication slot "test2"
> DETAIL: Logical decoding on standby requires "wal_level" >= "logical" or at least one logical slot on the primary server.
> CONTEXT: WAL redo at 0/030000B8 for XLOG/LOGICAL_DECODING_STATUS_CHANGE: false
> ERROR: canceling statement due to conflict with recovery
> DETAIL: User was using a logical replication slot that must be invalidated.
> ```
>
> Can we stop the sync worker before updating the status? IIUC this is one of the
> solutions.

I think it would lead to another race condition; the slotsync worker
can start again before the status is updated.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
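
For readers following the thread, the race discussed above (a process passes the requirements check just before the startup process disables logical decoding) can be sketched as a single-threaded toy model. This is only an illustration, not code from the patch: `Rec`, `check_requirements()`, `redo()` and `decode()` are hypothetical stand-ins, and the only behavior taken from the message is that decoding is expected to stop with an error once it reaches a STATUS_CHANGE record that disables logical decoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified WAL record kinds (not PostgreSQL's real types). */
typedef enum { REC_DATA, REC_STATUS_CHANGE } RecKind;

typedef struct
{
    RecKind kind;
    bool    enabled;    /* payload, meaningful for REC_STATUS_CHANGE only */
} Rec;

typedef enum { DECODE_OK, DECODE_ERROR } DecodeResult;

/* Stand-in for the status kept in shared memory. */
static bool logical_decoding_enabled = true;

/* Stand-in for the check a process performs before it starts decoding. */
static bool
check_requirements(void)
{
    return logical_decoding_enabled;
}

/* Stand-in for the startup process replaying one record. */
static void
redo(const Rec *rec)
{
    if (rec->kind == REC_STATUS_CHANGE)
        logical_decoding_enabled = rec->enabled;
}

/*
 * Decode records in order. Data records before the status change are
 * decoded normally; a STATUS_CHANGE record that disables logical decoding
 * ends the run with an error. *decoded receives the number of data
 * records processed before stopping.
 */
static DecodeResult
decode(const Rec *recs, size_t nrecs, size_t *decoded)
{
    assert(decoded != NULL);
    *decoded = 0;

    for (size_t i = 0; i < nrecs; i++)
    {
        if (recs[i].kind == REC_STATUS_CHANGE && !recs[i].enabled)
            return DECODE_ERROR;
        if (recs[i].kind == REC_DATA)
            (*decoded)++;
    }
    return DECODE_OK;
}
```

In this model, a caller that runs `check_requirements()`, then lets `redo()` replay the disabling record, and only then calls `decode()` still starts decoding, but it stops with `DECODE_ERROR` at the status-change record after having processed only the records that precede it.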