Re: POC: enable logical decoding when wal_level = 'replica' without a server restart - Mailing list pgsql-hackers
From | Shlok Kyal |
---|---|
Subject | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
Date | |
Msg-id | CANhcyEXVyPS74B+Nmwfa3132agkZEDEv+Cg1xu9fp+5ppKx=Ww@mail.gmail.com |
In response to | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
List | pgsql-hackers |
On Fri, 29 Aug 2025 at 09:38, Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Wed, Aug 27, 2025 at 7:45 PM Hayato Kuroda (Fujitsu)
> <kuroda.hayato@fujitsu.com> wrote:
> >
> > Dear Sawada-san,
> >
> > > > Assuming that logical_decoding written in the WAL is false here, and a logical
> > > > replication slot is created just after that. In my experiments below happened:
> > > >
> > >
> > > Let me clarify each step:
> > >
> > > > 1. startup process updated logical_decoding_enabled to false, at line 8652.
> > >
> > > I assume that logical_decoding_enabled was enabled before step 1.
> >
> > Right. Initially logical replication slot exist on both primary and standby.
> > More detail; the standby slot was created by the slotsync worker.
> >
> > > > 2. slotsync worker started to sync. Surprisingly, it created a (second) logical
> > > > slot and started logical decoding with fast_foward mode.
> > >
> > > I guess that the postmaster launched the slotsync worker before the
> > > startup changes the status since logical decoding was enabled as I
> > > mentioned above, which seems fine to me.
> >
> > As you said, the slotsync worker has already been launched when the status is
> > changed. I felt logical slot() should not be created after the status on the shared
> > memory is changed.
> >
> > > > 3. startup invalidated logical slots due to the wal_level. the slot created at
> > > > step2 was automatically dropped, because it was not sync-readly yet.
> > > > 4. startup process shut down the slotsync worker.
> > > > 5. start process read the STATUS_CHANGE record again, which has the value "true".
> > > > it requested to restart the sync worker.
> > > > 6. restarted sync worker synchronize the slot again...
> > > >
> > > > For me it works well but it is bit a strange because 1) logical decoding is
> > > > started even when effective_wal_level is false,
> > >
> > > I think it's a race condition between the postmaster and the startup,
> > > it could happen even between the backend and the startup; the startup
> > > disables logical decoding right after the backend passes
> > > CheckLogicalDecodingRequirements() check. I think it's technically
> > > okay since all WAL records before the STATUS_CHANGE should have the
> > > logical information. Even if it starts to do logical decoding, it
> > > would end up decoding the STATUS_CHANGE record and with an error (see
> > > xlog_decode()).
>
> My understanding of where the synced slot starts to move was not
> right; it starts from the remote slot's restart_lsn, which could be
> far ahead from the STATUS_CHANGE record that the startup process is
> applying but where logical decoding should be enabled. It doesn't
> happen that the slotsync worker tries to decode non-logical WAL
> records even if it advances the slot after the startup disabled
> logical decoding.
>
> > To clarify, are you thinking that it is no need to be fixed, because eventually
> > the system becomes the appropriate state, right?
>
> IIUC you're concerned it's possible that the slotsync worker creates
> or advances a logical slot between the startup changes the logical
> decoding status to false and sends the stop signal. TBH I have no idea
> how efficiently to fix it. I've considered a simple idea that the
> slotsync worker checks IsLogicalDecodingEnabled() before trying to
> sync one logical slot. However, it doesn't solve the race condition;
> the startup process can disable logical decoding right after the
> slotsync passed the check, in which case users would see the logical
> slot is created after logical decoding is disabled.
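
To make the interleaving described above concrete, here is a toy, single-threaded sketch of the check-then-act window. It is not PostgreSQL code: every `toy_*` / `Toy*` name is made up for illustration, the flag is only a stand-in for the shared logical decoding status, and the "race" is scripted rather than produced by real processes.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only; none of these names exist in PostgreSQL. */
typedef enum ToyInterleaving
{
    TOY_STARTUP_NEVER,        /* startup never disables logical decoding */
    TOY_STARTUP_BEFORE_CHECK, /* startup disables before the worker's check */
    TOY_STARTUP_AFTER_CHECK   /* startup disables between check and create */
} ToyInterleaving;

typedef struct ToyState
{
    bool logical_decoding_enabled;     /* stand-in for the shared status */
    int  slots_created_while_disabled; /* what users would observe */
} ToyState;

/* Hypothetical analogue of the worker's status check. */
static bool
toy_is_logical_decoding_enabled(const ToyState *st)
{
    return st->logical_decoding_enabled;
}

/* Hypothetical analogue of the startup applying STATUS_CHANGE: false. */
static void
toy_startup_disable(ToyState *st)
{
    st->logical_decoding_enabled = false;
}

/* The worker's "act" step; counts slots created after the status flipped. */
static void
toy_worker_create_slot(ToyState *st)
{
    if (!st->logical_decoding_enabled)
        st->slots_created_while_disabled++;
}

/* Run one scripted interleaving and return the number of late slots. */
int
toy_simulate(ToyInterleaving when)
{
    ToyState st = {.logical_decoding_enabled = true,
                   .slots_created_while_disabled = 0};

    assert(when == TOY_STARTUP_NEVER ||
           when == TOY_STARTUP_BEFORE_CHECK ||
           when == TOY_STARTUP_AFTER_CHECK);

    if (when == TOY_STARTUP_BEFORE_CHECK)
        toy_startup_disable(&st);

    if (toy_is_logical_decoding_enabled(&st))   /* check */
    {
        if (when == TOY_STARTUP_AFTER_CHECK)
            toy_startup_disable(&st);           /* startup slips in here */
        toy_worker_create_slot(&st);            /* act */
    }

    return st.slots_created_while_disabled;
}
```

In this model, `toy_simulate(TOY_STARTUP_AFTER_CHECK)` is the only interleaving that returns a non-zero count, which matches the point that a check before syncing narrows the window but does not close it.
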
>
> Another race condition that we might need to deal with is, the
> slotsync worker is launched while logical decoding is still enabled,
> but if the startup sends the stop signal to the slotsync worker before
> the worker sets its pid to SlotSyncCtx->pid, the worker will keep
> running. I've added the check !IsLogicalDecodingEnabled() to the
> slotsync worker's initialization.
>
> >
> > > > and 2) the synced slot is
> > > > dropped once with below message:
> > > >
> > > > ```
> > > > LOG: terminating process 1474448 to release replication slot "test2"
> > > > DETAIL: Logical decoding on standby requires "wal_level" >= "logical" or at least one logical slot on the primary server.
> > > > CONTEXT: WAL redo at 0/030000B8 for XLOG/LOGICAL_DECODING_STATUS_CHANGE: false
> > > > ERROR: canceling statement due to conflict with recovery
> > > > DETAIL: User was using a logical replication slot that must be invalidated.
> > > > ```
> > > >
> > > > Can we stop the sync worker before updating the status? IIUC this is one of the
> > > > solution.
> > >
> > > I think it would lead to another race condition; the slotsync worker
> > > can start again before updating the status.
> >
> > Hmm, okay.
> >
> > Another small comment: this data structure is not used in other files, no need to set extern.
> >
> > ```
> > extern LogicalDecodingCtlData *LogicalDecodingCtl;
> > ```
>
> Removed.
>
> I've attached the updated patch.
>

Hi Sawada-san,

Thanks for the updated patch.

I have a question. When we create a publication while wal_level is set to 'replica', we get a warning:

WARNING: logical decoding needs to be enabled to publish logical changes
HINT: Before creating subscriptions, set "wal_level" >= "logical" or create a logical replication slot when "wal_level" = "replica".

The hint suggests that when wal_level = 'replica', we should create a logical slot on the publisher before creating a subscription.

But when I tested this scenario, I created a subscription without first creating a logical slot on the publisher. The operation was successful, effective_wal_level was set appropriately, and logical replication was working fine.

I think this happens because the CREATE SUBSCRIPTION command itself creates a logical slot on the publisher. Should we update the HINT message here?

Thanks,
Shlok Kyal