RE: POC: enable logical decoding when wal_level = 'replica' without a server restart - Mailing list pgsql-hackers

From Hayato Kuroda (Fujitsu)
Subject RE: POC: enable logical decoding when wal_level = 'replica' without a server restart
Date
Msg-id OSCPR01MB14966E989331F1FA7AF06BD9BF53BA@OSCPR01MB14966.jpnprd01.prod.outlook.com
In response to Re: POC: enable logical decoding when wal_level = 'replica' without a server restart  (Masahiko Sawada <sawada.mshk@gmail.com>)
List pgsql-hackers
Dear Sawada-san,

> > Assuming that logical_decoding written in the WAL is false here, and a logical
> > replication slot is created just after that. In my experiments below happened:
> >
> 
> Let me clarify each step:
> 
> > 1. startup process updated logical_decoding_enabled to false, at line 8652.
> 
> I assume that logical_decoding_enabled was enabled before step 1.

Right. Initially, a logical replication slot existed on both the primary and the
standby. In more detail, the standby slot had been created by the slotsync worker.

> > 2. slotsync worker started to sync. Surprisingly, it created a (second) logical
> >    slot and started logical decoding with fast_forward mode.
> 
> I guess that the postmaster launched the slotsync worker before the
> startup changes the status since logical decoding was enabled as I
> mentioned above, which seems fine to me.

As you said, the slotsync worker had already been launched when the status was
changed. My feeling is that a logical slot should not be created after the status
in shared memory has been changed.

> > 3. startup invalidated logical slots due to the wal_level. the slot created at
> >    step2 was automatically dropped, because it was not sync-ready yet.
> > 4. startup process shut down the slotsync worker.
> > 5. startup process read the STATUS_CHANGE record again, which has the value "true".
> >    it requested to restart the sync worker.
> > 6. restarted sync worker synchronized the slot again...
> >
> > For me it works well but it is a bit strange because 1) logical decoding is
> > started even when effective_wal_level is false,
> 
> I think it's a race condition between the postmaster and the startup,
> it could happen even between the backend and the startup; the startup
> disables logical decoding right after the backend passes
> CheckLogicalDecodingRequirements() check. I think it's technically
> okay since all WAL records before the STATUS_CHANGE should have the
> logical information. Even if it starts to do logical decoding, it
> would end up decoding the STATUS_CHANGE record and with an error (see
> xlog_decode()).

To clarify, is your view that this does not need to be fixed because the system
eventually reaches the appropriate state?
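
To make sure we are talking about the same interleaving, here is a simplified,
single-threaded replay of it. This is only a sketch: every name below (the
demo_* functions and the DemoDecodingCtl struct) is a hypothetical stand-in,
not code from the patch, and real processes would of course run concurrently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the flag kept in shared memory. */
typedef struct DemoDecodingCtl
{
    bool        logical_decoding_enabled;
} DemoDecodingCtl;

static DemoDecodingCtl demo_ctl = {true};

/* The "check" done by a backend or the slotsync worker before decoding. */
static bool
demo_check_requirements(void)
{
    return demo_ctl.logical_decoding_enabled;
}

/* What the startup process does when it replays a STATUS_CHANGE record. */
static void
demo_replay_status_change(bool enabled)
{
    demo_ctl.logical_decoding_enabled = enabled;
}

/*
 * Replay the interleaving: the worker's check passes, the startup process
 * then disables logical decoding, and the worker acts on the stale result.
 * Returns true if the worker would proceed while the flag is already false.
 */
static bool
demo_race_window_hit(void)
{
    bool        check_passed;

    demo_replay_status_change(true);            /* initial state: enabled */
    check_passed = demo_check_requirements();   /* worker: check */
    demo_replay_status_change(false);           /* startup: disable */

    /* worker: would create the slot here, based on check_passed */
    return check_passed && !demo_ctl.logical_decoding_enabled;
}
```

In this sketch the worker proceeds even though the flag is already false, which
corresponds to step 2 in my experiment.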

> > and 2) the synced slot is
> > dropped once with below message:
> >
> > ```
> > LOG:  terminating process 1474448 to release replication slot "test2"
> > DETAIL:  Logical decoding on standby requires "wal_level" >= "logical" or at least one logical slot on the primary server.
> > CONTEXT:  WAL redo at 0/030000B8 for XLOG/LOGICAL_DECODING_STATUS_CHANGE: false
> > ERROR:  canceling statement due to conflict with recovery
> > DETAIL:  User was using a logical replication slot that must be invalidated.
> > ```
> >
> > Can we stop the sync worker before updating the status? IIUC this is one
> > possible solution.
> 
> I think it would lead to another race condition; the slotsync worker
> can start again before updating the status.

Hmm, okay.

Another small comment: this variable is not used in other files, so there is no
need to declare it extern.

```
extern LogicalDecodingCtlData *LogicalDecodingCtl;
```
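
Something like the following is what I have in mind. It is only a sketch: the
struct layout, demo_storage, and the two demo_* functions are assumptions for
illustration, and the actual definitions in the patch may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed layout, for illustration only. */
typedef struct LogicalDecodingCtlData
{
    bool        logical_decoding_enabled;
} LogicalDecodingCtlData;

/*
 * 'static' gives the pointer internal linkage, so it is visible only in
 * this file and no extern declaration in a header is required.
 */
static LogicalDecodingCtlData *LogicalDecodingCtl = NULL;

/* Stands in for the shared memory area in this sketch. */
static LogicalDecodingCtlData demo_storage;

/* Hypothetical initialization routine. */
void
demo_ctl_init(void)
{
    demo_storage.logical_decoding_enabled = false;
    LogicalDecodingCtl = &demo_storage;
}

/* Other files would read the state through an accessor like this one. */
bool
demo_is_logical_decoding_enabled(void)
{
    assert(LogicalDecodingCtl != NULL);
    return LogicalDecodingCtl->logical_decoding_enabled;
}
```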

Best regards,
Hayato Kuroda
FUJITSU LIMITED 

