Re: POC: enable logical decoding when wal_level = 'replica' without a server restart - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Date
Msg-id CAD21AoB=Rf-SASOJR2WqvWcrA5Q3S2oUBACVLdJPaA8x6EchBA@mail.gmail.com
In response to RE: POC: enable logical decoding when wal_level = 'replica' without a server restart  ("Hayato Kuroda (Fujitsu)" <kuroda.hayato@fujitsu.com>)
List pgsql-hackers
On Thu, Jul 31, 2025 at 5:00 AM Hayato Kuroda (Fujitsu)
<kuroda.hayato@fujitsu.com> wrote:
>
> Dear Sawada-san,
>
> > I thought we could fix this issue by checking the number of in-use
> > logical slots while holding ReplicationSlotControlLock and
> > LogicalDecodingControlLock, but it seems we need to deal with another
> > race condition too between backends and startup processes at the end
> > of recovery.
> >
> > Currently the backend skips controlling logical decoding status if the
> > server is in recovery (by checking RecoveryInProgress()), but it's
> > possible that a backend process tries to drop a logical slot after the
> > startup process calling UpdateLogicalDecodingStatusEndOfRecovery() and
> > before accepting writes.
>
> Right. I also verified this locally and found that
> ReplicationSlotDropAcquired()->DisableLogicalDecodingIfNecessary() sometimes
> skips modifying the status because RecoveryInProgress() still returns true.
>
> > In this case, the backend ends up not
> > disabling logical decoding and it remains enabled. I think we would
> > somehow need to delay the logical decoding status change in this
> > period until the recovery completes.
>
> My initial idea was to 1) have the startup process hold the lock until the end
> of recovery and 2) have DisableLogicalDecodingIfNecessary() acquire the lock
> before checking the recovery status, but it did not work well. I'm not sure
> why, but WaitForProcSignalBarrier() got stuck when the process held
> LogicalDecodingControlLock....

I don't think it's realistic to keep holding an LWLock until the
recovery actually completes, because we perform a checkpoint after
that.

In the latest version of the patch, attached, I introduced a flag in
shared memory that delays any logical decoding status change until
recovery completes. The implementation got more complex than I
expected, but I don't have a better idea, and I'm open to other
approaches. I also incorporated all the comments I've received so
far[1][2][3] and updated the documentation.
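
To illustrate the idea of delaying the status change, here is a minimal
single-process sketch. It is not code from the patch: the struct, the
field names, and the three functions are all hypothetical, and the real
code would have to take the relevant locks (omitted here) around each
step. It only shows the intended ordering: a disable request that
arrives while the flag is set is remembered and applied once recovery
has completed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the shared-memory state. These names are
 * assumptions for illustration, not the patch's actual definitions.
 */
typedef struct DecodingCtlModel
{
    bool decoding_enabled;  /* is logical decoding currently enabled? */
    bool change_delayed;    /* set while recovery is finishing */
    bool disable_pending;   /* a disable was requested while delayed */
} DecodingCtlModel;

/* Startup process: from here on, status changes must be deferred. */
void
begin_end_of_recovery(DecodingCtlModel *ctl)
{
    assert(ctl != NULL);
    ctl->change_delayed = true;
}

/*
 * Backend: called after dropping a logical slot. Returns true only if
 * logical decoding was disabled immediately.
 */
bool
request_disable(DecodingCtlModel *ctl, int in_use_logical_slots)
{
    assert(ctl != NULL);

    if (ctl->change_delayed)
    {
        /* Recovery is finishing: remember the request, change nothing. */
        ctl->disable_pending = true;
        return false;
    }

    if (in_use_logical_slots == 0 && ctl->decoding_enabled)
    {
        ctl->decoding_enabled = false;
        return true;
    }

    return false;
}

/* Startup process: recovery is complete, so apply any deferred request. */
void
finish_recovery(DecodingCtlModel *ctl, int in_use_logical_slots)
{
    assert(ctl != NULL);

    ctl->change_delayed = false;

    /* Re-check the slot count, since slots may have been created meanwhile. */
    if (ctl->disable_pending && in_use_logical_slots == 0)
        ctl->decoding_enabled = false;

    ctl->disable_pending = false;
}
```

In this model, a slot drop that races with the end of recovery no longer
leaves logical decoding enabled: the request is deferred rather than
skipped, and the slot count is checked again when it is applied.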

Regards,

[1] https://www.postgresql.org/message-id/CALDaNm3BfG1hpWVEaqwBgXpcEGSQXDi536OzB2%3D8SFTz-v%2B3CA%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAJpy0uDxap0YKLx5N45_Vz49QARjioUaOb1qpaiV0PBkYoivRg%40mail.gmail.com
[3] https://www.postgresql.org/message-id/OSCPR01MB149663D242F6E97630758DD6EF55AA%40OSCPR01MB14966.jpnprd01.prod.outlook.com

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
