Re: POC: enable logical decoding when wal_level = 'replica' without a server restart - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
Date | |
Msg-id | CAD21AoCUmrvcb-HgVmuZk=W1j9KKt=JmOFp8H+nXmtOcPcspLg@mail.gmail.com Whole thread Raw |
In response to | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart (Masahiko Sawada <sawada.mshk@gmail.com>) |
List | pgsql-hackers |
On Mon, Oct 13, 2025 at 1:33 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Hi, > > On Fri, Oct 10, 2025 at 1:54 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > On Fri, Oct 10, 2025 at 9:32 AM shveta malik <shveta.malik@gmail.com> wrote: > > > > > > On Thu, Oct 9, 2025 at 4:14 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > Thank you for the comments. > > > > > > > > I've attached the updated patch. As we discussed, it now uses the lazy > > > > deactivation in all cases and has the discussed points in the comments > > > > atop logicalctl.c. > > > > > > > > > > Thank You for making the changes. I am in the process of verifying the > > > patch. Please find one concern. > > Thank you for reviewing the patch! > > > > > > > When I promote a standby, there's a brief period during > > > recovery-completion when status_change_allowed is set to true and also > > > RecoveryInProgress() is still true. If I try to create a logical > > > replication slot on the standby during this window, it fails with the > > > following error: > > > > > > postgres=# SELECT pg_create_logical_replication_slot('slot1', > > > 'pgoutput', false, false, false); > > > ERROR: logical decoding on standby requires "effective_wal_level" >= > > > "logical" on the primary > > > HINT: To enable logical decoding, set "wal_level" >= "logical" or > > > create at least one logical slot when "wal_level" = "replica". > > > > > > Note that the primary already has logical slots. This issue occurs > > > because logical decoding is disabled by > > > UpdateLogicalDecodingStatusEndOfRecovery() due to no existing slots on > > > standby while RecoveryInProgress() is still true. Should > > > CheckLogicalDecodingRequirements() also consider > > > 'status_change_allowed' on standby to handle this transition more > > > gracefully? > > Good point. I think we can check status_change_allowed instead of > RecoveryInProgress(). Fixed the patch accordingly. > > > > > 1) > > I am creating a subscription for publication present on standby, I get > > this error: > > > > postgres=# CREATE subscription sub1 connection 'dbname=postgres > > host=localhost user=shveta port=5434' publication pub1; > > ERROR: could not create replication slot "sub1": ERROR: logical > > decoding on standby requires "effective_wal_level" >= "logical" on the > > primary > > HINT: To enable logical decoding, set "wal_level" >= "logical" or > > create at least one logical slot when "wal_level" = "replica". > > > > Should Hint also mention 'on primary' i.e. 'To enable logical decoding > > on primary, set..' , otherwise it could be slightly confusing to > > follow HINT. > > Fixed. > > > > > 2) > > CheckLogicalDecodingRequirements(): > > > > Is there a reason for moving 'logical decoding on standby requires..' > > error before 'logical decoding requires a database connection' error? > > Shall we keep the same order as earlier? > > Fixed. > > > > > 3) > > GetActiveWalLevelOnStandby() is not being used anywhere now. > > Right. However, it's an exposed function so I'm somewhat hesitant with > removing it. > > > > > 4) > > + if (!old_pending_disable) > > + { > > + volatile PROC_HDR *procglobal = ProcGlobal; > > + ProcNumber checkpointerProc = procglobal->checkpointerProc; > > + > > + /* Wake up the checkpointer */ > > + if (checkpointerProc != INVALID_PROC_NUMBER) > > + SetLatch(&GetPGProcByNumber(checkpointerProc)->procLatch); > > + } > > > > Why are we waking the checkpointer only when pending_disable changes > > from false to true? When this function is called and > > old_pending_disable is still true, wouldn’t that be an even stronger > > reason to wake the checkpointer? > > I guess even if we set the checkpointer's latch multiple times, it > would not help wake it up quickly because the checkpointer might be > just busy with its checkpointing work. Having said that, it would help > simplify the code structure a bit. Given that > RequestDisableLogicalDecoding() function expects to be called only > after dropping the possibly-last logical slot, it would not be harmful > even if we set its latch anyway. And it might help the case where the > checkpointer is temporarily not working and we could not set it latch. > So I took your idea. Thanks! > > > > > 5) > > + /* > > + * Update shmem flags. We don't need to care about the order of setting > > + * global flag and writing the WAL record writes are not allowed yet. > > + */ > > > > Comment is somewhat problematic. > > > > 6) > > > > ReportSlotInvalidation(): > > + appendStringInfoString(&err_detail, _("Logical decoding on standby > > requires \"wal_level\" >= \"logical\" or at least one logical slot on > > the primary server.")); > > > > This message could be a little ambiguous. It is not completely clear > > that that wal_level=logical is needed on standby or on primary. How > > about below msg: > > > > Logical decoding on standby requires the primary server to have either > > wal_level >= logical or at least one logical replication slot created. > > > > 7) > > + "effective_wal_level got decreased to 'replica' on the primary to > > invalidate stnadby's slots" > > > > stnadby --> standby > > Fixed above points. > > > > > 8) > > + # Trigger promotion with no wait for the startup process to reach the > > + # injection point. > > + $standby5->safe_psql('postgres', qq[select pg_promote(false)]); > > + note('promote the standby and waiting for injection_point'); > > + $standby5->wait_for_event('startup', > > + 'startup-logical-decoding-status-change-end-of-recovery'); > > + note( > > + "injection_point > > 'startup-logical-decoding-status-change-end-of-recovery' is reached" > > + ); > > > > Is the comment correct here? We are waiting to reach the injection > > point? But the comment says 'no wait for'? > > I meant to promote the server without wait (i.e., passing false to > pg_promote() function). I modified the comments. > > > > > 9) > > + # Drop the logical slot, requesting to disable logical decoding to > > the checkpointer. > > + # It has to wait for the recovery to complete before disabling > > logical decoding. > > > > Double space in 'It has'. > > > > 10) > > + # Drop the logical slot, requesting to disable logical decoding to > > the checkpointer. > > + # It has to wait for the recovery to complete before disabling > > logical decoding. > > + $standby5->safe_psql('postgres', > > + qq[select pg_drop_replication_slot('standby5_slot');]); > > + > > + # Resume the startup process to complete the recovery. > > + $standby5->safe_psql('postgres', > > + qq[select injection_points_wakeup('startup-logical-decoding-status-change-end-of-recovery')] > > + ); > > > > I think between these 2 steps, we should check logs for 'waiting for > > recovery completion to change logical decoding status'. > > The test will become more clear then. > > Fixed the above points. > > I've attached the updated version patch. I noticed that cfbot tests failed and the patch needs to be updated to the current HEAD. I've attached the updated patch. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
pgsql-hackers by date: