Re: POC: enable logical decoding when wal_level = 'replica' without a server restart - Mailing list pgsql-hackers
From | shveta malik |
---|---|
Subject | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart |
Date | |
Msg-id | CAJpy0uD0Qf5WBZv6-qRqQTP9jEAbLH6NFGi=y4fihMMieKVHAA@mail.gmail.com |
In response to | Re: POC: enable logical decoding when wal_level = 'replica' without a server restart (Masahiko Sawada <sawada.mshk@gmail.com>) |
List | pgsql-hackers |
On Sat, Jun 7, 2025 at 2:44 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Fri, Jun 6, 2025 at 3:02 AM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Wed, Jun 4, 2025 at 3:40 PM shveta malik <shveta.malik@gmail.com> wrote:
> > >
> > > On Wed, Jun 4, 2025 at 6:41 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > > >
> > > > On Tue, May 20, 2025 at 9:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > > > >
> > > > > Yeah, I find the idea that the presence of a logical slot will allow
> > > > > the user to enable logical decoding/replication more appealing than
> > > > > this new alternative, leaving aside the challenges of realizing it.
> > >
> > > +1. This idea appears more user-friendly and easier to understand
> > > compared to other approaches, such as having multiple GUCs or using
> > > ALTER SYSTEM.
> > >
> > > > I've drafted this idea. Here are summary for attached two patches:
> > > >
> > > > 0001 patch allows us to create a logical slot without WAL reservation.
> > > >
> > > > 0002 patch is the main patch for dynamically enabling/disabling
> > > > logical decoding when wal_level is 'replica'.
> > >
> > > Thank You for the patches. I have done some initial testing, it seems
> > > to be working well. I will do more testing and review and will share
> > > further feedback.
> >
> > I reviewed further and had few concerns:
>
> Thank you for reviewing this feature!
>
> > 1)
> > We now invalidate slots on standby if the primary (with
> > wal_level=replica) has dropped the last logical slot and internally
> > reverted its runtime (effective) wal_level back to replica. Consider
> > the following scenario involving a cascaded logical replication setup:
> >
> > a) The publisher is configured with wal_level = replica and has
> > created a publication (pub1).
> > b) A subscriber server creates a subscription (sub1) to pub1. As part
> > of the slot creation for sub1, the publisher's effective wal_level is
> > switched to logical.
> > c) The publisher also has a physical standby, which in turn has its
> > own logical subscriber, named standby_sub1.
> >
> > At this point, everything works as expected i.e. changes from the
> > publisher flow through the physical standby and are replicated to
> > standby_sub1. Now if the user drops sub1, the replication slot on the
> > primary is also dropped. Since this was the last logical slot, the
> > primary automatically switches its effective wal_level back to
> > replica. This change propagates to the standby, causing it to
> > invalidate the slot for standby_sub1. As a result, the standby logs
> > the following error:
> >
> > STATEMENT: START_REPLICATION SLOT "standby_sub1" LOGICAL 0/0 (...)
> > ERROR: logical decoding needs to be enabled on the primary
> >
> > Even if we manually recreate a logical slot on the primary afterward,
> > the standby_sub1 subscriber is not able to proceed:
> > ERROR: can no longer access replication slot "standby_sub1"
> > DETAIL: This replication slot has been invalidated due to
> > "wal_level_insufficient".
> >
> > So the removal of the logical subscriber for the publisher has somehow
> > restricted the logical subscriber of standby to work. Is this
> > behaviour acceptable?
> >
> > Without this feature, if I manually switch back wal_level to replica
> > on primary, then it will fail to start. This makes the issue obvious
> > and prevents misconfiguration.
> > FATAL: logical replication slot "sub2" exists, but "wal_level" < "logical"
> > HINT: Change "wal_level" to be "logical" or higher.
> >
> > But the current behaviour is harder to diagnose, as the problem is
> > effectively hidden behind subscription/slot creation/deletion.
>
> The most upstream server in replication configuration would carefully
> need to keep having at least one logical slot. One way to keep
> effective_wal_level 'logical' on the publisher where wal_level =
> 'replica' is to have a logical slot without WAL reservation that is
> not relevant with any subscriptions. It could require an extra logical
> slot but seems workable. Does it resolve this concern?
>

Yes, I agree that publishers should have a separate slot (not related
to any subscription) without WAL reservation to retain
effective_wal_level as logical when wal_level is replica. But the
question is how that can be ensured. Will it be the user's
responsibility to always create that slot? If the user already has some
subscriptions subscribing to the most upstream server, then while
setting up logical replication on a physical standby at a later stage,
the user will not even encounter the error:

ERROR: logical decoding needs to be enabled on the primary
HINT: Set wal_level >= logical or create at least one logical slot on the primary.

In the absence of such an error, users may well end up in the
situation explained above.

> > 2)
> > 'show effective_wal_level' shows output as 'logical' if a slot exists
> > on primary. But on physical standby, it still shows it as 'replica'
> > even in the presence of slots. Is this intentional?
>
> Yes. I think we should disallow the standbys to create a logical slot
> as long as they use wal_level = 'replica', because otherwise the
> standby would need to invalidate the logical slot at a promotion.
> Which could cause a large down time in a failover case.

Do you mean that even if the primary is running with
effective_wal_level=logical, we shall disallow slot creation on the
standby if the standby has wal_level=replica? Does it mean the
$subject's enhancement is only valid on the primary?

Or the other way could be that we have 2 trigger points for switching
effective_wal_level to logical on the primary:

1) One is when a logical slot is created on the primary.
2) Another is when a logical slot is created on any of its physical
standbys.

We need to maintain these 2 separately, as the drop of the primary's
last slot should not toggle it back to replica when any of its physical
standbys still needs it. But if a publisher has multiple physical
standbys, it will need extra handling, i.e. the last logical-slot drop
on standby1 should not end up toggling effective_wal_level to replica
when standby2 still has some logical slots.

I am trying to think of a way where we have that extra slot without
the user's intervention.

> > 3)
> > I haven’t tested this yet, but I’d like to discuss what the expected
> > behavior should be if a slot exists on the primary but is marked as
> > invalidated. Will an invalidated slot still cause the effective
> > wal_level to remain at logical, or will invalidating the only logical
> > slot trigger a switch back to replica?
> > There is a chance that a slot with un-reserved wal may be invalidated
> > due to time-out.
>
> Good point. I think we don't need to decrease the effective_wal_level
> to 'replica' even if we invalidate all logical slots. We need neither
> WAL reservation nor dead tuple retention in order to set
> effective_wal_level to 'logical' so I think it's straightforward that
> effective_wal_level value depends on only the presence of logical
> slots. If idle_replication_slot_timeout affects also logical slots
> created with immediately_reserve=false, we might want to exclude them
> to avoid confusion.
>

Yes, we shall exclude such slots from timeout-based invalidation, as
there is a chance that once a slot is invalidated, the user may drop it
at any time.

thanks
Shveta
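
For illustration only, the two-trigger idea from the reply to point 2 (tracking logical slots on the primary and on each physical standby separately) can be sketched as a toy model. This is not PostgreSQL code and not taken from the posted patches: the class name `EffectiveWalLevelModel`, its methods, and the node labels are all hypothetical, and how a standby would report its slots to the primary is left out entirely.

```python
from collections import Counter


class EffectiveWalLevelModel:
    """Toy model (hypothetical, not PostgreSQL code) of the primary's
    effective wal_level when logical slots are counted per node."""

    def __init__(self, wal_level="replica"):
        # Configured wal_level of the primary.
        self.wal_level = wal_level
        # Number of logical slots per node, e.g. "primary", "standby1".
        self.logical_slots = Counter()

    def create_slot(self, node):
        self.logical_slots[node] += 1

    def drop_slot(self, node):
        if self.logical_slots[node] == 0:
            raise ValueError(f"no logical slot on {node}")
        self.logical_slots[node] -= 1

    @property
    def effective_wal_level(self):
        # 'logical' if configured so, or if any node still holds a slot.
        if self.wal_level == "logical":
            return "logical"
        if any(count > 0 for count in self.logical_slots.values()):
            return "logical"
        return "replica"


model = EffectiveWalLevelModel()
model.create_slot("primary")    # slot for sub1
model.create_slot("standby1")   # slot for standby_sub1
model.drop_slot("primary")      # user drops sub1
print(model.effective_wal_level)  # logical: standby1 still needs it
model.drop_slot("standby1")
print(model.effective_wal_level)  # replica: no logical slot anywhere
```

With per-node counts, dropping the last slot on one standby does not switch the level back while another standby (or the primary) still has a slot, which is the extra handling the multiple-standby case asks for.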