Re: Fix slot synchronization with two_phase decoding enabled - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Fix slot synchronization with two_phase decoding enabled |
Date | |
Msg-id | CAD21AoDrmo=pAMo3ZX=zVK1SowaJ=eCbfXa9teSkxgDC2dV8rA@mail.gmail.com Whole thread Raw |
In response to | Re: Fix slot synchronization with two_phase decoding enabled (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: Fix slot synchronization with two_phase decoding enabled
|
List | pgsql-hackers |
On Thu, Apr 24, 2025 at 10:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Apr 24, 2025 at 2:24 AM Nisha Moond <nisha.moond412@gmail.com> wrote: > > > > On Thu, Apr 24, 2025 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Apr 23, 2025 at 11:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Tue, Apr 22, 2025 at 3:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Mon, Apr 21, 2025 at 8:44 AM Zhijie Hou (Fujitsu) > > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > > > On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > > > > > > > > > > > > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > > > > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > ---------- > > > > > > > > Approach 2 > > > > > > > > ---------- > > > > > > > > > > > > > > > > Instead of disallowing the use of two-phase and failover together, a more > > > > > > > > flexible strategy could be only restrict failover for slots with two-phase > > > > > > > > enabled when there's a possibility of existing prepared transactions before > > > > > > > the > > > > > > > > two_phase_at that are not yet replicated. During slot creation with > > > > > > > two-phase > > > > > > > > and failover, we could check for any decoded prepared transactions when > > > > > > > > determining the decoding start point (DecodingContextFindStartpoint). For > > > > > > > > subsequent attempts to alter failover to true, we ensure that two_phase_at is > > > > > > > > less than restart_lsn, indicating that all prepared transactions have been > > > > > > > > committed and replicated, thus the bug would not happen. > > > > > > > > > > > > > > > > pros: > > > > > > > > > > > > > > > > This method minimizes restrictions for users. Especially during slot creation > > > > > > > > with (two_phase=on, failover=on), as it’s uncommon for transactions to > > > > > > > prepare > > > > > > > > during consistent snapshot creation, the restriction becomes almost > > > > > > > > unnoticeable. > > > > > > > > > > > > > > I think this approach can work for the transactions that are prepared > > > > > > > while the slot is created. But if I understand the problem correctly, > > > > > > > while the initial table sync is performing, the slot's two_phase is > > > > > > > still false, so we need to deal with the transactions that are > > > > > > > prepared during the initial table sync too. What do you think? > > > > > > > > > > > > > > > > > > > Yes, I agree that we need to restrict this case too. Given that we haven't > > > > > > started decoding when setting two_phase=true during CreateDecodingContext() > > > > > > after tablesync, we could check prepared transactions afterwards during > > > > > > decoding. This could involve reporting an ERROR when skipping a prepared > > > > > > transaction during decoding if its prepare LSN is less than two_phase_at. > > > > > > > > > > > > > > > > It will make it difficult for users to detect it as this happens at a > > > > > later point of time. > > > > > > > > > > > Alternatively, a simpler method would be to prevent this situation entirely > > > > > > during the CREATE SUBSCRIPTION command. For example, we could restrict slots > > > > > > created with failover set to true and twophase is later modified to true after > > > > > > tablesync. Although the simpler check is more user-visible, it may offer less > > > > > > flexibility. > > > > > > > > > > > > > > > > I agree with your point, but OTOH, I am also afraid of adding too many > > > > > smart checks in the back-branch. If we follow what you say here, then > > > > > users have the following ways in PG17 to enable both failover and > > > > > two_phase. (a) During Create Subscription, users can set both > > > > > 'failover' and 'two_phase', if 'copy_data' is false, or (b), if > > > > > 'copy_data' is true, during Create Subscription, then users can enable > > > > > 'two_phase' and wait for it to be enabled. Then use Alter Subscription > > > > > to set 'failover'. > > > > > > > > Yet another idea would be to disallow enabling both two_phase and > > > > failover at CREATE SUBSCRIPTION regardless of copy_data value and to > > > > add check when enabling failover for the two_phase-enabled-slots. For > > > > example, users who want to enable both need to do two steps: > > > > > > > > 1. CREATE SUBSCRIPTION with two_phase = true and failover = false. > > > > 2. ALTER SUBSCRIPTION with failover = true. > > > > > > > > At ALTER SUBSCRIPTION with failover = true, the subscriber checks if > > > > the two_phase states is ready (and possibly check if the slot's > > > > two_phase has been enabled too), otherwise raises an ERROR. Then, when > > > > the publisher enables the failover for the two_phase-enabled-slot up > > > > on walrcv_alter_slot() request, it checks the slot's restart_lsn has > > > > passed slot's two_phase_at, otherwise raise an error with the message > > > > like "the slot need to consume change upto %X/%X to enable failover". > > > > > > > > > > This should further simplify the checks with an additional restriction > > > during the CREATE SUBSCRIPTION time. I am in favor of it because I > > > want the code in this area to be as much same in HEAD and backbranches > > > as possible. > > > > > > > Please find the updated patch for Approach 3, which implements the > > idea suggested by Swada-san above. > > Thank you for the patch! I think we need to update the documentation > as well. I'll also review the patch shortly so could you please > prepare the documentation changes? Here are reviews comments on v7 patch: + /* + * Do not allow users to enable the failover and two_phase options + * together. + * + * See comments atop the similar check in ReplicationSlotCreate() for a + * detailed reason. + */ + if (opts.twophase && opts.failover) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("\"%s\" and \"%s\" are mutually exclusive options", + "failover", "two_phase")); I think we can reword the error message to something like "cannot enable both failover and two_phase options at CREATE SUBSCRIPTION", and we can give a errhint that failover can be enabled after two_phase state is ready. --- + if (two_phase && !IsSyncingReplicationSlots()) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("\"%s\" and \"%s\" are mutually exclusive options", + "failover", "two_phase")); Similar to the above comment, how about rewriting it to something like "cannot enable both failover and two_phase during replication slot creation". I realized that users who create a logical slot using pg_create_logical_replication_slot() would not be able to enable both options at slot creation, and there is no easy way to enable the failover after two_phase-enabled-slot creation. Users would need to use ALTER_REPLICATION_SLOT replication command, which seems unrealistics for users to use. On the other hand, if we allow creating a logical slot with enabling failover and two_phase using SQL API, there is still a chance for this bug to occur. Would it be worth considering that if a logical slot is created with enabling failover and two_phase using SQL API, we create the slot with only two_phase=true, then advance the slot until the slot satisfies restart_lsn >= two_phase_at, and then enable the failover? --- + if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING && + opts.failover) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot enable failover for a subscription with a pending two_phase state")); + I think ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE is a more appropriate error code. --- + if (failover && MyReplicationSlot->data.two_phase && + MyReplicationSlot->data.restart_lsn < MyReplicationSlot->data.two_phase_at) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot enable failover for a two-phase enabled replication slot"), + errdetail("The slot need to consume change upto %X/%X to enable failover.", + LSN_FORMAT_ARGS(MyReplicationSlot->data.two_phase_at))); Same as above comment with regards to the error code. --- We need regression tests for replication slot creation SQL API too. --- +# Confirm that the failover flag on the slot has been turned on +is( $publisher->safe_psql( + 'postgres', + q{SELECT failover from pg_replication_slots WHERE slot_name = 'lsub1_slot';} + ), + "t", + 'logical slot has failover true on the publisher'); It should check 'lsub2_slot' instead. --- +# Create a subscription with two_phase enabled +$subscriber1->safe_psql('postgres', + "CREATE SUBSCRIPTION regress_mysub2 CONNECTION '$publisher_connstr' PUBLICATION regress_mypub WITH (create_slot = false, copy_data = false, enabled = false, two_phase = true);" +); + +# Enable failover for the subscription with two_phase in pending state +($result, $stdout, $stderr) = $subscriber1->psql('postgres', + "ALTER SUBSCRIPTION regress_mysub2 SET (failover = true)"); +ok( $stderr =~ /ERROR: cannot enable failover for a subscription with a pending two_phase state/, + "Enabling failover is not allowed for a two_phase pending subscription"); + I think it would be good to have more tests where we make the two_phase state ready and execute 'ALTER SUBSCRIPTION SET (failover = true)' successfully. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
pgsql-hackers by date: