Thread: Fix slot synchronization with two_phase decoding enabled
Hi, When testing the slot synchronization with logical replication slots that enabled two_phase decoding, I found that transactions prepared before two-phase decoding is enabled may fail to replicate to the subscriber after being committed on a promoted standby following a failover. To reproduce this issue, please follow these steps (also detailed in the attached TAP test, v1-0001): 1. sub: create a subscription with (two_phase = false) 2. primary (pub): prepare a txn A. 3. sub: alter subscription set (two_phase = true) and wait for the logical slot to be synced to standby. 4. primary (pub): stop primary, promote the standby and let the subscriber use the promoted standby as publisher. 5. promoted standby (pub): COMMIT PREPARED A; 6. sub: the apply worker will report the following ERROR because it didn't receive the PREPARE. ERROR: prepared transaction with identifier "pg_gid_16387_752" does not exist I think the root cause of this issue is that the two_phase_at field of the slot, which indicates the LSN from which two-phase decoding is enabled (used to prevent duplicate data transmission for prepared transactions), is not synchronized to the standby server. In step 3, transaction A is not immediately replicated because it occurred before enabling two-phase decoding. Thus, the prepared transaction should only be replicated after decoding the final COMMIT PREPARED, as referenced in ReorderBufferFinishPrepared(). However, due to the invalid two_phase_at on the standby, the prepared transaction fails to send at that time. This problem arises after the support for altering the two-phase option (1462aad). Previously, two-phase was only enabled during slot creation, which wait for all prepared transactions to finish (via ... -> SnapBuildWaitSnapshot) before reaching a consistent state, so the bug didn't exist. To address the issue, I propose synchronizing the two_phase_at field to the standby server, as implemented in the attached patches. As mentioned earlier,this bug exists only for PG18 so we do not need to back patch. v1-0001: Tap test to reproduce the issue I place this patch as the first one so that reviewers can run it independently to reproduce the issue. Once the problem is thoroughly understood and the fix proves stable, the patches can be integrated. v1-0002: Display two_phase_at in the pg_replication_slots view v1-0003: Sync the two_phase_at field of a replication slot to the standby An alternative approach might be modifying ALTER_REPLICATION_SLOT to wait for all prepared transactions to commit when enabling two-phase. However, this appears inelegant and less user friendly. Best Regards, Hou zj
Attachment
On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Hi, > > When testing the slot synchronization with logical replication slots that > enabled two_phase decoding, I found that transactions prepared before two-phase > decoding is enabled may fail to replicate to the subscriber after being > committed on a promoted standby following a failover. > > To reproduce this issue, please follow these steps (also detailed in the > attached TAP test, v1-0001): > > 1. sub: create a subscription with (two_phase = false) > 2. primary (pub): prepare a txn A. > 3. sub: alter subscription set (two_phase = true) and wait for the logical slot to > be synced to standby. > 4. primary (pub): stop primary, promote the standby and let the subscriber use > the promoted standby as publisher. > 5. promoted standby (pub): COMMIT PREPARED A; > 6. sub: the apply worker will report the following ERROR because it didn't > receive the PREPARE. > ERROR: prepared transaction with identifier "pg_gid_16387_752" does not exist > > I think the root cause of this issue is that the two_phase_at field of the > slot, which indicates the LSN from which two-phase decoding is enabled (used to > prevent duplicate data transmission for prepared transactions), is not > synchronized to the standby server. > > In step 3, transaction A is not immediately replicated because it occurred > before enabling two-phase decoding. Thus, the prepared transaction should only > be replicated after decoding the final COMMIT PREPARED, as referenced in > ReorderBufferFinishPrepared(). However, due to the invalid two_phase_at on the > standby, the prepared transaction fails to send at that time. > > This problem arises after the support for altering the two-phase option > (1462aad). > Thanks for the report and patch. I'll look into it. -- With Regards, Amit Kapila.
On Tue, Mar 25, 2025 at 12:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > Hi, > > > > When testing the slot synchronization with logical replication slots that > > enabled two_phase decoding, I found that transactions prepared before two-phase > > decoding is enabled may fail to replicate to the subscriber after being > > committed on a promoted standby following a failover. > > > > To reproduce this issue, please follow these steps (also detailed in the > > attached TAP test, v1-0001): > > > > 1. sub: create a subscription with (two_phase = false) > > 2. primary (pub): prepare a txn A. > > 3. sub: alter subscription set (two_phase = true) and wait for the logical slot to > > be synced to standby. > > 4. primary (pub): stop primary, promote the standby and let the subscriber use > > the promoted standby as publisher. > > 5. promoted standby (pub): COMMIT PREPARED A; > > 6. sub: the apply worker will report the following ERROR because it didn't > > receive the PREPARE. > > ERROR: prepared transaction with identifier "pg_gid_16387_752" does not exist > > > > I think the root cause of this issue is that the two_phase_at field of the > > slot, which indicates the LSN from which two-phase decoding is enabled (used to > > prevent duplicate data transmission for prepared transactions), is not > > synchronized to the standby server. > > > > In step 3, transaction A is not immediately replicated because it occurred > > before enabling two-phase decoding. Thus, the prepared transaction should only > > be replicated after decoding the final COMMIT PREPARED, as referenced in > > ReorderBufferFinishPrepared(). However, due to the invalid two_phase_at on the > > standby, the prepared transaction fails to send at that time. > > > > This problem arises after the support for altering the two-phase option > > (1462aad). > > I suspect that this can happen in PG17 as well, but I need to think more about it to make a reproducible test case. In the meantime, I have a few minor comments on the proposed patches: 1. ################################################## # Promote the standby1 to primary. Confirm that: # a) the slot 'lsub1_slot' and 'snap_test_slot' are retained on the new primary # b) logical replication for regress_mysub1 is resumed successfully after failover # c) changes can be consumed from the synced slot 'snap_test_slot' ################################################## -$standby1->start; $primary->wait_for_replay_catchup($standby1); # Capture the time before the standby is promoted @@ -885,6 +940,15 @@ $standby1->wait_for_catchup('regress_mysub1'); is($subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}), "20", 'data replicated from the new primary'); +# Commit the prepared transaction +$standby1->safe_psql('postgres', + "COMMIT PREPARED 'test_twophase_slotsync';"); +$standby1->wait_for_catchup('regress_mysub1'); + +# Confirm that the prepared transaction is replicated to the subscriber +is($subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}), + "21", 'prepared data replicated from the new primary'); The commentary above this test should include information about verifying the replication of previously prepared transactions after promotion. Also, it would be better if confirm the commit prepared before confirming the new Insert is replicated after promotion. 2. @@ -249,6 +250,7 @@ update_local_synced_slot(RemoteSlot *remote_slot, Oid remote_dbid, SpinLockAcquire(&slot->mutex); slot->data.restart_lsn = remote_slot->restart_lsn; slot->data.confirmed_flush = remote_slot->confirmed_lsn; + slot->data.two_phase_at = remote_slot->two_phase_at; Why do we need to update the two_phase_at here when the patch does it later in this function when local and remote values don't match? -- With Regards, Amit Kapila.
On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote: > > On Tue, Mar 25, 2025 at 12:1 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > Hi, > > > > > > When testing the slot synchronization with logical replication slots that > > > enabled two_phase decoding, I found that transactions prepared before > two-phase > > > decoding is enabled may fail to replicate to the subscriber after being > > > committed on a promoted standby following a failover. > > > > > > To reproduce this issue, please follow these steps (also detailed in the > > > attached TAP test, v1-0001): > > > > > > 1. sub: create a subscription with (two_phase = false) > > > 2. primary (pub): prepare a txn A. > > > 3. sub: alter subscription set (two_phase = true) and wait for the logical > slot to > > > be synced to standby. > > > 4. primary (pub): stop primary, promote the standby and let the subscriber > use > > > the promoted standby as publisher. > > > 5. promoted standby (pub): COMMIT PREPARED A; > > > 6. sub: the apply worker will report the following ERROR because it didn't > > > receive the PREPARE. > > > ERROR: prepared transaction with identifier "pg_gid_16387_752" > does not exist > > > > > > I think the root cause of this issue is that the two_phase_at field of the > > > slot, which indicates the LSN from which two-phase decoding is enabled > (used to > > > prevent duplicate data transmission for prepared transactions), is not > > > synchronized to the standby server. > > > > > > In step 3, transaction A is not immediately replicated because it occurred > > > before enabling two-phase decoding. Thus, the prepared transaction > should only > > > be replicated after decoding the final COMMIT PREPARED, as referenced > in > > > ReorderBufferFinishPrepared(). However, due to the invalid two_phase_at > on the > > > standby, the prepared transaction fails to send at that time. > > > > > > This problem arises after the support for altering the two-phase option > > > (1462aad). > > > > > I suspect that this can happen in PG17 as well, but I need to think > more about it to make a reproducible test case. After further analysis, I was able to reproduce the same issue [1] in PG 17. However, since the proposed fix requires catalog changes and the issue is not a security risk significant enough to justify changing the catalog in back branches, we cannot back-patch the same solution. Following off-list discussions with Amit and Kuroda-san, we are considering disallowing enabling failover and two-phase decoding together for a replication slot, as suggested in attachment 0002. Another idea considered is to prevent the slot that enables two-phase decoding from being synced to standby. IOW, this means displaying the failover field as false in the view, if there is any possibility that transactions prepared before the two_phase_at position exist (e.g., if restart_lsn is less than two_phase_at). However, implementing this change would require additional explanations to users for this new behavior, which seems tricky. > > In the meantime, I have a few minor comments on the proposed patches: > 1. > ################################################## > # Promote the standby1 to primary. Confirm that: > # a) the slot 'lsub1_slot' and 'snap_test_slot' are retained on the new primary > # b) logical replication for regress_mysub1 is resumed successfully > after failover > # c) changes can be consumed from the synced slot 'snap_test_slot' > ################################################## > -$standby1->start; > $primary->wait_for_replay_catchup($standby1); > > # Capture the time before the standby is promoted > @@ -885,6 +940,15 @@ $standby1->wait_for_catchup('regress_mysub1'); > is($subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}), > "20", 'data replicated from the new primary'); > > +# Commit the prepared transaction > +$standby1->safe_psql('postgres', > + "COMMIT PREPARED 'test_twophase_slotsync';"); > +$standby1->wait_for_catchup('regress_mysub1'); > + > +# Confirm that the prepared transaction is replicated to the subscriber > +is($subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}), > + "21", 'prepared data replicated from the new primary'); > > The commentary above this test should include information about > verifying the replication of previously prepared transactions after > promotion. Also, it would be better if confirm the commit prepared > before confirming the new Insert is replicated after promotion. > > 2. > @@ -249,6 +250,7 @@ update_local_synced_slot(RemoteSlot *remote_slot, > Oid remote_dbid, > SpinLockAcquire(&slot->mutex); > slot->data.restart_lsn = remote_slot->restart_lsn; > slot->data.confirmed_flush = remote_slot->confirmed_lsn; > + slot->data.two_phase_at = remote_slot->two_phase_at; > > Why do we need to update the two_phase_at here when the patch does it > later in this function when local and remote values don't match? Thanks for the comments, they have been addressed in V2. [1] - pub: created a slot 'sub' with two_phase=false, then prepared a transaction - pub: after some activity, advanced the confirmed_flush_lsn of 'sub', so it is greater than prepared txn lsn. - sub: create subscription with (slot_name='sub', create_slot=false, failover = true, two_phase=true, copy_data=false); two_phase_at will be set to the same as confirmed_flush_lsn which is greater than the prepared transaction. - stop the primary and promote the standby. - commit the prepared transaction on standby, the following error will be reported on subscriber: LOG: logical replication apply worker for subscription "sub2" has started ERROR: prepared transaction with identifier "pg_gid_16398_764" does not exist. Best Regards, Hou zj
Attachment
On Mon, Mar 31, 2025 at 5:04 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote: > > > > > I suspect that this can happen in PG17 as well, but I need to think > > more about it to make a reproducible test case. > > After further analysis, I was able to reproduce the same issue [1] in > PG 17. > > However, since the proposed fix requires catalog changes and the issue is not a > security risk significant enough to justify changing the catalog in back > branches, we cannot back-patch the same solution. > Agreed. In the past, as in commit b6e39ca92e, we have backported a catalog-modifying commit, but that is for a CVE. Users need to follow manual steps as explained in 9.6.4 release notes [1], which would be cumbersome for them. This is not a security issue, so we shouldn't backpatch a catalog modifying commit following past. > Following off-list > discussions with Amit and Kuroda-san, we are considering disallowing enabling > failover and two-phase decoding together for a replication slot, as suggested > in attachment 0002. > > Another idea considered is to prevent the slot that enables two-phase decoding > from being synced to standby. IOW, this means displaying the failover field as > false in the view, if there is any possibility that transactions prepared > before the two_phase_at position exist (e.g., if restart_lsn is less than > two_phase_at). However, implementing this change would require additional > explanations to users for this new behavior, which seems tricky. > I find it tricky to explain to users. We need to say that sometimes the slots won't be synced even if the failover is set to true. Users can verify that by checking slot properties on the publisher. Also, on the subscriber, the failover flag in the subscription may still be true unless we do more engineering to make it false. So, I prefer to simply disallow setting failover and two_phase together. We need to recommend to users in release notes for 17 that they need to disable failover for subscriptions where two_phase is enabled or re-create the subscriptions with two_phase=false and failover=true. Users may not like it, but I think it is better than getting a complaint that after promotion of standby the data to subscriber is not getting replicated. [1] - https://www.postgresql.org/docs/9.6/release-9-6-4.html -- With Regards, Amit Kapila.
On Mon, Mar 31, 2025 at 5:04 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Thanks for the comments, they have been addressed in V2. > In the PG17 patch, we only have a check preventing failover and two_phase in CreateSubscription. Don't we need a similar check for AlterSubscription? Apart from the above, I have a few suggestions for changes in docs, error messages, and comments for both versions. See attached. -- With Regards, Amit Kapila.
Attachment
On Tue, Apr 1, 2025 at 2:09 PM Amit Kapila wrote: > > On Mon, Mar 31, 2025 at 5:0 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > Thanks for the comments, they have been addressed in V2. > > > > In the PG17 patch, we only have a check preventing failover and > two_phase in CreateSubscription. Don't we need a similar check for AlterSubscription? > > Apart from the above, I have a few suggestions for changes in docs, > error messages, and comments for both versions. See attached. Thanks for the comments. Here is the V3 patch set which addressed all the comments. I also added a testcase for ALTER SUB in 0002 as suggested by Kuroda-san off-list. Additionally, I found that a test failed in PG17 following this patch because it involved creating a subscription with both failover and two-phase enabled. Since that test was designed to test the upgrade of replication slots and is not directly related to failover functionality, I decided to disable the failover option for test case. Best Regards, Hou zj
Attachment
On Tue, Apr 1, 2025 at 4:28 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Here is the V3 patch set which addressed all the comments. > Comment 0n 0001 <literal>NULL</literal> for logical slots where + <structfield>two_phase</structfield> is false and physical slots. + </para></entry> change above to: <literal>NULL</literal> for logical slots where <structfield>two_phase</structfield> is false and for physical slots. Comment on 0002 +# Create a subscription with two_phase enabled +$subscriber1->safe_psql('postgres', + "CREATE SUBSCRIPTION regress_mysub2 CONNECTION '$publisher_connstr' PUBLICATION regress_mypub WITH (slot_name = lsub1_slot, create_slot = false, enabled = false, two_phase = true);" +); + +# Enable failover for the subscription +($result, $stdout, $stderr) = $subscriber1->psql('postgres', + "ALTER SUBSCRIPTION regress_mysub2 SET (failover = true)"); +ok( $stderr =~ /ERROR: cannot enable failover for a two_phase enabled subscription/, + "Enabling failover is not allowed for a two_phase enabled subscription"); Is there a need for this test to be in .pl file? Can't we add it in .sql file? Apart from the above, I have made minor modifications to the PG17 patch in the attached. -- With Regards, Amit Kapila.
Attachment
On Wed, Apr 2, 2025 at 12:41 PM Amit Kapila wrote: > > On Tue, Apr 1, 2025 at 4:28 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> > wrote: > > > > Here is the V3 patch set which addressed all the comments. > > > > Comment 0n 0001 > <literal>NULL</literal> for logical slots where > + <structfield>two_phase</structfield> is false and physical slots. > + </para></entry> > > change above to: > <literal>NULL</literal> for logical slots where > <structfield>two_phase</structfield> is false and for physical slots. Changed. > > Comment on 0002 > +# Create a subscription with two_phase enabled > +$subscriber1->safe_psql('postgres', > + "CREATE SUBSCRIPTION regress_mysub2 CONNECTION > '$publisher_connstr' > PUBLICATION regress_mypub WITH (slot_name = lsub1_slot, create_slot = > false, enabled = false, two_phase = true);" > +); > + > +# Enable failover for the subscription ($result, $stdout, $stderr) = > +$subscriber1->psql('postgres', "ALTER SUBSCRIPTION regress_mysub2 > +SET (failover = true)"); ok( $stderr =~ > +/ERROR: cannot enable failover for a two_phase > enabled subscription/, > + "Enabling failover is not allowed for a two_phase enabled > + subscription"); > > Is there a need for this test to be in .pl file? Can't we add it in .sql file? Right, I moved the test into subscription.sql > Apart from the above, I have made minor modifications to the PG17 > patch in the attached. Here is V4 patch set which addressed above comments and passed pgindent test. Best Regards, Hou zj
Attachment
On Mon, Mar 31, 2025 at 4:34 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote: > > > > > On Tue, Mar 25, 2025 at 12:1 PM Amit Kapila <amit.kapila16@gmail.com> > > wrote: > > > > > > On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > Hi, > > > > > > > > When testing the slot synchronization with logical replication slots that > > > > enabled two_phase decoding, I found that transactions prepared before > > two-phase > > > > decoding is enabled may fail to replicate to the subscriber after being > > > > committed on a promoted standby following a failover. > > > > > > > > To reproduce this issue, please follow these steps (also detailed in the > > > > attached TAP test, v1-0001): > > > > > > > > 1. sub: create a subscription with (two_phase = false) > > > > 2. primary (pub): prepare a txn A. > > > > 3. sub: alter subscription set (two_phase = true) and wait for the logical > > slot to > > > > be synced to standby. > > > > 4. primary (pub): stop primary, promote the standby and let the subscriber > > use > > > > the promoted standby as publisher. > > > > 5. promoted standby (pub): COMMIT PREPARED A; > > > > 6. sub: the apply worker will report the following ERROR because it didn't > > > > receive the PREPARE. > > > > ERROR: prepared transaction with identifier "pg_gid_16387_752" > > does not exist > > > > > > > > I think the root cause of this issue is that the two_phase_at field of the > > > > slot, which indicates the LSN from which two-phase decoding is enabled > > (used to > > > > prevent duplicate data transmission for prepared transactions), is not > > > > synchronized to the standby server. > > > > > > > > In step 3, transaction A is not immediately replicated because it occurred > > > > before enabling two-phase decoding. Thus, the prepared transaction > > should only > > > > be replicated after decoding the final COMMIT PREPARED, as referenced > > in > > > > ReorderBufferFinishPrepared(). However, due to the invalid two_phase_at > > on the > > > > standby, the prepared transaction fails to send at that time. > > > > > > > > This problem arises after the support for altering the two-phase option > > > > (1462aad). > > > > > > > > I suspect that this can happen in PG17 as well, but I need to think > > more about it to make a reproducible test case. > > After further analysis, I was able to reproduce the same issue [1] in > PG 17. > > [1] > - pub: created a slot 'sub' with two_phase=false, then prepared a transaction > - pub: after some activity, advanced the confirmed_flush_lsn of 'sub', so it is > greater than prepared txn lsn. > - sub: create subscription with (slot_name='sub', create_slot=false, failover = > true, two_phase=true, copy_data=false); two_phase_at will be set to the same > as confirmed_flush_lsn which is greater than the prepared transaction. > - stop the primary and promote the standby. > - commit the prepared transaction on standby, the following error will be > reported on subscriber: It seems to require elaborate steps to reproduce this issue in v17. I wonder if we could somehow narrow down the cases that we want to prohibit. The patch for v17 disallows CREATE SUBSCRIPTION to enable both two_phase and failover, but I guess that it's still safe if it also creates the replication slot (e.g., create_slot is true). If my understanding is right, we can allow users to specify both fields if CRETE SUBSCRIPTION creates the slot, and we don't need to disallow that in ReplicationSlotCreate(). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Apr 2, 2025 at 3:45 PM Masahiko Sawada wrote: Hi, > > On Mon, Mar 31, 2025 at 4:3 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote: > > > > > > > > On Tue, Mar 25, 2025 at 12:1 PM Amit Kapila > > > <amit.kapila16@gmail.com> > > > wrote: > > > > > > > > On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu) > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > Hi, > > > > > > > > > > When testing the slot synchronization with logical replication > > > > > slots that enabled two_phase decoding, I found that transactions > > > > > prepared before > > > two-phase > > > > > decoding is enabled may fail to replicate to the subscriber > > > > > after being committed on a promoted standby following a failover. > > > > > > > > > > To reproduce this issue, please follow these steps (also > > > > > detailed in the attached TAP test, v1-0001): > > > > > > > > > > 1. sub: create a subscription with (two_phase = false) 2. > > > > > primary (pub): prepare a txn A. > > > > > 3. sub: alter subscription set (two_phase = true) and wait for > > > > > the logical > > > slot to > > > > > be synced to standby. > > > > > 4. primary (pub): stop primary, promote the standby and let the > > > > > subscriber > > > use > > > > > the promoted standby as publisher. > > > > > 5. promoted standby (pub): COMMIT PREPARED A; 6. sub: the apply > > > > > worker will report the following ERROR because it didn't > > > > > receive the PREPARE. > > > > > ERROR: prepared transaction with identifier "pg_gid_16387_752" > > > does not exist > > > > > > > > > > I think the root cause of this issue is that the two_phase_at > > > > > field of the slot, which indicates the LSN from which two-phase > > > > > decoding is enabled > > > (used to > > > > > prevent duplicate data transmission for prepared transactions), > > > > > is not synchronized to the standby server. > > > > > > > > > > In step 3, transaction A is not immediately replicated because > > > > > it occurred before enabling two-phase decoding. Thus, the > > > > > prepared transaction > > > should only > > > > > be replicated after decoding the final COMMIT PREPARED, as > > > > > referenced > > > in > > > > > ReorderBufferFinishPrepared(). However, due to the invalid > > > > > two_phase_at > > > on the > > > > > standby, the prepared transaction fails to send at that time. > > > > > > > > > > This problem arises after the support for altering the two-phase > > > > > option (1462aad). > > > > > > > > > > > I suspect that this can happen in PG17 as well, but I need to think > > > more about it to make a reproducible test case. > > > > After further analysis, I was able to reproduce the same issue [1] in > > PG 17. > > > > [1] > > - pub: created a slot 'sub' with two_phase=false, then prepared a > > transaction > > - pub: after some activity, advanced the confirmed_flush_lsn of 'sub', so it is > > greater than prepared txn lsn. > > - sub: create subscription with (slot_name='sub', create_slot=false, > > failover = true, two_phase=true, copy_data=false); two_phase_at will > > be set to the same as confirmed_flush_lsn which is greater than the > prepared transaction. > > - stop the primary and promote the standby. > > - commit the prepared transaction on standby, the following error will be > > reported on subscriber: > > It seems to require elaborate steps to reproduce this issue in v17. I wonder if we > could somehow narrow down the cases that we want to prohibit. The patch for > v17 disallows CREATE SUBSCRIPTION to enable both two_phase and failover, > but I guess that it's still safe if it also creates the replication slot (e.g., > create_slot is true). If my understanding is right, we can allow users to specify > both fields if CRETE SUBSCRIPTION creates the slot, and we don't need to > disallow that in ReplicationSlotCreate(). Thanks for reviewing the steps. The current reproducer aims for simplicity; however, I think it's possible to reproduce the issue even with create_slot = true, although it requires the help of a debugger and additional steps. But as long as there are transactions prepared before the two_phase_at position, and they are skipped due to checks in ReorderBufferFinishPrepared() (refer to comments[1] for why we skip sending prepared transaction), the issue can be reproduced. For instance, when creating a subscription with (copy_data=true, failover=true, two_phase=true), the slot's two_phase setting starts as false and shifts to true after table sync (refer to [2] for related comments). During this period, if a user prepares a transaction where the prepare LSN is less than the two_phase_at, the same problem could happen. Similarly, when setting up a subscription with (copy_data=false, failover=true, two_phase=true), although two_phase is initially set to true and we wait for running transactions to finish when building consistent snapshot, a race condition may still exist: If the snapshot state reaches FULL_SNAPSHOT, it won't check running transactions further (see SnapBuildFindSnapshot() for specifics), if a user prepares a transaction at this point, it's possible for the prepare LSN to be less than the LSN marking the snapshot's consistent state, causing the same issue. [1] /* * It is possible that this transaction is not decoded at prepare time * either because by that time we didn't have a consistent snapshot, or * two_phase was not enabled, or it was decoded earlier but we have * restarted. We only need to send the prepare if it was not decoded * earlier. We don't need to decode the xact for aborts if it is not done * already. */ [2] /* * Even if two_phase is set, don't create the slot with * two-phase enabled. Will enable it once all the tables are * synced and ready. This avoids race-conditions like prepared * transactions being skipped due to changes not being applied * due to checks in should_apply_changes_for_rel() when * tablesync for the corresponding tables are in progress. See * comments atop worker.c. * * Note that if tables were specified but copy_data is false * then it is safe to enable two_phase up-front because those * tables are already initially in READY state. When the * subscription has no tables, we leave the twophase state as * PENDING, to allow ALTER SUBSCRIPTION ... REFRESH * PUBLICATION to work. */ Best Regards, Hou zj
On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Wed, Apr 2, 2025 at 3:45 PM Masahiko Sawada wrote: > > Hi, > > > > > On Mon, Mar 31, 2025 at 4:3 AM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > On Thu, Mar 27, 2025 at 2:29 PM Amit Kapila wrote: > > > > > > > > > > > On Tue, Mar 25, 2025 at 12:1 PM Amit Kapila > > > > <amit.kapila16@gmail.com> > > > > wrote: > > > > > > > > > > On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu) > > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > > > Hi, > > > > > > > > > > > > When testing the slot synchronization with logical replication > > > > > > slots that enabled two_phase decoding, I found that transactions > > > > > > prepared before > > > > two-phase > > > > > > decoding is enabled may fail to replicate to the subscriber > > > > > > after being committed on a promoted standby following a failover. > > > > > > > > > > > > To reproduce this issue, please follow these steps (also > > > > > > detailed in the attached TAP test, v1-0001): > > > > > > > > > > > > 1. sub: create a subscription with (two_phase = false) 2. > > > > > > primary (pub): prepare a txn A. > > > > > > 3. sub: alter subscription set (two_phase = true) and wait for > > > > > > the logical > > > > slot to > > > > > > be synced to standby. > > > > > > 4. primary (pub): stop primary, promote the standby and let the > > > > > > subscriber > > > > use > > > > > > the promoted standby as publisher. > > > > > > 5. promoted standby (pub): COMMIT PREPARED A; 6. sub: the apply > > > > > > worker will report the following ERROR because it didn't > > > > > > receive the PREPARE. > > > > > > ERROR: prepared transaction with identifier "pg_gid_16387_752" > > > > does not exist > > > > > > > > > > > > I think the root cause of this issue is that the two_phase_at > > > > > > field of the slot, which indicates the LSN from which two-phase > > > > > > decoding is enabled > > > > (used to > > > > > > prevent duplicate data transmission for prepared transactions), > > > > > > is not synchronized to the standby server. > > > > > > > > > > > > In step 3, transaction A is not immediately replicated because > > > > > > it occurred before enabling two-phase decoding. Thus, the > > > > > > prepared transaction > > > > should only > > > > > > be replicated after decoding the final COMMIT PREPARED, as > > > > > > referenced > > > > in > > > > > > ReorderBufferFinishPrepared(). However, due to the invalid > > > > > > two_phase_at > > > > on the > > > > > > standby, the prepared transaction fails to send at that time. > > > > > > > > > > > > This problem arises after the support for altering the two-phase > > > > > > option (1462aad). > > > > > > > > > > > > > > I suspect that this can happen in PG17 as well, but I need to think > > > > more about it to make a reproducible test case. > > > > > > After further analysis, I was able to reproduce the same issue [1] in > > > PG 17. > > > > > > [1] > > > - pub: created a slot 'sub' with two_phase=false, then prepared a > > > transaction > > > - pub: after some activity, advanced the confirmed_flush_lsn of 'sub', so it is > > > greater than prepared txn lsn. > > > - sub: create subscription with (slot_name='sub', create_slot=false, > > > failover = true, two_phase=true, copy_data=false); two_phase_at will > > > be set to the same as confirmed_flush_lsn which is greater than the > > prepared transaction. > > > - stop the primary and promote the standby. > > > - commit the prepared transaction on standby, the following error will be > > > reported on subscriber: > > > > It seems to require elaborate steps to reproduce this issue in v17. I wonder if we > > could somehow narrow down the cases that we want to prohibit. The patch for > > v17 disallows CREATE SUBSCRIPTION to enable both two_phase and failover, > > but I guess that it's still safe if it also creates the replication slot (e.g., > > create_slot is true). If my understanding is right, we can allow users to specify > > both fields if CRETE SUBSCRIPTION creates the slot, and we don't need to > > disallow that in ReplicationSlotCreate(). > > Thanks for reviewing the steps. > > The current reproducer aims for simplicity; however, I think it's possible to reproduce > the issue even with create_slot = true, although it requires the help of a > debugger and additional steps. But as long as there are transactions prepared > before the two_phase_at position, and they are skipped due to checks in > ReorderBufferFinishPrepared() (refer to comments[1] for why we skip sending > prepared transaction), the issue can be reproduced. > > For instance, when creating a subscription with (copy_data=true, failover=true, > two_phase=true), the slot's two_phase setting starts as false and shifts to > true after table sync (refer to [2] for related comments). During this period, > if a user prepares a transaction where the prepare LSN is less than the > two_phase_at, the same problem could happen. > > Similarly, when setting up a subscription with (copy_data=false, failover=true, > two_phase=true), although two_phase is initially set to true and we wait for > running transactions to finish when building consistent snapshot, a race > condition may still exist: If the snapshot state reaches FULL_SNAPSHOT, it > won't check running transactions further (see SnapBuildFindSnapshot() for > specifics), if a user prepares a transaction at this point, it's possible for > the prepare LSN to be less than the LSN marking the snapshot's consistent > state, causing the same issue. Thank you for the explanation! I agree that the issue happens in these cases. As another idea, I wonder if we could somehow defer to make the synced slot as 'sync-ready' until we can ensure that the slot doesn't have any transactions that are prepared before the point of enabling two_phase. For example, when the slotsync worker fetches the remote slot, it remembers the confirmed_flush_lsn (say LSN-1) if the local slot's two_phase becomes true or the local slot is newly created with enabling two_phase, and then it makes the slot 'sync-ready' once it confirmed that the slot's restart_lsn passed LSN-1. Does it work? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote: > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > On Wed, Apr 2, 2025 at 3:45 PM Masahiko Sawada wrote: > > > > Hi, > > > > > > > > On Mon, Mar 31, 2025 at 4:3 AM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > After further analysis, I was able to reproduce the same issue [1] in > > > > PG 17. > > > > > > > > [1] > > > > - pub: created a slot 'sub' with two_phase=false, then prepared a > > > > transaction > > > > - pub: after some activity, advanced the confirmed_flush_lsn of 'sub', so > it is > > > > greater than prepared txn lsn. > > > > - sub: create subscription with (slot_name='sub', create_slot=false, > > > > failover = true, two_phase=true, copy_data=false); two_phase_at will > > > > be set to the same as confirmed_flush_lsn which is greater than the > > > prepared transaction. > > > > - stop the primary and promote the standby. > > > > - commit the prepared transaction on standby, the following error will be > > > > reported on subscriber: > > > > > > It seems to require elaborate steps to reproduce this issue in v17. I wonder > if we > > > could somehow narrow down the cases that we want to prohibit. The patch > for > > > v17 disallows CREATE SUBSCRIPTION to enable both two_phase and > failover, > > > but I guess that it's still safe if it also creates the replication slot (e.g., > > > create_slot is true). If my understanding is right, we can allow users to > specify > > > both fields if CRETE SUBSCRIPTION creates the slot, and we don't need to > > > disallow that in ReplicationSlotCreate(). > > > > Thanks for reviewing the steps. > > > > The current reproducer aims for simplicity; however, I think it's possible to > reproduce > > the issue even with create_slot = true, although it requires the help of a > > debugger and additional steps. But as long as there are transactions > prepared > > before the two_phase_at position, and they are skipped due to checks in > > ReorderBufferFinishPrepared() (refer to comments[1] for why we skip > sending > > prepared transaction), the issue can be reproduced. > > > > For instance, when creating a subscription with (copy_data=true, > failover=true, > > two_phase=true), the slot's two_phase setting starts as false and shifts to > > true after table sync (refer to [2] for related comments). During this period, > > if a user prepares a transaction where the prepare LSN is less than the > > two_phase_at, the same problem could happen. > > > > Similarly, when setting up a subscription with (copy_data=false, > failover=true, > > two_phase=true), although two_phase is initially set to true and we wait for > > running transactions to finish when building consistent snapshot, a race > > condition may still exist: If the snapshot state reaches FULL_SNAPSHOT, it > > won't check running transactions further (see SnapBuildFindSnapshot() for > > specifics), if a user prepares a transaction at this point, it's possible for > > the prepare LSN to be less than the LSN marking the snapshot's consistent > > state, causing the same issue. > > Thank you for the explanation! I agree that the issue happens in these cases. > > As another idea, I wonder if we could somehow defer to make the synced > slot as 'sync-ready' until we can ensure that the slot doesn't have > any transactions that are prepared before the point of enabling > two_phase. For example, when the slotsync worker fetches the remote > slot, it remembers the confirmed_flush_lsn (say LSN-1) if the local > slot's two_phase becomes true or the local slot is newly created with > enabling two_phase, and then it makes the slot 'sync-ready' once it > confirmed that the slot's restart_lsn passed LSN-1. Does it work? Thanks for the idea! We considered a similar approach in [1] to confirm there is no prepared transactions before two_phase_at, but the issue is that when the two_phase flag is switched from 'false' to 'true' (as in the case with (copy_data=true, failover=true, two_phase=true)). In this case, the slot may have already been marked as sync-ready before the two_phase flag is enabled, as slotsync is unaware of potential future changes to the two_phase flag. Then the slot have to revert to sync-not-ready after the two_phase flag change, which could be difficult for users to understand. What do you think ? [1] https://www.postgresql.org/message-id/OS0PR01MB57161D9BB5409F229564957994AD2%40OS0PR01MB5716.jpnprd01.prod.outlook.com Best Regards, Hou zj
On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote: > > > > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > Thank you for the explanation! I agree that the issue happens in these cases. > > > > As another idea, I wonder if we could somehow defer to make the synced > > slot as 'sync-ready' until we can ensure that the slot doesn't have > > any transactions that are prepared before the point of enabling > > two_phase. For example, when the slotsync worker fetches the remote > > slot, it remembers the confirmed_flush_lsn (say LSN-1) if the local > > slot's two_phase becomes true or the local slot is newly created with > > enabling two_phase, and then it makes the slot 'sync-ready' once it > > confirmed that the slot's restart_lsn passed LSN-1. Does it work? > > Thanks for the idea! > > We considered a similar approach in [1] to confirm there is no prepared > transactions before two_phase_at, but the issue is that when the two_phase flag > is switched from 'false' to 'true' (as in the case with (copy_data=true, > failover=true, two_phase=true)). In this case, the slot may have already been > marked as sync-ready before the two_phase flag is enabled, as slotsync is > unaware of potential future changes to the two_phase flag. > This can happen because when copy_data is true, tablesync can take a long time to complete the sync and in the meantime, slot without a two_phase flag would have been synced to standby. Such a slot would be marked as sync-ready even if we follow the calculation proposed by Sawada-san. Note that we enable two_phase once all the tables are in ready state (See run_apply_worker() and comments atop worker.c (TWO_PHASE TRANSACTIONS)). -- With Regards, Amit Kapila.
On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote: > > > > > > > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > Thank you for the explanation! I agree that the issue happens in these cases. > > > > > > As another idea, I wonder if we could somehow defer to make the synced > > > slot as 'sync-ready' until we can ensure that the slot doesn't have > > > any transactions that are prepared before the point of enabling > > > two_phase. For example, when the slotsync worker fetches the remote > > > slot, it remembers the confirmed_flush_lsn (say LSN-1) if the local > > > slot's two_phase becomes true or the local slot is newly created with > > > enabling two_phase, and then it makes the slot 'sync-ready' once it > > > confirmed that the slot's restart_lsn passed LSN-1. Does it work? > > > > Thanks for the idea! > > > > We considered a similar approach in [1] to confirm there is no prepared > > transactions before two_phase_at, but the issue is that when the two_phase flag > > is switched from 'false' to 'true' (as in the case with (copy_data=true, > > failover=true, two_phase=true)). In this case, the slot may have already been > > marked as sync-ready before the two_phase flag is enabled, as slotsync is > > unaware of potential future changes to the two_phase flag. > > This can happen because when copy_data is true, tablesync can take a > long time to complete the sync and in the meantime, slot without a > two_phase flag would have been synced to standby. Such a slot would be > marked as sync-ready even if we follow the calculation proposed by > Sawada-san. Note that we enable two_phase once all the tables are in > ready state (See run_apply_worker() and comments atop worker.c > (TWO_PHASE TRANSACTIONS)). Right. It doesn't make sense to make the slot not-sync-ready and then back to sync-ready. While I agree with the approach for HEAD and it seems difficult to find a solution, I'm concerned that disallowing to use both failover and two_phase in a minor release would affect users much. Users who are already using that combination might end up needing to re-think their system architecture. So I'm trying to narrow down use cases where we're going to prohibit or to find workarounds. If we agree with the fix for HEAD, we can push the fix for HEAD first, which would be better to be done sooner as it needs to bump the catversion. We can discuss the ideas and workarounds for v17 later. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Apr 3, 2025 at 11:08 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote: > > > > > > > > > > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > Thank you for the explanation! I agree that the issue happens in these cases. > > > > > > > > As another idea, I wonder if we could somehow defer to make the synced > > > > slot as 'sync-ready' until we can ensure that the slot doesn't have > > > > any transactions that are prepared before the point of enabling > > > > two_phase. For example, when the slotsync worker fetches the remote > > > > slot, it remembers the confirmed_flush_lsn (say LSN-1) if the local > > > > slot's two_phase becomes true or the local slot is newly created with > > > > enabling two_phase, and then it makes the slot 'sync-ready' once it > > > > confirmed that the slot's restart_lsn passed LSN-1. Does it work? > > > > > > Thanks for the idea! > > > > > > We considered a similar approach in [1] to confirm there is no prepared > > > transactions before two_phase_at, but the issue is that when the two_phase flag > > > is switched from 'false' to 'true' (as in the case with (copy_data=true, > > > failover=true, two_phase=true)). In this case, the slot may have already been > > > marked as sync-ready before the two_phase flag is enabled, as slotsync is > > > unaware of potential future changes to the two_phase flag. > > > > This can happen because when copy_data is true, tablesync can take a > > long time to complete the sync and in the meantime, slot without a > > two_phase flag would have been synced to standby. Such a slot would be > > marked as sync-ready even if we follow the calculation proposed by > > Sawada-san. Note that we enable two_phase once all the tables are in > > ready state (See run_apply_worker() and comments atop worker.c > > (TWO_PHASE TRANSACTIONS)). > > Right. It doesn't make sense to make the slot not-sync-ready and then > back to sync-ready. > > While I agree with the approach for HEAD and it seems difficult to > find a solution, I'm concerned that disallowing to use both failover > and two_phase in a minor release would affect users much. Users who > are already using that combination might end up needing to re-think > their system architecture. So I'm trying to narrow down use cases > where we're going to prohibit or to find workarounds. > > If we agree with the fix for HEAD, we can push the fix for HEAD first, > which would be better to be done sooner as it needs to bump the > catversion. We can discuss the ideas and workarounds for v17 later. > Thanks, I'll push the patch for HEAD and then keep thinking if we have a better way to deal with the problem in 17. BTW, the problem for 17 can happen in a much narrower set of cases as explained in the emails above. -- With Regards, Amit Kapila.
On Thu, Apr 3, 2025 at 1:38 PM Masahiko Sawada wrote: > > On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila <amit.kapila16@gmail.com> > wrote: > > > > On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote: > > > > > > > > > > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > Thank you for the explanation! I agree that the issue happens in these > cases. Just to share the steps to reproduce the issue when creating the subscription With (create_slot=true, failover = true, two_phase=true, copy_data=false) for reference. - pub: begin a txn A. This is to stop the slot from building a consistent snapshot immediately. - sub: create subscription with (slot_name='sub', create_slot=true, failover = true, two_phase=true, copy_data=false); - pub: Attach to the walsender that will create the slot and add a checkpoint in SnapBuildFindSnapshot() -> (builder->state = SNAPBUILD_FULL_SNAPSHOT;). For now, the state should be BUILDING_SNAPSHOT. - pub: begin another txn B and commit txn A. The snapshot should reach FULL_SNAPSHOT now. - pub: prepared a txn C and then commit txn B. - pub: release the checkpoint in SnapBuildFindSnapshot(), then it should reach the SNAPBUILD_CONSISTENT. Now we have a prepared txn whose prepare_lsn is less than the two_phase_at. - stop the primary and promote the standby. - sub: alter the subscription to use the new primary as the publisher. - commit the prepared transaction on new primary, the following error will be reported on subscriber: LOG: logical replication apply worker for subscription "sub" has started ERROR: prepared transaction with identifier "pg_gid_16398_764" does not exist. > > > > > > > > As another idea, I wonder if we could somehow defer to make the > > > > synced slot as 'sync-ready' until we can ensure that the slot > > > > doesn't have any transactions that are prepared before the point > > > > of enabling two_phase. For example, when the slotsync worker > > > > fetches the remote slot, it remembers the confirmed_flush_lsn (say > > > > LSN-1) if the local slot's two_phase becomes true or the local > > > > slot is newly created with enabling two_phase, and then it makes > > > > the slot 'sync-ready' once it confirmed that the slot's restart_lsn passed > LSN-1. Does it work? > > > > > > Thanks for the idea! > > > > > > We considered a similar approach in [1] to confirm there is no > > > prepared transactions before two_phase_at, but the issue is that > > > when the two_phase flag is switched from 'false' to 'true' (as in > > > the case with (copy_data=true, failover=true, two_phase=true)). In > > > this case, the slot may have already been marked as sync-ready > > > before the two_phase flag is enabled, as slotsync is unaware of potential > future changes to the two_phase flag. > > > > This can happen because when copy_data is true, tablesync can take a > > long time to complete the sync and in the meantime, slot without a > > two_phase flag would have been synced to standby. Such a slot would be > > marked as sync-ready even if we follow the calculation proposed by > > Sawada-san. Note that we enable two_phase once all the tables are in > > ready state (See run_apply_worker() and comments atop worker.c > > (TWO_PHASE TRANSACTIONS)). > > Right. It doesn't make sense to make the slot not-sync-ready and then back to > sync-ready. > > While I agree with the approach for HEAD and it seems difficult to find a > solution, I'm concerned that disallowing to use both failover and two_phase in > a minor release would affect users much. Users who are already using that > combination might end up needing to re-think their system architecture. So I'm > trying to narrow down use cases where we're going to prohibit or to find > workarounds. > > If we agree with the fix for HEAD, we can push the fix for HEAD first, which > would be better to be done sooner as it needs to bump the catversion. We can > discuss the ideas and workarounds for v17 later. Agreed. I will think more on it. One workaround could be skipping the prepared transaction when such an issue arises, followed by manually replicating the skipped changes. Best Regards, Hou zj
On Thu, Apr 3, 2025 at 3:16 PM Zhijie Hou (Fujitsu) wrote: > > On Thu, Apr 3, 2025 at 1:38 PM Masahiko Sawada wrote: > > > > > On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila <amit.kapila16@gmail.com> > > wrote: > > > > > > On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote: > > > > > > > > > > > > > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) > > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > Thank you for the explanation! I agree that the issue happens in > > > > > these > > cases. > > > > > > > > > > As another idea, I wonder if we could somehow defer to make the > > > > > synced slot as 'sync-ready' until we can ensure that the slot > > > > > doesn't have any transactions that are prepared before the point > > > > > of enabling two_phase. For example, when the slotsync worker > > > > > fetches the remote slot, it remembers the confirmed_flush_lsn > > > > > (say > > > > > LSN-1) if the local slot's two_phase becomes true or the local > > > > > slot is newly created with enabling two_phase, and then it makes > > > > > the slot 'sync-ready' once it confirmed that the slot's > > > > > restart_lsn passed > > LSN-1. Does it work? > > > > > > > > Thanks for the idea! > > > > > > > > We considered a similar approach in [1] to confirm there is no > > > > prepared transactions before two_phase_at, but the issue is that > > > > when the two_phase flag is switched from 'false' to 'true' (as in > > > > the case with (copy_data=true, failover=true, two_phase=true)). In > > > > this case, the slot may have already been marked as sync-ready > > > > before the two_phase flag is enabled, as slotsync is unaware of > > > > potential > > future changes to the two_phase flag. > > > > > > This can happen because when copy_data is true, tablesync can take a > > > long time to complete the sync and in the meantime, slot without a > > > two_phase flag would have been synced to standby. Such a slot would > > > be marked as sync-ready even if we follow the calculation proposed > > > by Sawada-san. Note that we enable two_phase once all the tables are > > > in ready state (See run_apply_worker() and comments atop worker.c > > > (TWO_PHASE TRANSACTIONS)). > > > > Right. It doesn't make sense to make the slot not-sync-ready and then > > back to sync-ready. > > > > While I agree with the approach for HEAD and it seems difficult to > > find a solution, I'm concerned that disallowing to use both failover > > and two_phase in a minor release would affect users much. Users who > > are already using that combination might end up needing to re-think > > their system architecture. So I'm trying to narrow down use cases > > where we're going to prohibit or to find workarounds. We analyzed more for the backbranch fix, and here is a summary of different approaches that could be used for PG17. ---------- Approach 1 ---------- This is the original approach implemented in V4 patch. In this approach, we entirely disallow enabling both failover and the two-phase feature together for a replication slot or subscription. pros: This restriction is simple to implement and easy for users to comprehend. cons: Users would be unable to use the two-phase feature in conjunction with failover. Following the upgrade to a new release with this fix, existing subscriptions that have both failover and two-phase enabled would require manual re-creation, which is time-consuming. ---------- Approach 2 ---------- Instead of disallowing the use of two-phase and failover together, a more flexible strategy could be only restrict failover for slots with two-phase enabled when there's a possibility of existing prepared transactions before the two_phase_at that are not yet replicated. During slot creation with two-phase and failover, we could check for any decoded prepared transactions when determining the decoding start point (DecodingContextFindStartpoint). For subsequent attempts to alter failover to true, we ensure that two_phase_at is less than restart_lsn, indicating that all prepared transactions have been committed and replicated, thus the bug would not happen. pros: This method minimizes restrictions for users. Especially during slot creation with (two_phase=on, failover=on), as it’s uncommon for transactions to prepare during consistent snapshot creation, the restriction becomes almost unnoticeable. Users are not always forced to re-create subscriptions post-upgrade. cons: The logic involved for (restart_lsn > two_phase_at) might be less intuitive for users. After upgrading, it's recommended that users verify whether two_phase_at for a source slot is less than restart_lsn of the synced slot. If it isn't, users should be careful when using the synced slot on a standby. This might be complex to understand. ---------- Approach 3 ---------- This approach is similar to Approach 2 but simplifies some aspects by avoiding checks for prepared transactions during slot creation. It disallows enabling both failover and twophase during slot creation, but permits altering failover to true once ensured that no prepared transaction exists (restart_lsn > two_phase_at). pros: Like Approach 2, this reduces user restrictions compared to completely disallowing failover and two-phase usage together. Users are not always forced to re-create subscriptions post-upgrade. cons: User still could not enable failover and twophase when creating a slot. Similar to approach 1, the check (restart_lsn > two_phase_at) might be less intuitive for users. After upgrading, it's recommended that users verify whether two_phase_at for a source slot is less than restart_lsn of the synced slot. If it isn't, users should be careful when using the synced slot on a standby. This might be complex to understand. Best Regards, Hou zj
HI, Please find the patches attached for all three approaches. On Wed, Apr 9, 2025 at 10:45 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Thu, Apr 3, 2025 at 3:16 PM Zhijie Hou (Fujitsu) wrote: > > > > On Thu, Apr 3, 2025 at 1:38 PM Masahiko Sawada wrote: > > > > > > > > On Wed, Apr 2, 2025 at 7:58 PM Amit Kapila <amit.kapila16@gmail.com> > > > wrote: > > > > > > > > On Thu, Apr 3, 2025 at 7:50 AM Zhijie Hou (Fujitsu) > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > On Thu, Apr 3, 2025 at 3:30 AM Masahiko Sawada wrote: > > > > > > > > > > > > > > > > > On Wed, Apr 2, 2025 at 6:33 AM Zhijie Hou (Fujitsu) > > > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > > > Thank you for the explanation! I agree that the issue happens in > > > > > > these > > > cases. > > > > > > > > > > > > As another idea, I wonder if we could somehow defer to make the > > > > > > synced slot as 'sync-ready' until we can ensure that the slot > > > > > > doesn't have any transactions that are prepared before the point > > > > > > of enabling two_phase. For example, when the slotsync worker > > > > > > fetches the remote slot, it remembers the confirmed_flush_lsn > > > > > > (say > > > > > > LSN-1) if the local slot's two_phase becomes true or the local > > > > > > slot is newly created with enabling two_phase, and then it makes > > > > > > the slot 'sync-ready' once it confirmed that the slot's > > > > > > restart_lsn passed > > > LSN-1. Does it work? > > > > > > > > > > Thanks for the idea! > > > > > > > > > > We considered a similar approach in [1] to confirm there is no > > > > > prepared transactions before two_phase_at, but the issue is that > > > > > when the two_phase flag is switched from 'false' to 'true' (as in > > > > > the case with (copy_data=true, failover=true, two_phase=true)). In > > > > > this case, the slot may have already been marked as sync-ready > > > > > before the two_phase flag is enabled, as slotsync is unaware of > > > > > potential > > > future changes to the two_phase flag. > > > > > > > > This can happen because when copy_data is true, tablesync can take a > > > > long time to complete the sync and in the meantime, slot without a > > > > two_phase flag would have been synced to standby. Such a slot would > > > > be marked as sync-ready even if we follow the calculation proposed > > > > by Sawada-san. Note that we enable two_phase once all the tables are > > > > in ready state (See run_apply_worker() and comments atop worker.c > > > > (TWO_PHASE TRANSACTIONS)). > > > > > > Right. It doesn't make sense to make the slot not-sync-ready and then > > > back to sync-ready. > > > > > > While I agree with the approach for HEAD and it seems difficult to > > > find a solution, I'm concerned that disallowing to use both failover > > > and two_phase in a minor release would affect users much. Users who > > > are already using that combination might end up needing to re-think > > > their system architecture. So I'm trying to narrow down use cases > > > where we're going to prohibit or to find workarounds. > > We analyzed more for the backbranch fix, and here is a summary of different > approaches that could be used for PG17. > > ---------- > Approach 1 > ---------- > > This is the original approach implemented in V4 patch. > > In this approach, we entirely disallow enabling both failover and the two-phase > feature together for a replication slot or subscription. > > pros: > > This restriction is simple to implement and easy for users to comprehend. > > cons: > > Users would be unable to use the two-phase feature in conjunction with > failover. > > Following the upgrade to a new release with this fix, existing subscriptions > that have both failover and two-phase enabled would require manual re-creation, > which is time-consuming. > Patch "v5_aprch_1-0001" implements the above Approach 1. > ---------- > Approach 2 > ---------- > > Instead of disallowing the use of two-phase and failover together, a more > flexible strategy could be only restrict failover for slots with two-phase > enabled when there's a possibility of existing prepared transactions before the > two_phase_at that are not yet replicated. During slot creation with two-phase > and failover, we could check for any decoded prepared transactions when > determining the decoding start point (DecodingContextFindStartpoint). For > subsequent attempts to alter failover to true, we ensure that two_phase_at is > less than restart_lsn, indicating that all prepared transactions have been > committed and replicated, thus the bug would not happen. > > pros: > > This method minimizes restrictions for users. Especially during slot creation > with (two_phase=on, failover=on), as it’s uncommon for transactions to prepare > during consistent snapshot creation, the restriction becomes almost > unnoticeable. > > Users are not always forced to re-create subscriptions post-upgrade. > > cons: > > The logic involved for (restart_lsn > two_phase_at) might be less intuitive for > users. > > After upgrading, it's recommended that users verify whether two_phase_at for a > source slot is less than restart_lsn of the synced slot. If it isn't, users > should be careful when using the synced slot on a standby. This might be > complex to understand. > Patch "v5_aprch_2-0001" implements the above Approach 2. > ---------- > Approach 3 > ---------- > > This approach is similar to Approach 2 but simplifies some aspects by avoiding > checks for prepared transactions during slot creation. It disallows enabling > both failover and twophase during slot creation, but permits altering failover > to true once ensured that no prepared transaction exists (restart_lsn > > two_phase_at). > > pros: > > Like Approach 2, this reduces user restrictions compared to completely > disallowing failover and two-phase usage together. > > Users are not always forced to re-create subscriptions post-upgrade. > > cons: > > User still could not enable failover and twophase when creating a slot. > > Similar to approach 1, the check (restart_lsn > two_phase_at) might be less > intuitive for users. > > After upgrading, it's recommended that users verify whether two_phase_at for a > source slot is less than restart_lsn of the synced slot. If it isn't, users > should be careful when using the synced slot on a standby. This might be > complex to understand. > Patch "v5_aprch_3-0001" implements the above Approach 3. Thanks Hou-san for implementing approach-2 and providing the patch. -- Thanks, Nisha
Attachment
Hi, The recently added tests for slotsync on two-phase enabled slots failed[1] due to a broken assertion triggered while decoding the COMMIT PREPARED record on the promoted standby. ----- Analysis ----- if ((txn->final_lsn < two_phase_at) && is_commit) { /* * txn must have been marked as a prepared transaction and skipped but * not sent a prepare. Also, the prepare info must have been updated * in txn even if we skip prepare. */ Assert((txn->txn_flags & RBTXN_PREPARE_STATUS_MASK) == (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE)); I think the issue stem from the PREPARED transaction, which was skipped on the primary, not being skipped on the promoted standby. The root cause appears to be the latest 'confirmed_flush' (0/6005118) of the source slot not being synced to the standby. This results in the confirmed_flush (0/6004F90) of the synced slot being less than the synced two_phase_at field (0/6005118). Consequently, the PREPARE record's LSN exceeds the synced confirmed_flush during decoding, causing it to be incorrectly decoded and sent to the subscriber. Thus, when decoding COMMIT PREPARED, the PREPARE record is found before two_phase_at but wasn't skipped. -- The LOGs that proves the confirmed_flush is not synced: decoding on original slot (primary): LOG: 0/6004F90 has been already streamed, forwarding to 0/6005118 ... DETAIL: Streaming transactions committing after 0/6005118, reading WAL from 0/60049C8. decoding on synced slot (promted standby) - failed run: DETAIL: Streaming transactions committing after 0/6004F90, reading WAL from 0/60049C8. decoding on synced slot (promted standby) - success run: LOG: 0/6004F90 has been already streamed, forwarding to 0/6005118 ... DETAIL: Streaming transactions committing after 0/6005118, reading WAL from 0/60049C8. -- The logs above clearly show that during the test failure, the starting position (6004F90) was less than that of a successful run. Additionally, it was lower than the starting position of the remote slot on the primary, indicating a failure to synchronize confirmed_lsn. The reason confirmed_flush isn't synced should stem from the following check, which skip the syncing of the confirmed_flush value. I also reproduced the assertion failure by creating a scenario that leads to sync omission due to this check: /* * Don't overwrite if we already have a newer catalog_xmin and * restart_lsn. */ if (remote_slot->restart_lsn < slot->data.restart_lsn || TransactionIdPrecedes(remote_slot->catalog_xmin, slot->data.catalog_xmin)) { This check triggered because the synced catalog_xmin surpassed that of the source slot (It can be seen from the log that the restart_lsn was synced to the same value) ). This situation arises due to several factors: On the publisher(primary), the slot may retain an older catalog_xmin and restart_lsn because it is being consumed by a walsender. In this scenario, if the apply worker does not immediately confirm that changes have been flushed, the walsender advances the catalog_xmin/restart_lsn slowly. It sets an old value for candidate_catalog_xmin/candidate_restart_lsn and would not update them until the apply worker confirms via LogicalConfirmReceivedLocation. However, the slotsync worker has the opportunity to begin WAL reading from a more recent point, potentially advancing catalog_xmin/restart_lsn to newer values. And it's also possible to advance only catalog_xmin while keep restart_lsn unchanged, by starting a transaction before the running_xact record. In this case, it would pass builder->last_serialized_snapshot as restart_lsn to LogicalIncreaseRestartDecodingForSlot(), but last_serialized_snapshot would point to the position of the last running_xact record instead of the current one, or the value would be NULL if no snapshot has been serialized before, so it would not advance restart_lsn. The advancement of catalog_xmin is not blocked by running transaction, as it would either use the XID of the running transaction or the oldestRunningXid as candidate. ----- Reproduction ----- Accompanying this analysis is a script to reproduce the issue. In these scripts, two running_xacts were logged post-slot creation on the publisher. Below is a detailed progression of restart_lsn/catalog_xmin advancement observed on both the publisher and standby: The process is : slot creation -> log running_xact -> prepare a txn -> log running_xact The process of advancing the catalog_xmin/restart_lsn on publisher: slot creation ... 1. 0/301B880 - running_xacts (no running txn, oldestrunningxid = 755) got new catalog xmin 755 at 0/301B880 LOG: got new restart lsn 0/301B880 at 0/301B880 2. 0/301B8B8 - PREPAPRE a TRANSACTION. 3. 0/301BA20 - running_xacts (no running txn, oldestrunningxid = 756) failed to increase restart lsn: proposed 0/301B880, after 0/301BA20, current candidate 0/301B880, current after 0/301B880, flushed up to 0/301B810 final slot data: catalog_xmin | 755 restart_lsn | 0/301B880 confirmed_flush_lsn | 0/301BAC8 The process of advancing the catalog_xmin/restart_lsn on standby (during slot sync): slot creation ... 1. 0/301B880 - running_xacts (no running txn, oldestrunningxid = 755) building consistent snapshot, so will skip advancing the catalog_xmin/restart_lsn ... 2. 0/301BA20 - running_xacts (no running txn, oldestrunningxid = 756) got new catalog xmin 756 at 0/301BA20 cannot get new restart_lsn as running txn exists and didn't serialize snapshot yet. first synced slot data: catalog_xmin | 756 restart_lsn | 0/301B880 confirmed_flush_lsn | 0/301BAC8 The second slot sync cycle would skip updating confirmed_lsn because catalog_xmin is newer. ----- Fix ----- I think we should keep the confirmed_flush even if the previous synced restart_lsn/catalog_xmin is newer. Attachments include a patch for the same. Best Regards, Hou zj
Attachment
On Thu, Apr 17, 2025 at 8:44 PM Zhijie Hou (Fujitsu) wrote: > > Hi, > > The recently added tests for slotsync on two-phase enabled slots > failed[1] due to a broken assertion triggered while decoding the > COMMIT PREPARED record on the promoted standby. Here is the missing link: [1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sidewinder&dt=2025-04-09%2010%3A36%3A46 Best Regards, Hou zj
On Thu, Apr 17, 2025 at 6:14 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > Hi, > > The recently added tests for slotsync on two-phase enabled slots failed[1] due > to a broken assertion triggered while decoding the COMMIT PREPARED record on > the promoted standby. > > ----- > Analysis > ----- > > if ((txn->final_lsn < two_phase_at) && is_commit) > { > /* > * txn must have been marked as a prepared transaction and skipped but > * not sent a prepare. Also, the prepare info must have been updated > * in txn even if we skip prepare. > */ > Assert((txn->txn_flags & RBTXN_PREPARE_STATUS_MASK) == > (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE)); > > I think the issue stem from the PREPARED transaction, which was skipped on the > primary, not being skipped on the promoted standby. The root cause appears to > be the latest 'confirmed_flush' (0/6005118) of the source slot not being synced > to the standby. This results in the confirmed_flush (0/6004F90) of the synced > slot being less than the synced two_phase_at field (0/6005118). Consequently, > the PREPARE record's LSN exceeds the synced confirmed_flush during > decoding, causing it to be incorrectly decoded and sent to the subscriber. Thus, when > decoding COMMIT PREPARED, the PREPARE record is found before two_phase_at but > wasn't skipped. > > -- The LOGs that proves the confirmed_flush is not synced: > > decoding on original slot (primary): > LOG: 0/6004F90 has been already streamed, forwarding to 0/6005118 > ... > DETAIL: Streaming transactions committing after 0/6005118, reading WAL from 0/60049C8. > > decoding on synced slot (promted standby) - failed run: > DETAIL: Streaming transactions committing after 0/6004F90, reading WAL from 0/60049C8. > > decoding on synced slot (promted standby) - success run: > LOG: 0/6004F90 has been already streamed, forwarding to 0/6005118 > ... > DETAIL: Streaming transactions committing after 0/6005118, reading WAL from 0/60049C8. > -- > > The logs above clearly show that during the test failure, the starting position > (6004F90) was less than that of a successful run. Additionally, it was lower > than the starting position of the remote slot on the primary, indicating a > failure to synchronize confirmed_lsn. > > The reason confirmed_flush isn't synced should stem from the following check, > which skip the syncing of the confirmed_flush value. I also reproduced the > assertion failure by creating a scenario that leads to sync omission due to > this check: > > /* > * Don't overwrite if we already have a newer catalog_xmin and > * restart_lsn. > */ > if (remote_slot->restart_lsn < slot->data.restart_lsn || > TransactionIdPrecedes(remote_slot->catalog_xmin, > slot->data.catalog_xmin)) > { > > > This check triggered because the synced catalog_xmin surpassed that of the > source slot (It can be seen from the log that the restart_lsn was synced to the > same value) ). This situation arises due to several factors: > > On the publisher(primary), the slot may retain an older catalog_xmin and > restart_lsn because it is being consumed by a walsender. In this scenario, if > the apply worker does not immediately confirm that changes have been flushed, > the walsender advances the catalog_xmin/restart_lsn slowly. It sets an old > value for candidate_catalog_xmin/candidate_restart_lsn and would not update > them until the apply worker confirms via LogicalConfirmReceivedLocation. > > However, the slotsync worker has the opportunity to begin WAL reading from a > more recent point, potentially advancing catalog_xmin/restart_lsn to newer > values. > > And it's also possible to advance only catalog_xmin while keep restart_lsn > unchanged, by starting a transaction before the running_xact record. In this > case, it would pass builder->last_serialized_snapshot as restart_lsn to > LogicalIncreaseRestartDecodingForSlot(), but last_serialized_snapshot would > point to the position of the last running_xact record instead of the current > one, or the value would be NULL if no snapshot has been serialized before, so > it would not advance restart_lsn. The advancement of catalog_xmin is not > blocked by running transaction, as it would either use the XID of the running > transaction or the oldestRunningXid as candidate. > Your analysis appears correct to me. > > ----- > Fix > ----- > > I think we should keep the confirmed_flush even if the previous synced > restart_lsn/catalog_xmin is newer. Attachments include a patch for the same. > This will fix the case we are facing but adds a new rule for slot synchronization. Can we think of a simpler way to fix this by avoiding updating other slot fields (like two_phase, two_phase_at) if restart_lsn or catalog_xmin of the local slot is ahead of the remote slot? -- With Regards, Amit Kapila.
On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > > ---------- > Approach 2 > ---------- > > Instead of disallowing the use of two-phase and failover together, a more > flexible strategy could be only restrict failover for slots with two-phase > enabled when there's a possibility of existing prepared transactions before the > two_phase_at that are not yet replicated. During slot creation with two-phase > and failover, we could check for any decoded prepared transactions when > determining the decoding start point (DecodingContextFindStartpoint). For > subsequent attempts to alter failover to true, we ensure that two_phase_at is > less than restart_lsn, indicating that all prepared transactions have been > committed and replicated, thus the bug would not happen. > > pros: > > This method minimizes restrictions for users. Especially during slot creation > with (two_phase=on, failover=on), as it’s uncommon for transactions to prepare > during consistent snapshot creation, the restriction becomes almost > unnoticeable. I think this approach can work for the transactions that are prepared while the slot is created. But if I understand the problem correctly, while the initial table sync is performing, the slot's two_phase is still false, so we need to deal with the transactions that are prepared during the initial table sync too. What do you think? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > > > ---------- > > Approach 2 > > ---------- > > > > Instead of disallowing the use of two-phase and failover together, a more > > flexible strategy could be only restrict failover for slots with two-phase > > enabled when there's a possibility of existing prepared transactions before > the > > two_phase_at that are not yet replicated. During slot creation with > two-phase > > and failover, we could check for any decoded prepared transactions when > > determining the decoding start point (DecodingContextFindStartpoint). For > > subsequent attempts to alter failover to true, we ensure that two_phase_at is > > less than restart_lsn, indicating that all prepared transactions have been > > committed and replicated, thus the bug would not happen. > > > > pros: > > > > This method minimizes restrictions for users. Especially during slot creation > > with (two_phase=on, failover=on), as it’s uncommon for transactions to > prepare > > during consistent snapshot creation, the restriction becomes almost > > unnoticeable. > > I think this approach can work for the transactions that are prepared > while the slot is created. But if I understand the problem correctly, > while the initial table sync is performing, the slot's two_phase is > still false, so we need to deal with the transactions that are > prepared during the initial table sync too. What do you think? > Yes, I agree that we need to restrict this case too. Given that we haven't started decoding when setting two_phase=true during CreateDecodingContext() after tablesync, we could check prepared transactions afterwards during decoding. This could involve reporting an ERROR when skipping a prepared transaction during decoding if its prepare LSN is less than two_phase_at. Alternatively, a simpler method would be to prevent this situation entirely during the CREATE SUBSCRIPTION command. For example, we could restrict slots created with failover set to true and twophase is later modified to true after tablesync. Although the simpler check is more user-visible, it may offer less flexibility. What do you think ? Best Regards, Hou zj
On Fri, Apr 18, 2025 at 9:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Thu, Apr 17, 2025 at 6:14 PM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > ----- > > Fix > > ----- > > > > I think we should keep the confirmed_flush even if the previous synced > > restart_lsn/catalog_xmin is newer. Attachments include a patch for the same. > > > > This will fix the case we are facing but adds a new rule for slot > synchronization. Can we think of a simpler way to fix this by avoiding > updating other slot fields (like two_phase, two_phase_at) if > restart_lsn or catalog_xmin of the local slot is ahead of the remote > slot? > Thinking more about this problem, it seems to me that if the catalog_xmin of synced slot is allowed to be ahead than the remote_slot when there is still an open (prepared transaction), it could cause data loss. I mean that after the promotion, some of the required catalog rows could be removed, and decoding corresponding changes (changes from tables affected by DDL) could give unexpected results. Those would be protected on primary/publisher because the catalog_xmin on it was still accurate and behind. If this theory turns out to be true, then this is a drawback/bug of the existing fast_forward mode code. -- With Regards, Amit Kapila.
On Tue, Apr 22, 2025 at 11:23 AM Amit Kapila wrote: > > On Fri, Apr 18, 2025 at 9:58 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Thu, Apr 17, 2025 at 6:14 PM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > ----- > > > Fix > > > ----- > > > > > > I think we should keep the confirmed_flush even if the previous > > > synced restart_lsn/catalog_xmin is newer. Attachments include a patch > for the same. > > > > > > > This will fix the case we are facing but adds a new rule for slot > > synchronization. Can we think of a simpler way to fix this by avoiding > > updating other slot fields (like two_phase, two_phase_at) if > > restart_lsn or catalog_xmin of the local slot is ahead of the remote > > slot? > > > > Thinking more about this problem, it seems to me that if the catalog_xmin of > synced slot is allowed to be ahead than the remote_slot when there is still an > open (prepared transaction), it could cause data loss. I mean that after the > promotion, some of the required catalog rows could be removed, and decoding > corresponding changes (changes from tables affected by DDL) could give > unexpected results. Those would be protected on primary/publisher because > the catalog_xmin on it was still accurate and behind. If this theory turns out to > be true, then this is a drawback/bug of the existing fast_forward mode code. I agree that could be a problem. Upon further analysis, I find that the core problem is we do not build a base snapshot when decoding changes during fast forward mode, preventing reference to the minimum transaction ID that remains visible in the snapshot when determining the candidate for catalog_xmin. As a result, catalog_xmin was directly advanced to the oldest running transaction ID found in the latest running_xacts record. In code-level, SnapBuildProcessChange -> ReorderBufferSetBaseSnapshot could not be reached during fast forward decoding, resulting in rb->txns_by_base_snapshot_lsn being NULL. When advancing catalog_xmin, the system attempts to refer to rb->txns_by_base_snapshot_lsn via ReorderBufferGetOldestXmin(). However, since rb->txns_by_base_snapshot_lsn is NULL, it defaults to directly using running->oldestRunningXid as the candidate for catalog_xmin. For reference, see the implementation details in SnapBuildProcessRunningXacts. I think this is a general issue in fast forward decoding, which not only affect slotsync. I can reproduce the issue using slot_advance SQL API as well. See the attachment for a patch that includes a test to prove that the catalog data that are still required would be removed after premature catalog_xmin advancement during fast forward decoding. If you test that patch on HEAD, you would find that the output missed a column due to vacuum removal. I will start a new thread to report and fix this general. Best Regards, Hou zj
Attachment
On Fri, Apr 11, 2025 at 1:03 PM Nisha Moond <nisha.moond412@gmail.com> wrote: > > > Patch "v5_aprch_3-0001" implements the above Approach 3. > > Thanks Hou-san for implementing approach-2 and providing the patch. > I find Approach 2 better than Approach1. Yet to review Approach3. Please find my initial comments: Approach#1: 1) When slot is created with failover enabled and later we try to create a subscription using that pre-created slot with two-phase enabled, it does not error out. But it silently retains two_phase of the existing slot as false. Please check if an error is possible in this case too. Approach #2: 1) + ReorderBufferSkipPrepare(reorder, parsed.twophase_xid); In xact_decode, why do we have a new call to ReorderBufferSkipPrepare, can you please add some comments here? 2) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("\"%s\" and \"%s\" are mutually exclusive options when \"%s\" is specified", So like approach 1, here as well we disallow creating subscriptions with both failover and two_phase enabled when create_slot and copy_data is true? But users can alter the sub later to allow failover for two-phase enabled slot provided there are no pending PREPARE txns which are not yet consumed by the subscriber. Is my understanding correct? Can you please tell all the ways using which the user can enable both failover and two_phase together? The patch msg is not that clear. It will be good to add such details in patch message. 3) + /* + * Do not allow users to enable the failover and two_phase options together + * when create_slot is specified. + * + * See comments atop the similar check in DecodingContextFindStartpoint() + * for a detailed reason. + */ + if (!opts.create_slot && opts.twophase && opts.failover) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("\"%s\" and \"%s\" are mutually exclusive options when \"%s\" is specified", + Is the check '!opts.create_slot' correct? The error msg and comment says otherwise. thanks Shveta
On Mon, Apr 21, 2025 at 8:44 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote: > > On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > ---------- > > > Approach 2 > > > ---------- > > > > > > Instead of disallowing the use of two-phase and failover together, a more > > > flexible strategy could be only restrict failover for slots with two-phase > > > enabled when there's a possibility of existing prepared transactions before > > the > > > two_phase_at that are not yet replicated. During slot creation with > > two-phase > > > and failover, we could check for any decoded prepared transactions when > > > determining the decoding start point (DecodingContextFindStartpoint). For > > > subsequent attempts to alter failover to true, we ensure that two_phase_at is > > > less than restart_lsn, indicating that all prepared transactions have been > > > committed and replicated, thus the bug would not happen. > > > > > > pros: > > > > > > This method minimizes restrictions for users. Especially during slot creation > > > with (two_phase=on, failover=on), as it’s uncommon for transactions to > > prepare > > > during consistent snapshot creation, the restriction becomes almost > > > unnoticeable. > > > > I think this approach can work for the transactions that are prepared > > while the slot is created. But if I understand the problem correctly, > > while the initial table sync is performing, the slot's two_phase is > > still false, so we need to deal with the transactions that are > > prepared during the initial table sync too. What do you think? > > > > Yes, I agree that we need to restrict this case too. Given that we haven't > started decoding when setting two_phase=true during CreateDecodingContext() > after tablesync, we could check prepared transactions afterwards during > decoding. This could involve reporting an ERROR when skipping a prepared > transaction during decoding if its prepare LSN is less than two_phase_at. > It will make it difficult for users to detect it as this happens at a later point of time. > Alternatively, a simpler method would be to prevent this situation entirely > during the CREATE SUBSCRIPTION command. For example, we could restrict slots > created with failover set to true and twophase is later modified to true after > tablesync. Although the simpler check is more user-visible, it may offer less > flexibility. > I agree with your point, but OTOH, I am also afraid of adding too many smart checks in the back-branch. If we follow what you say here, then users have the following ways in PG17 to enable both failover and two_phase. (a) During Create Subscription, users can set both 'failover' and 'two_phase', if 'copy_data' is false, or (b), if 'copy_data' is true, during Create Subscription, then users can enable 'two_phase' and wait for it to be enabled. Then use Alter Subscription to set 'failover'. -- With Regards, Amit Kapila.
On Tue, Apr 22, 2025 at 3:23 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Fri, Apr 11, 2025 at 1:03 PM Nisha Moond <nisha.moond412@gmail.com> wrote: > > > > > > Patch "v5_aprch_3-0001" implements the above Approach 3. > > > > Thanks Hou-san for implementing approach-2 and providing the patch. > > > > I find Approach 2 better than Approach1. Yet to review Approach3. > Please find my initial comments: > Thanks for the review! I’ve addressed the comments for Approach 2, as it seems the most suitable. We can revisit the others if needed. > > > Approach #2: > > 1) > + ReorderBufferSkipPrepare(reorder, parsed.twophase_xid); > > In xact_decode, why do we have a new call to ReorderBufferSkipPrepare, > can you please add some comments here? > Done. > 2) > + ereport(ERROR, > + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("\"%s\" and \"%s\" are mutually exclusive options when \"%s\" > is specified", > > So like approach 1, here as well we disallow creating subscriptions > with both failover and two_phase enabled when create_slot and > copy_data is true? But users can alter the sub later to allow failover > for two-phase enabled slot provided there are no pending PREPARE txns > which are not yet consumed by the subscriber. Is my understanding > correct? Can you please tell all the ways using which the user can > enable both failover and two_phase together? The patch msg is not that > clear. It will be good to add such details in patch message. > We allow creating subscriptions in this scenario as long as no prepared transactions exist before the two_phase_at. Similarly, altering a subscription to set failover = true is permitted, provided the slot's restart_lsn is greater than two_phase_at. Amit has suggested the ways at [1] in which users can enable both failover and two_phase together. I've updated the commit message to include more details about the conditions enforced by the fix. > 3) > > + /* > + * Do not allow users to enable the failover and two_phase options together > + * when create_slot is specified. > + * > + * See comments atop the similar check in DecodingContextFindStartpoint() > + * for a detailed reason. > + */ > + if (!opts.create_slot && opts.twophase && opts.failover) > + ereport(ERROR, > + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("\"%s\" and \"%s\" are mutually exclusive options when \"%s\" > is specified", > + > > Is the check '!opts.create_slot' correct? The error msg and comment > says otherwise. > Corrected the check as it should be checking if copy_data is specified. Thanks for the input! Please find the attached v6 patch for approach-2 fix on PG17. [1] https://www.postgresql.org/message-id/CAA4eK1LvMwXxvAzHpK%2BEgjc7vu1NmGxxKcaK_06pE7GKk7JtJQ%40mail.gmail.com -- Thanks, Nisha
Attachment
On Tue, Apr 22, 2025 at 3:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Mon, Apr 21, 2025 at 8:44 AM Zhijie Hou (Fujitsu) > <houzj.fnst@fujitsu.com> wrote: > > > > On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > > > > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > > > ---------- > > > > Approach 2 > > > > ---------- > > > > > > > > Instead of disallowing the use of two-phase and failover together, a more > > > > flexible strategy could be only restrict failover for slots with two-phase > > > > enabled when there's a possibility of existing prepared transactions before > > > the > > > > two_phase_at that are not yet replicated. During slot creation with > > > two-phase > > > > and failover, we could check for any decoded prepared transactions when > > > > determining the decoding start point (DecodingContextFindStartpoint). For > > > > subsequent attempts to alter failover to true, we ensure that two_phase_at is > > > > less than restart_lsn, indicating that all prepared transactions have been > > > > committed and replicated, thus the bug would not happen. > > > > > > > > pros: > > > > > > > > This method minimizes restrictions for users. Especially during slot creation > > > > with (two_phase=on, failover=on), as it’s uncommon for transactions to > > > prepare > > > > during consistent snapshot creation, the restriction becomes almost > > > > unnoticeable. > > > > > > I think this approach can work for the transactions that are prepared > > > while the slot is created. But if I understand the problem correctly, > > > while the initial table sync is performing, the slot's two_phase is > > > still false, so we need to deal with the transactions that are > > > prepared during the initial table sync too. What do you think? > > > > > > > Yes, I agree that we need to restrict this case too. Given that we haven't > > started decoding when setting two_phase=true during CreateDecodingContext() > > after tablesync, we could check prepared transactions afterwards during > > decoding. This could involve reporting an ERROR when skipping a prepared > > transaction during decoding if its prepare LSN is less than two_phase_at. > > > > It will make it difficult for users to detect it as this happens at a > later point of time. > > > Alternatively, a simpler method would be to prevent this situation entirely > > during the CREATE SUBSCRIPTION command. For example, we could restrict slots > > created with failover set to true and twophase is later modified to true after > > tablesync. Although the simpler check is more user-visible, it may offer less > > flexibility. > > > > I agree with your point, but OTOH, I am also afraid of adding too many > smart checks in the back-branch. If we follow what you say here, then > users have the following ways in PG17 to enable both failover and > two_phase. (a) During Create Subscription, users can set both > 'failover' and 'two_phase', if 'copy_data' is false, or (b), if > 'copy_data' is true, during Create Subscription, then users can enable > 'two_phase' and wait for it to be enabled. Then use Alter Subscription > to set 'failover'. Yet another idea would be to disallow enabling both two_phase and failover at CREATE SUBSCRIPTION regardless of copy_data value and to add check when enabling failover for the two_phase-enabled-slots. For example, users who want to enable both need to do two steps: 1. CREATE SUBSCRIPTION with two_phase = true and failover = false. 2. ALTER SUBSCRIPTION with failover = true. At ALTER SUBSCRIPTION with failover = true, the subscriber checks if the two_phase states is ready (and possibly check if the slot's two_phase has been enabled too), otherwise raises an ERROR. Then, when the publisher enables the failover for the two_phase-enabled-slot up on walrcv_alter_slot() request, it checks the slot's restart_lsn has passed slot's two_phase_at, otherwise raise an error with the message like "the slot need to consume change upto %X/%X to enable failover". Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Wed, Apr 23, 2025 at 11:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Tue, Apr 22, 2025 at 3:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Mon, Apr 21, 2025 at 8:44 AM Zhijie Hou (Fujitsu) > > <houzj.fnst@fujitsu.com> wrote: > > > > > > On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > > > > > > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > > > > > > ---------- > > > > > Approach 2 > > > > > ---------- > > > > > > > > > > Instead of disallowing the use of two-phase and failover together, a more > > > > > flexible strategy could be only restrict failover for slots with two-phase > > > > > enabled when there's a possibility of existing prepared transactions before > > > > the > > > > > two_phase_at that are not yet replicated. During slot creation with > > > > two-phase > > > > > and failover, we could check for any decoded prepared transactions when > > > > > determining the decoding start point (DecodingContextFindStartpoint). For > > > > > subsequent attempts to alter failover to true, we ensure that two_phase_at is > > > > > less than restart_lsn, indicating that all prepared transactions have been > > > > > committed and replicated, thus the bug would not happen. > > > > > > > > > > pros: > > > > > > > > > > This method minimizes restrictions for users. Especially during slot creation > > > > > with (two_phase=on, failover=on), as it’s uncommon for transactions to > > > > prepare > > > > > during consistent snapshot creation, the restriction becomes almost > > > > > unnoticeable. > > > > > > > > I think this approach can work for the transactions that are prepared > > > > while the slot is created. But if I understand the problem correctly, > > > > while the initial table sync is performing, the slot's two_phase is > > > > still false, so we need to deal with the transactions that are > > > > prepared during the initial table sync too. What do you think? > > > > > > > > > > Yes, I agree that we need to restrict this case too. Given that we haven't > > > started decoding when setting two_phase=true during CreateDecodingContext() > > > after tablesync, we could check prepared transactions afterwards during > > > decoding. This could involve reporting an ERROR when skipping a prepared > > > transaction during decoding if its prepare LSN is less than two_phase_at. > > > > > > > It will make it difficult for users to detect it as this happens at a > > later point of time. > > > > > Alternatively, a simpler method would be to prevent this situation entirely > > > during the CREATE SUBSCRIPTION command. For example, we could restrict slots > > > created with failover set to true and twophase is later modified to true after > > > tablesync. Although the simpler check is more user-visible, it may offer less > > > flexibility. > > > > > > > I agree with your point, but OTOH, I am also afraid of adding too many > > smart checks in the back-branch. If we follow what you say here, then > > users have the following ways in PG17 to enable both failover and > > two_phase. (a) During Create Subscription, users can set both > > 'failover' and 'two_phase', if 'copy_data' is false, or (b), if > > 'copy_data' is true, during Create Subscription, then users can enable > > 'two_phase' and wait for it to be enabled. Then use Alter Subscription > > to set 'failover'. > > Yet another idea would be to disallow enabling both two_phase and > failover at CREATE SUBSCRIPTION regardless of copy_data value and to > add check when enabling failover for the two_phase-enabled-slots. For > example, users who want to enable both need to do two steps: > > 1. CREATE SUBSCRIPTION with two_phase = true and failover = false. > 2. ALTER SUBSCRIPTION with failover = true. > > At ALTER SUBSCRIPTION with failover = true, the subscriber checks if > the two_phase states is ready (and possibly check if the slot's > two_phase has been enabled too), otherwise raises an ERROR. Then, when > the publisher enables the failover for the two_phase-enabled-slot up > on walrcv_alter_slot() request, it checks the slot's restart_lsn has > passed slot's two_phase_at, otherwise raise an error with the message > like "the slot need to consume change upto %X/%X to enable failover". > This should further simplify the checks with an additional restriction during the CREATE SUBSCRIPTION time. I am in favor of it because I want the code in this area to be as much same in HEAD and backbranches as possible. Please note that the PG17 also suffers from data loss after promotion due to a bug in fast-forward mode as described in email [1]. So, we should try to get that fixed as well. Thanks for your feedback. [1] - https://www.postgresql.org/message-id/OS0PR01MB57163087F86621D44D9A72BF94BB2%40OS0PR01MB5716.jpnprd01.prod.outlook.com -- With Regards, Amit Kapila.
On Thu, Apr 24, 2025 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Apr 23, 2025 at 11:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Tue, Apr 22, 2025 at 3:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Mon, Apr 21, 2025 at 8:44 AM Zhijie Hou (Fujitsu) > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > > > > > > > > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > > > > > > > > > ---------- > > > > > > Approach 2 > > > > > > ---------- > > > > > > > > > > > > Instead of disallowing the use of two-phase and failover together, a more > > > > > > flexible strategy could be only restrict failover for slots with two-phase > > > > > > enabled when there's a possibility of existing prepared transactions before > > > > > the > > > > > > two_phase_at that are not yet replicated. During slot creation with > > > > > two-phase > > > > > > and failover, we could check for any decoded prepared transactions when > > > > > > determining the decoding start point (DecodingContextFindStartpoint). For > > > > > > subsequent attempts to alter failover to true, we ensure that two_phase_at is > > > > > > less than restart_lsn, indicating that all prepared transactions have been > > > > > > committed and replicated, thus the bug would not happen. > > > > > > > > > > > > pros: > > > > > > > > > > > > This method minimizes restrictions for users. Especially during slot creation > > > > > > with (two_phase=on, failover=on), as it’s uncommon for transactions to > > > > > prepare > > > > > > during consistent snapshot creation, the restriction becomes almost > > > > > > unnoticeable. > > > > > > > > > > I think this approach can work for the transactions that are prepared > > > > > while the slot is created. But if I understand the problem correctly, > > > > > while the initial table sync is performing, the slot's two_phase is > > > > > still false, so we need to deal with the transactions that are > > > > > prepared during the initial table sync too. What do you think? > > > > > > > > > > > > > Yes, I agree that we need to restrict this case too. Given that we haven't > > > > started decoding when setting two_phase=true during CreateDecodingContext() > > > > after tablesync, we could check prepared transactions afterwards during > > > > decoding. This could involve reporting an ERROR when skipping a prepared > > > > transaction during decoding if its prepare LSN is less than two_phase_at. > > > > > > > > > > It will make it difficult for users to detect it as this happens at a > > > later point of time. > > > > > > > Alternatively, a simpler method would be to prevent this situation entirely > > > > during the CREATE SUBSCRIPTION command. For example, we could restrict slots > > > > created with failover set to true and twophase is later modified to true after > > > > tablesync. Although the simpler check is more user-visible, it may offer less > > > > flexibility. > > > > > > > > > > I agree with your point, but OTOH, I am also afraid of adding too many > > > smart checks in the back-branch. If we follow what you say here, then > > > users have the following ways in PG17 to enable both failover and > > > two_phase. (a) During Create Subscription, users can set both > > > 'failover' and 'two_phase', if 'copy_data' is false, or (b), if > > > 'copy_data' is true, during Create Subscription, then users can enable > > > 'two_phase' and wait for it to be enabled. Then use Alter Subscription > > > to set 'failover'. > > > > Yet another idea would be to disallow enabling both two_phase and > > failover at CREATE SUBSCRIPTION regardless of copy_data value and to > > add check when enabling failover for the two_phase-enabled-slots. For > > example, users who want to enable both need to do two steps: > > > > 1. CREATE SUBSCRIPTION with two_phase = true and failover = false. > > 2. ALTER SUBSCRIPTION with failover = true. > > > > At ALTER SUBSCRIPTION with failover = true, the subscriber checks if > > the two_phase states is ready (and possibly check if the slot's > > two_phase has been enabled too), otherwise raises an ERROR. Then, when > > the publisher enables the failover for the two_phase-enabled-slot up > > on walrcv_alter_slot() request, it checks the slot's restart_lsn has > > passed slot's two_phase_at, otherwise raise an error with the message > > like "the slot need to consume change upto %X/%X to enable failover". > > > > This should further simplify the checks with an additional restriction > during the CREATE SUBSCRIPTION time. I am in favor of it because I > want the code in this area to be as much same in HEAD and backbranches > as possible. > Please find the updated patch for Approach 3, which implements the idea suggested by Swada-san above. -- Thanks, Nisha
Attachment
On Thu, Apr 24, 2025 at 2:54 PM Nisha Moond <nisha.moond412@gmail.com> wrote: > > Please find the updated patch for Approach 3, which implements the > idea suggested by Swada-san above. > Thank You for the patch. 1) CreateDecodingContext: if (ctx->twophase && !slot->data.two_phase) { + /* + * Do not allow two-phase decoding for failover enabled slots. + * + * See comments atop the similar check in ReplicationSlotCreate() for + * a detailed reason. + */ + if (slot->data.failover) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot enable two-phase decoding for failover enabled slot \"%s\"", + NameStr(slot->data.name)))); + Can you please add tetscase to cover this scenario? 2) We shall update create-sub documents as well for these mutually exclusive options. Review other pages (alter-sub, create-slot) as well for any required change. 3) +################################################## +# Test that the failover option can be enabled for a two_phase enabled +# subscription. +################################################## Suggestion: 'Test that the failover option can be enabled for a two_phase enabled subscription only through Alter Subscription (failover=true)' thanks Shveta
On Thu, Apr 24, 2025 at 2:24 AM Nisha Moond <nisha.moond412@gmail.com> wrote: > > On Thu, Apr 24, 2025 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Apr 23, 2025 at 11:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > On Tue, Apr 22, 2025 at 3:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > On Mon, Apr 21, 2025 at 8:44 AM Zhijie Hou (Fujitsu) > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > > > > > > > > > > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > > > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > > > > > > > > > > > > ---------- > > > > > > > Approach 2 > > > > > > > ---------- > > > > > > > > > > > > > > Instead of disallowing the use of two-phase and failover together, a more > > > > > > > flexible strategy could be only restrict failover for slots with two-phase > > > > > > > enabled when there's a possibility of existing prepared transactions before > > > > > > the > > > > > > > two_phase_at that are not yet replicated. During slot creation with > > > > > > two-phase > > > > > > > and failover, we could check for any decoded prepared transactions when > > > > > > > determining the decoding start point (DecodingContextFindStartpoint). For > > > > > > > subsequent attempts to alter failover to true, we ensure that two_phase_at is > > > > > > > less than restart_lsn, indicating that all prepared transactions have been > > > > > > > committed and replicated, thus the bug would not happen. > > > > > > > > > > > > > > pros: > > > > > > > > > > > > > > This method minimizes restrictions for users. Especially during slot creation > > > > > > > with (two_phase=on, failover=on), as it’s uncommon for transactions to > > > > > > prepare > > > > > > > during consistent snapshot creation, the restriction becomes almost > > > > > > > unnoticeable. > > > > > > > > > > > > I think this approach can work for the transactions that are prepared > > > > > > while the slot is created. But if I understand the problem correctly, > > > > > > while the initial table sync is performing, the slot's two_phase is > > > > > > still false, so we need to deal with the transactions that are > > > > > > prepared during the initial table sync too. What do you think? > > > > > > > > > > > > > > > > Yes, I agree that we need to restrict this case too. Given that we haven't > > > > > started decoding when setting two_phase=true during CreateDecodingContext() > > > > > after tablesync, we could check prepared transactions afterwards during > > > > > decoding. This could involve reporting an ERROR when skipping a prepared > > > > > transaction during decoding if its prepare LSN is less than two_phase_at. > > > > > > > > > > > > > It will make it difficult for users to detect it as this happens at a > > > > later point of time. > > > > > > > > > Alternatively, a simpler method would be to prevent this situation entirely > > > > > during the CREATE SUBSCRIPTION command. For example, we could restrict slots > > > > > created with failover set to true and twophase is later modified to true after > > > > > tablesync. Although the simpler check is more user-visible, it may offer less > > > > > flexibility. > > > > > > > > > > > > > I agree with your point, but OTOH, I am also afraid of adding too many > > > > smart checks in the back-branch. If we follow what you say here, then > > > > users have the following ways in PG17 to enable both failover and > > > > two_phase. (a) During Create Subscription, users can set both > > > > 'failover' and 'two_phase', if 'copy_data' is false, or (b), if > > > > 'copy_data' is true, during Create Subscription, then users can enable > > > > 'two_phase' and wait for it to be enabled. Then use Alter Subscription > > > > to set 'failover'. > > > > > > Yet another idea would be to disallow enabling both two_phase and > > > failover at CREATE SUBSCRIPTION regardless of copy_data value and to > > > add check when enabling failover for the two_phase-enabled-slots. For > > > example, users who want to enable both need to do two steps: > > > > > > 1. CREATE SUBSCRIPTION with two_phase = true and failover = false. > > > 2. ALTER SUBSCRIPTION with failover = true. > > > > > > At ALTER SUBSCRIPTION with failover = true, the subscriber checks if > > > the two_phase states is ready (and possibly check if the slot's > > > two_phase has been enabled too), otherwise raises an ERROR. Then, when > > > the publisher enables the failover for the two_phase-enabled-slot up > > > on walrcv_alter_slot() request, it checks the slot's restart_lsn has > > > passed slot's two_phase_at, otherwise raise an error with the message > > > like "the slot need to consume change upto %X/%X to enable failover". > > > > > > > This should further simplify the checks with an additional restriction > > during the CREATE SUBSCRIPTION time. I am in favor of it because I > > want the code in this area to be as much same in HEAD and backbranches > > as possible. > > > > Please find the updated patch for Approach 3, which implements the > idea suggested by Swada-san above. Thank you for the patch! I think we need to update the documentation as well. I'll also review the patch shortly so could you please prepare the documentation changes? Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Thu, Apr 24, 2025 at 10:48 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Thu, Apr 24, 2025 at 2:24 AM Nisha Moond <nisha.moond412@gmail.com> wrote: > > > > On Thu, Apr 24, 2025 at 12:28 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > On Wed, Apr 23, 2025 at 11:04 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > > > > > On Tue, Apr 22, 2025 at 3:00 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > > > > > > > On Mon, Apr 21, 2025 at 8:44 AM Zhijie Hou (Fujitsu) > > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > > > On Sat, Apr 19, 2025 at 2:19 AM Masahiko Sawada wrote: > > > > > > > > > > > > > > On Tue, Apr 8, 2025 at 10:14 PM Zhijie Hou (Fujitsu) > > > > > > > <houzj.fnst@fujitsu.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > ---------- > > > > > > > > Approach 2 > > > > > > > > ---------- > > > > > > > > > > > > > > > > Instead of disallowing the use of two-phase and failover together, a more > > > > > > > > flexible strategy could be only restrict failover for slots with two-phase > > > > > > > > enabled when there's a possibility of existing prepared transactions before > > > > > > > the > > > > > > > > two_phase_at that are not yet replicated. During slot creation with > > > > > > > two-phase > > > > > > > > and failover, we could check for any decoded prepared transactions when > > > > > > > > determining the decoding start point (DecodingContextFindStartpoint). For > > > > > > > > subsequent attempts to alter failover to true, we ensure that two_phase_at is > > > > > > > > less than restart_lsn, indicating that all prepared transactions have been > > > > > > > > committed and replicated, thus the bug would not happen. > > > > > > > > > > > > > > > > pros: > > > > > > > > > > > > > > > > This method minimizes restrictions for users. Especially during slot creation > > > > > > > > with (two_phase=on, failover=on), as it’s uncommon for transactions to > > > > > > > prepare > > > > > > > > during consistent snapshot creation, the restriction becomes almost > > > > > > > > unnoticeable. > > > > > > > > > > > > > > I think this approach can work for the transactions that are prepared > > > > > > > while the slot is created. But if I understand the problem correctly, > > > > > > > while the initial table sync is performing, the slot's two_phase is > > > > > > > still false, so we need to deal with the transactions that are > > > > > > > prepared during the initial table sync too. What do you think? > > > > > > > > > > > > > > > > > > > Yes, I agree that we need to restrict this case too. Given that we haven't > > > > > > started decoding when setting two_phase=true during CreateDecodingContext() > > > > > > after tablesync, we could check prepared transactions afterwards during > > > > > > decoding. This could involve reporting an ERROR when skipping a prepared > > > > > > transaction during decoding if its prepare LSN is less than two_phase_at. > > > > > > > > > > > > > > > > It will make it difficult for users to detect it as this happens at a > > > > > later point of time. > > > > > > > > > > > Alternatively, a simpler method would be to prevent this situation entirely > > > > > > during the CREATE SUBSCRIPTION command. For example, we could restrict slots > > > > > > created with failover set to true and twophase is later modified to true after > > > > > > tablesync. Although the simpler check is more user-visible, it may offer less > > > > > > flexibility. > > > > > > > > > > > > > > > > I agree with your point, but OTOH, I am also afraid of adding too many > > > > > smart checks in the back-branch. If we follow what you say here, then > > > > > users have the following ways in PG17 to enable both failover and > > > > > two_phase. (a) During Create Subscription, users can set both > > > > > 'failover' and 'two_phase', if 'copy_data' is false, or (b), if > > > > > 'copy_data' is true, during Create Subscription, then users can enable > > > > > 'two_phase' and wait for it to be enabled. Then use Alter Subscription > > > > > to set 'failover'. > > > > > > > > Yet another idea would be to disallow enabling both two_phase and > > > > failover at CREATE SUBSCRIPTION regardless of copy_data value and to > > > > add check when enabling failover for the two_phase-enabled-slots. For > > > > example, users who want to enable both need to do two steps: > > > > > > > > 1. CREATE SUBSCRIPTION with two_phase = true and failover = false. > > > > 2. ALTER SUBSCRIPTION with failover = true. > > > > > > > > At ALTER SUBSCRIPTION with failover = true, the subscriber checks if > > > > the two_phase states is ready (and possibly check if the slot's > > > > two_phase has been enabled too), otherwise raises an ERROR. Then, when > > > > the publisher enables the failover for the two_phase-enabled-slot up > > > > on walrcv_alter_slot() request, it checks the slot's restart_lsn has > > > > passed slot's two_phase_at, otherwise raise an error with the message > > > > like "the slot need to consume change upto %X/%X to enable failover". > > > > > > > > > > This should further simplify the checks with an additional restriction > > > during the CREATE SUBSCRIPTION time. I am in favor of it because I > > > want the code in this area to be as much same in HEAD and backbranches > > > as possible. > > > > > > > Please find the updated patch for Approach 3, which implements the > > idea suggested by Swada-san above. > > Thank you for the patch! I think we need to update the documentation > as well. I'll also review the patch shortly so could you please > prepare the documentation changes? Here are reviews comments on v7 patch: + /* + * Do not allow users to enable the failover and two_phase options + * together. + * + * See comments atop the similar check in ReplicationSlotCreate() for a + * detailed reason. + */ + if (opts.twophase && opts.failover) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("\"%s\" and \"%s\" are mutually exclusive options", + "failover", "two_phase")); I think we can reword the error message to something like "cannot enable both failover and two_phase options at CREATE SUBSCRIPTION", and we can give a errhint that failover can be enabled after two_phase state is ready. --- + if (two_phase && !IsSyncingReplicationSlots()) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("\"%s\" and \"%s\" are mutually exclusive options", + "failover", "two_phase")); Similar to the above comment, how about rewriting it to something like "cannot enable both failover and two_phase during replication slot creation". I realized that users who create a logical slot using pg_create_logical_replication_slot() would not be able to enable both options at slot creation, and there is no easy way to enable the failover after two_phase-enabled-slot creation. Users would need to use ALTER_REPLICATION_SLOT replication command, which seems unrealistics for users to use. On the other hand, if we allow creating a logical slot with enabling failover and two_phase using SQL API, there is still a chance for this bug to occur. Would it be worth considering that if a logical slot is created with enabling failover and two_phase using SQL API, we create the slot with only two_phase=true, then advance the slot until the slot satisfies restart_lsn >= two_phase_at, and then enable the failover? --- + if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING && + opts.failover) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot enable failover for a subscription with a pending two_phase state")); + I think ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE is a more appropriate error code. --- + if (failover && MyReplicationSlot->data.two_phase && + MyReplicationSlot->data.restart_lsn < MyReplicationSlot->data.two_phase_at) + ereport(ERROR, + errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot enable failover for a two-phase enabled replication slot"), + errdetail("The slot need to consume change upto %X/%X to enable failover.", + LSN_FORMAT_ARGS(MyReplicationSlot->data.two_phase_at))); Same as above comment with regards to the error code. --- We need regression tests for replication slot creation SQL API too. --- +# Confirm that the failover flag on the slot has been turned on +is( $publisher->safe_psql( + 'postgres', + q{SELECT failover from pg_replication_slots WHERE slot_name = 'lsub1_slot';} + ), + "t", + 'logical slot has failover true on the publisher'); It should check 'lsub2_slot' instead. --- +# Create a subscription with two_phase enabled +$subscriber1->safe_psql('postgres', + "CREATE SUBSCRIPTION regress_mysub2 CONNECTION '$publisher_connstr' PUBLICATION regress_mypub WITH (create_slot = false, copy_data = false, enabled = false, two_phase = true);" +); + +# Enable failover for the subscription with two_phase in pending state +($result, $stdout, $stderr) = $subscriber1->psql('postgres', + "ALTER SUBSCRIPTION regress_mysub2 SET (failover = true)"); +ok( $stderr =~ /ERROR: cannot enable failover for a subscription with a pending two_phase state/, + "Enabling failover is not allowed for a two_phase pending subscription"); + I think it would be good to have more tests where we make the two_phase state ready and execute 'ALTER SUBSCRIPTION SET (failover = true)' successfully. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
On Fri, Apr 25, 2025 at 6:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > I realized that users who create a logical slot using > pg_create_logical_replication_slot() would not be able to enable both > options at slot creation, and there is no easy way to enable the > failover after two_phase-enabled-slot creation. Users would need to > use ALTER_REPLICATION_SLOT replication command, which seems > unrealistics for users to use. On the other hand, if we allow creating > a logical slot with enabling failover and two_phase using SQL API, > there is still a chance for this bug to occur. Would it be worth > considering that if a logical slot is created with enabling failover > and two_phase using SQL API, we create the slot with only > two_phase=true, then advance the slot until the slot satisfies > restart_lsn >= two_phase_at, and then enable the failover? > This means we either need to maintain somewhere that user has provided failover flag till restart_lsn >= two_phase_at or and then set failover flag in the slot or initially mark it but enable the functionality of failover when we reach the condition restart_lsn >= two_phase_at. Both seem to have different kinds of problems. The first idea seems to have an issue with persistence, which means we can lose track of the flag after the restart. The second can mislead the user for a long period in cases where prepare and commit have a large time gap. I feel this will introduce complexity either in the form of code or in giving the information to the user. -- With Regards, Amit Kapila.
On Thu, Apr 24, 2025 at 3:57 PM shveta malik <shveta.malik@gmail.com> wrote: > > On Thu, Apr 24, 2025 at 2:54 PM Nisha Moond <nisha.moond412@gmail.com> wrote: > > > > Please find the updated patch for Approach 3, which implements the > > idea suggested by Swada-san above. > > > > Thank You for the patch. > > 1) > > CreateDecodingContext: > > if (ctx->twophase && !slot->data.two_phase) > { > + /* > + * Do not allow two-phase decoding for failover enabled slots. > + * > + * See comments atop the similar check in ReplicationSlotCreate() for > + * a detailed reason. > + */ > + if (slot->data.failover) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot enable two-phase decoding for failover enabled slot \"%s\"", > + NameStr(slot->data.name)))); > + > > Can you please add tetscase to cover this scenario? > Added a test for the above scenario. > 2) > We shall update create-sub documents as well for these mutually > exclusive options. Review other pages (alter-sub, create-slot) as well > for any required change. > Updated docs where I felt this new change should be mentioned. Please let me know if I missed any place. > 3) > +################################################## > +# Test that the failover option can be enabled for a two_phase enabled > +# subscription. > +################################################## > > Suggestion: 'Test that the failover option can be enabled for a two_phase > enabled subscription only through Alter Subscription (failover=true)' > > Done. ~~~ Please find the attached v8 patch with above comments addressed. This version includes the documentation updates suggested by Sawada-san at [1], and incorporates his feedback from [2] as well. [1] https://www.postgresql.org/message-id/CAD21AoD08Nb588fAJ%2BJd5xRxtAT8yWnOfZ3zq-K5sto9b3ntsA%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAD21AoD08Nb588fAJ%2BJd5xRxtAT8yWnOfZ3zq-K5sto9b3ntsA%40mail.gmail.com -- Thanks, Nisha
Attachment
On Fri, Apr 25, 2025 at 3:43 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Fri, Apr 25, 2025 at 6:02 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > I realized that users who create a logical slot using > > pg_create_logical_replication_slot() would not be able to enable both > > options at slot creation, and there is no easy way to enable the > > failover after two_phase-enabled-slot creation. Users would need to > > use ALTER_REPLICATION_SLOT replication command, which seems > > unrealistics for users to use. On the other hand, if we allow creating > > a logical slot with enabling failover and two_phase using SQL API, > > there is still a chance for this bug to occur. Would it be worth > > considering that if a logical slot is created with enabling failover > > and two_phase using SQL API, we create the slot with only > > two_phase=true, then advance the slot until the slot satisfies > > restart_lsn >= two_phase_at, and then enable the failover? > > > > This means we either need to maintain somewhere that user has provided > failover flag till restart_lsn >= two_phase_at or and then set > failover flag in the slot I was thinking of this idea. > or initially mark it but enable the > functionality of failover when we reach the condition restart_lsn >= > two_phase_at. IIUC the slot could be synchronized to the standby as soon as we complete DecodingContextFindStartpoint() for a failover-enabled slot. So we would need some mechanisms to make sure that the slot is not synchronized while we're waiting to reach the condition restart_lsn >= two_phase_at even if the failover is enabled. > Both seem to have different kinds of problems. The first > idea seems to have an issue with persistence, which means we can lose > track of the flag after the restart. I think we can do this series of operations while the slot is not persistent, that is the slot is still RS_EPHEMERAL. > The second can mislead the user > for a long period in cases where prepare and commit have a large time > gap. I feel this will introduce complexity either in the form of code > or in giving the information to the user. Agreed. Both ways introduce complexity so we need to consider the user-unfriendliness (by not having a proper way to enable failover for the two_phase-enabled-slot using SQL API) vs. risk (of introducing complexity). Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com