Re: Fix slot synchronization with two_phase decoding enabled - Mailing list pgsql-hackers

From Masahiko Sawada
Subject Re: Fix slot synchronization with two_phase decoding enabled
Date
Msg-id CAD21AoCjSa6hF2MH-dKPb3fBvX9YYwPF1kdJa4M=knwqGSiYgw@mail.gmail.com
Whole thread Raw
In response to Re: Fix slot synchronization with two_phase decoding enabled  (Nisha Moond <nisha.moond412@gmail.com>)
List pgsql-hackers
On Tue, Apr 29, 2025 at 5:00 AM Nisha Moond <nisha.moond412@gmail.com> wrote:
>
> On Tue, Apr 29, 2025 at 2:51 PM shveta malik <shveta.malik@gmail.com> wrote:
> >
> > On Mon, Apr 28, 2025 at 4:33 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
> > >
> > > Please find the v9 patch, addressing the above and all other comments as well.
> > >
> >
> > Thanks for the patch.
> >
> > 1)
> >
> > +          The default is false. This option cannot be set together with
> > +          <literal>two_phase</literal> when creating the slot. However, once
> > +          the slot is created with <literal>two_phase=true</literal>, the
> > +          <literal>failover</literal> can be set to true after the
> > +          <literal>restart_lsn</literal> has advanced beyond its
> > +          <literal>two_phase_at</literal> value
> >
> > As the user is not aware of two_phase_at, we shall change all the docs
> > to have a more user friendly explanation.
> >
>
> Done.
>
> > 2)
> >  # Confirm that the invalidated slot has been dropped.
> >  $standby1->wait_for_log(
> > - qr/dropped replication slot "lsub1_slot" of database with OID
> > [0-9]+/, $log_offset);
> > + qr/dropped replication slot "lsub1_slot" of database with OID [0-9]+/,
> > + $log_offset);
> >
> > I think this change is not needed.
> >
>
> The perltidy is auto-correcting it now, though it didn't for earlier
> patches. I've kept the original for now.
> ~~~
>
> Thanks for review, please find the updated patch v10.

Thank you for updating the patch! Here are some comments on v10.

We refer to the following comment in ReplicationSlotCreate() from many places:

/*
 * Do not allow users to enable both failover and two_phase for slots.
 *
 * This is because the two_phase_at field of a slot, which tracks the
 * LSN, from which two-phase decoding starts, is not synchronized to
 * standby servers. Without two_phase_at, the logical decoding might
 * incorrectly identify prepared transaction as already replicated to
 * the subscriber after promotion of standby server, causing them to
 * be skipped.
 *
 * However, both failover and two_phase enabled slots can be created
 * during slot synchronization because we need to retain the same
 * values as the remote slot.
 */

I think this is an important implementation restriction, so how about
describing the overall idea of combination of two_phase and failover
in the comment atop slotsync.c or worker.c? It would be easier to
find.

---
+   if (failover && MyReplicationSlot->data.two_phase &&
+       MyReplicationSlot->data.restart_lsn <
MyReplicationSlot->data.two_phase_at)
+       ereport(ERROR,
+               errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+               errmsg("cannot enable failover for a two-phase enabled
replication slot"),
+               errdetail("The slot need to consume change upto %X/%X
to enable failover.",
+
LSN_FORMAT_ARGS(MyReplicationSlot->data.two_phase_at)));

I think it's better to mention the reason in the error message why we
cannot enable two-phase in this situation. How about:

-               errmsg("cannot enable failover for a two-phase enabled
replication slot"),
+               errmsg("cannot enable failover for a two-phase enabled
replication slot due to unconsumed changes"),

---
+       ereport(ERROR,
+               errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+               errmsg("cannot enable both \"%s\" and \"%s\" options
at CREATE SUBSCRIPTION",
+                      "failover", "two_phase"),
+               errhint("The \"%s\" option can be enabled after \"%s\"
state is ready using ALTER SUBSCRIPTION.",
+                       "failover", "two_phase"));

How about rewriting the errhint message to:

-               errhint("The \"%s\" option can be enabled after \"%s\"
state is ready using ALTER SUBSCRIPTION.",
-                       "failover", "two_phase"));
+               errhint("Use ALTER SUBSCRIPTION ... SET (failover =
true) to enable \"%s\" after two_phase state is ready",
+                       "failover"));

---
@@ -29073,9 +29073,13 @@ postgres=# SELECT '0/0'::pg_lsn +
pd.segment_number * ps.setting::int + :offset
         <parameter>failover</parameter>, when set to true,
         specifies that this slot is enabled to be synced to the
         standbys so that logical replication can be resumed after
-        failover. A call to this function has the same effect as
-        the replication protocol command
-        <literal>CREATE_REPLICATION_SLOT ... LOGICAL</literal>.
+        failover. The parameters <parameter>twophase</parameter> and
+        <parameter>failover</parameter> cannot be enabled together when
+        creating a replication slot. However, a slot created with
<parameter>twophase</parameter>
+        enabled can later have <parameter>failover</parameter> set to true via
+        <command>ALTER SUBSCRIPTION</command>, if a subscription is using this
+        slot. A call to this function has the same effect as the replication
+        protocol command <literal>CREATE_REPLICATION_SLOT ...
LOGICAL</literal>.
        </para></entry>
       </row>

Can we also mention ALTER_REPLICATION to enable the failover for slots
that are not used by subscriptions?

---
+# Also wait for two-phase to be enabled
+$subscriber1->poll_query_until(
+   'postgres', qq[
+   SELECT count(1) = 0 FROM pg_subscription WHERE subname =
'regress_mysub2' and subtwophasestate NOT IN ('e');]
+) or die "Timed out while waiting for subscriber to enable twophase";

Since we check the twophasestate of the specific slot, how about doing
something like:

+ok( $subscriber1->poll_query_until(
+   'postgres',
+       qq[SELECT subtwophasestate FROM pg_subscription WHERE subname
= 'regress_mysub2';],
+   'e',
+    ),
+    "Timed out while waiting for subscriber to enable twophase");

---
+
+# Also wait for two-phase to be enabled
+$subscriber1->poll_query_until(
+   'postgres', qq[
+   SELECT count(1) = 0 FROM pg_subscription WHERE subname =
'regress_mysub2' and subtwophasestate NOT IN ('e');]
+) or die "Timed out while waiting for subscriber to enable twophase";
+
+# Try to enable the failover for the subscription, should give error
+($result, $stdout, $stderr) = $subscriber1->psql(
+   'postgres',
+   "ALTER SUBSCRIPTION regress_mysub2 DISABLE;
+    ALTER SUBSCRIPTION regress_mysub2 SET (failover = true);");
+ok( $stderr =~
+     qr/ERROR:  could not alter replication slot "lsub2_slot": ERROR:
 cannot enable failover for a two-phase enabled replication slot/,
+   "failover can't be enabled if restart_lsn < two_phase_at on a
two_phase subscription."
+);

I think it's possible between two tests that the walsender consumes
some changes (e.g. generated by autovacuums) then the second check
fails (i.e., ALTER SUBSCRIPTION SET (failover = ture) succeeds).

---
+# Test that SQL API disallows slot creation with both two_phase and
failover enabled
+($result, $stdout, $stderr) = $publisher->psql('postgres',
+   "SELECT pg_create_logical_replication_slot('regress_mysub3',
'pgoutput', false, true, true);"
+);
+ok( $stderr =~
+     /ERROR:  cannot enable both "failover" and "two_phase" options
during replication slot creation/,
+   "cannot create slot with both two_phase and failover enabled");

Isn't this test already covered by test_decoding/sql/slot.sql?

I've attached a patch that contains comment changes I mentioned above.
Please incorporate it if you agree with them.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachment

pgsql-hackers by date:

Previous
From: Michael Paquier
Date:
Subject: Re: Sequence Access Methods, round two
Next
From: Michael Paquier
Date:
Subject: Re: add --sequence-data to pg_dumpall