Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From Amit Kapila
Subject Re: Synchronizing slots from primary to standby
Date
Msg-id CAA4eK1J6BqO5=ueFAQO+aYyHLaU-oCHrrVFJqHS-i0Ce9aPY2w@mail.gmail.com
Whole thread Raw
In response to Re: Synchronizing slots from primary to standby  (shveta malik <shveta.malik@gmail.com>)
Responses Re: Synchronizing slots from primary to standby
Re: Synchronizing slots from primary to standby
List pgsql-hackers
On Tue, Oct 17, 2023 at 2:01 PM shveta malik <shveta.malik@gmail.com> wrote:
>
> On Tue, Oct 17, 2023 at 12:44 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > FYI - the latest patch failed to apply.
> >
> > [postgres@CentOS7-x64 oss_postgres_misc]$ git apply
> > ../patches_misc/v24-0001-Allow-logical-walsenders-to-wait-for-the-physica.patch
> > error: patch failed: src/include/utils/guc_hooks.h:160
> > error: src/include/utils/guc_hooks.h: patch does not apply
>
> Rebased v24. PFA.
>

Few comments:
==============
1.
+        List of physical replication slots that logical replication
with failover
+        enabled waits for.

/logical replication/logical replication slots

2.
 If
+        <varname>enable_syncslot</varname> is not enabled on the
+        corresponding standbys, then it may result in indefinite waiting
+        on the primary for physical replication slots configured in
+        <varname>standby_slot_names</varname>
+       </para>

Why the above leads to indefinite wait? I think we should just ignore
standby_slot_names and probably LOG a message in the server for the
same.

3.
+++ b/src/backend/replication/logical/tablesync.c
@@ -1412,7 +1412,8 @@ LogicalRepSyncTableStart(XLogRecPtr *origin_startpos)
  */
  walrcv_create_slot(LogRepWorkerWalRcvConn,
     slotname, false /* permanent */ , false /* two_phase */ ,
-    CRS_USE_SNAPSHOT, origin_startpos);
+    false /* enable_failover */ , CRS_USE_SNAPSHOT,
+    origin_startpos);

As per this code, we won't enable failover for tablesync slots. So,
what happens if we need to failover to new node after the tablesync
worker has reached SUBREL_STATE_FINISHEDCOPY or SUBREL_STATE_DATASYNC?
I think we won't be able to continue replication from failed over
node. If this theory is correct, we have two options (a) enable
failover for sync slots as well, if it is enabled for main slot; but
then after we drop the slot on primary once sync is complete, same
needs to be taken care at standby. (b) enable failover even for the
main slot after all tables are in ready state, something similar to
what we do for two_phase.

4.
+ /* Verify syntax */
+ if (!validate_slot_names(newval, &elemlist))
+ return false;
+
+ /* Now verify if these really exist and have correct type */
+ if (!validate_standby_slots(elemlist))

These two functions serve quite similar functionality which makes
their naming quite confusing. Can we directly move the functionality
of validate_slot_names() into validate_standby_slots()?

5.
+SlotSyncInitConfig(void)
+{
+ char    *rawname;
+
+ /* Free the old one */
+ list_free(standby_slot_names_list);
+ standby_slot_names_list = NIL;
+
+ if (strcmp(standby_slot_names, "") != 0)
+ {
+ rawname = pstrdup(standby_slot_names);
+ SplitIdentifierString(rawname, ',', &standby_slot_names_list);

How does this handle the case where '*' is specified for standby_slot_names?


--
With Regards,
Amit Kapila.



pgsql-hackers by date:

Previous
From: Ashutosh Bapat
Date:
Subject: Re: BRIN minmax multi - incorrect distance for infinite timestamp/date
Next
From: Dean Rasheed
Date:
Subject: Re: BRIN minmax multi - incorrect distance for infinite timestamp/date