Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Synchronizing slots from primary to standby
Date
Msg-id ZdhvbqwT7789uDVt@ip-10-97-1-34.eu-west-3.compute.internal
Whole thread Raw
In response to RE: Synchronizing slots from primary to standby  ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>)
Responses RE: Synchronizing slots from primary to standby
List pgsql-hackers
Hi,

On Fri, Feb 23, 2024 at 04:36:44AM +0000, Zhijie Hou (Fujitsu) wrote:
> On Friday, February 23, 2024 10:18 AM Zhijie Hou (Fujitsu) <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Hi,
> > >
> > > Since the slotsync worker patch has been committed, I rebased the
> > > remaining patches.
> > > And here is the V95 patch set.
> > >
> > > Also, I fixed a bug in the current 0001 patch where the member of the
> > > standby slot names list pointed to the freed memory after calling
> > ProcessConfigFile().
> > > Now, we will obtain a new list when we call ProcessConfigFile(). The
> > > optimization to only get the new list when the names actually change
> > > has been removed. I think this change is acceptable because
> > > ProcessConfigFile is not a frequent occurrence.
> > >
> > > Additionally, I reordered the tests in
> > > 040_standby_failover_slots_sync.pl. Now the new test will be conducted
> > > after the sync slot test to prevent the risk of the logical slot
> > > occasionally not catching up to the latest catalog_xmin and, as a result, not
> > being able to be synced immediately.
> > 
> > There is one unexpected change in the previous version, sorry for that.
> > Here is the correct version.
> 
> I noticed one CFbot failure[1] which is because the tap-test doesn't wait for the
> standby to catch up before promoting, thus the data inserted after promotion
> could not be replicated to the subscriber. Add a wait_for_replay_catchup to fix it.
> 
> Apart from this, I also adjusted some variable names in the tap-test to be
> consistent. And added back a mis-removed ProcessConfigFile call.

Thanks!

Here are some random comments:

1 ===

Commit message "Allow logical walsenders to wait for the physical" 

s/physical/physical standby/?

2 ==

+++ b/src/backend/replication/logical/logicalfuncs.c
@@ -30,6 +30,7 @@
 #include "replication/decode.h"
 #include "replication/logical.h"
 #include "replication/message.h"
+#include "replication/walsender.h"

Is this include needed?

3 ===

+        * Slot sync is currently not supported on the cascading standby. This is

s/on the/on a/?

4 ===

+       if (!ok)
+               GUC_check_errdetail("List syntax is invalid.");
+
+       /*
+        * If there is a syntax error in the name or if the replication slots'
+        * data is not initialized yet (i.e., we are in the startup process), skip
+        * the slot verification.
+        */
+       if (!ok || !ReplicationSlotCtl)
+       {
+               pfree(rawname);
+               list_free(elemlist);
+               return ok;
+       }

we are testing the "ok" value twice, what about using if...else if... instead
and test it once? If so, it might be worth to put the:

"
+       pfree(rawname);
+       list_free(elemlist);
+       return ok;
"

in a "goto".

5 ===

+        * for which all standbys to wait for. Even if we have physical-slots

s/physical-slots/physical slots/?

6 ===

        * Switch to the same memory context under which GUC variables are

s/to the same memory/to the memory/?

7 ===

+ * Return a copy of standby_slot_names_list if the copy flag is set to true,

Not sure, but would it be worth explaining why one would want to set to flag to
true or false? (i.e why one would not want to receive the original list).

8 ===

+       if (RecoveryInProgress())
+               return NIL;

The need is well documented just above, but are we not violating the fact that
we return the original list or a copy of it? (that's what the comment above
the GetStandbySlotList() function definition is saying).

I think the comment above the GetStandbySlotList() function needs a bit of
rewording to cover that case.

9 ===

+                        * harmless, a WARNING should be enough, no need to error-out.

s/error-out/error out/?

10 ===

+                       if (slot->data.invalidated != RS_INVAL_NONE)
+                       {
+                               /*
+                                * Specified physical slot have been invalidated, so no point
+                                * in waiting for it.

We discovered in [1], that if the wal_status is "unreserved" then the slot is 
still serving the standby. I think we should handle this case differently,
thoughts?

11 ===

+                                * Specified physical slot have been invalidated, so no point

s/have been/has been/?

12 ===

+++ b/src/backend/replication/slotfuncs.c
@@ -22,6 +22,7 @@
 #include "replication/logical.h"
 #include "replication/slot.h"
 #include "replication/slotsync.h"
+#include "replication/walsender.h"

Is this include needed?

[1]: https://www.postgresql.org/message-id/CALj2ACWE9asmvN1B18LqfHE8uBuWGsCEP7OO5trRCxPtTPeHVA%40mail.gmail.com

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



pgsql-hackers by date:

Previous
From: Jelte Fennema-Nio
Date:
Subject: Re: Improve readability by using designated initializers when possible
Next
From: vignesh C
Date:
Subject: Re: A failure in t/001_rep_changes.pl