Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From Peter Smith
Subject Re: Synchronizing slots from primary to standby
Date
Msg-id CAHut+Pt0uum+6Hg5UDofWMEJWhVEyArM1b0_B94UJmRcQmz7DA@mail.gmail.com
Whole thread Raw
In response to Re: Synchronizing slots from primary to standby  (shveta malik <shveta.malik@gmail.com>)
List pgsql-hackers
On Tue, Jan 16, 2024 at 10:57 PM shveta malik <shveta.malik@gmail.com> wrote:
>
...
> v62-006:
> Separated the failover-ready validation steps into this separate
> doc-patch (which were earlier present in v61-002 and v61-003). Also
> addressed some of the doc comments by Peter in [1].
> Thanks Hou-San for providing this patch.
>
> [1]: https://www.postgresql.org/message-id/CAHut%2BPteZVNx1jQ6Hs3mEdoC%3DDNALVpJJ2mZDYim7sU-04tiaw%40mail.gmail.com
>

Thanks for addressing my previous review in the new patch 0006. I
checked it again and below are a few more comments

======
1. GENERAL

I was wondering if some other documentation (like somewhere from
chapter 27, or maybe the pgctl promote docs?) should be referring back
to this new information about how to decide if the standby is ready
for promotion.

======
doc/src/sgml/logical-replication.sgml

2.
+
+  <para>
+   Because the slot synchronization logic copies asynchronously, it is
+   necessary to confirm that replication slots have been synced to the standby
+   server before the failover happens. Furthermore, to ensure a successful
+   failover, the standby server must not be lagging behind the subscriber. It
+   is highly recommended to use <varname>standby_slot_names</varname> to
+   prevent the subscriber from consuming changes faster than the hot standby.
+   To confirm that the standby server is indeed ready for failover, follow
+   these 2 steps:
+  </para>

For easier navigation, perhaps that standby_slot_names should include
a link back to where the standby_slot_names GUC is described.

~~~

3.
+    <substeps>
+     <step performance="required">
+      <para>
+       Firstly, on the subscriber node, use the following SQL to identify the
+       slot names that should be synced to the standby that we plan to promote.

Minor change to wording.

SUGGESTION
Firstly, on the subscriber node, use the following SQL to identify
which slots should be synced to the standby that we plan to promote.

~~~

4.
+<programlisting>
+test_sub=# SELECT
+               array_agg(slotname) AS slots
+           FROM
+           ((
+               SELECT r.srsubid AS subid, CONCAT('pg_' || srsubid ||
'_sync_' || srrelid || '_' || ctl.system_identifier) AS slotname
+               FROM pg_control_system() ctl, pg_subscription_rel r,
pg_subscription s
+               WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND s.subfailover
+           ) UNION (
+               SELECT s.oid AS subid, s.subslotname as slotname
+               FROM pg_subscription s
+               WHERE s.subfailover
+           ));
+ slots
+-------
+ {sub}
+(1 row)
+</programlisting></para>
+     </step>

I think the example might be better if the result shows > 1 slot.  e.g.
{sub1,sub2,sub3}

This would also make the next step 1.b. more clear.

~~~

5.
+</programlisting></para>
+     </step>
+     <step performance="required">
+      <para>
+       Next, check that the logical replication slots identified above exist on
+       the standby server. This step can be skipped if
+       <varname>standby_slot_names</varname> has been correctly configured.
+<programlisting>
+test_standby=# SELECT bool_and(synced AND NOT temporary AND
conflict_reason IS NULL) AS failover_ready
+               FROM pg_replication_slots
+               WHERE slot_name in ('sub');
+ failover_ready
+----------------
+ t
+(1 row)
+</programlisting></para>

5a.

(uppercase SQL keyword)

/in/IN/

~

5b.
I felt this might be easier to understand if the SQL gives a
two-column result instead of one all-of-nothing T/F where you might no
be sure which slot was the one giving a problem. e.g.

failover_ready | slot
---------------------
t              | sub1
t              | sub2
f              | sub3
...

~~~

6.
+       <para>
+        Firstly, on the subscriber node check the last replayed WAL. If the
+        query result is NULL, it indicates that the subscriber has not yet
+        replayed any WAL. Therefore, the next step can be skipped, as the
+        standby server must be ahead of the subscriber.

IMO all of that part "If the query result is NULL" does not really
belong here because it describes skipping the *next* step. So, it
would be better to say this in the next step. Something like:

SUGGESTION (for step 2b)
Next, on the standby server check that the last-received WAL location
is ahead of the replayed WAL location on the subscriber identified
above. If the above SQL result was NULL, it means the subscriber has
not yet replayed any WAL, so the standby server must be ahead of the
subscriber, and this step can be skipped.

~~~

7.
+<programlisting>
+test_sub=# SELECT
+               MAX(remote_lsn) AS remote_lsn_on_subscriber
+           FROM
+           ((
+               SELECT (CASE WHEN r.srsubstate = 'f' THEN
pg_replication_origin_progress(CONCAT('pg_' || r.srsubid || '_' ||
r.srrelid), false)
+                           WHEN r.srsubstate IN ('s', 'r') THEN
r.srsublsn END) as remote_lsn
+               FROM pg_subscription_rel r, pg_subscription s
+               WHERE r.srsubstate IN ('f', 's', 'r') AND s.oid =
r.srsubid AND s.subfailover
+           ) UNION (
+               SELECT pg_replication_origin_progress(CONCAT('pg_' ||
s.oid), false) AS remote_lsn
+               FROM pg_subscription s
+               WHERE subfailover
+           ));
+ remote_lsn_on_subscriber
+--------------------------
+ 0/3000388
+</programlisting></para>

7a.
(uppercase SQL keyword)

/as/AS/

~

7b.
missing table alias

/WHERE subfailover/WHERE s.subfailover/

~~~

8.
+    </step>
+    <step performance="required">
+     <para>
+      Next, on the standby server check that the last-received WAL location
+      is ahead of the replayed WAL location on the subscriber identified above.


See the review comment above (#6) which suggested adding some more info here.

======
Kind Regards,
Peter Smith.
Fujitsu Australia



pgsql-hackers by date:

Previous
From: Peter Smith
Date:
Subject: Re: Synchronizing slots from primary to standby
Next
From: jian he
Date:
Subject: Re: [PATCH] Add support function for containment operators