Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | Peter Smith |
---|---|
Subject | Re: Synchronizing slots from primary to standby |
Date | |
Msg-id | CAHut+Pt0uum+6Hg5UDofWMEJWhVEyArM1b0_B94UJmRcQmz7DA@mail.gmail.com Whole thread Raw |
In response to | Re: Synchronizing slots from primary to standby (shveta malik <shveta.malik@gmail.com>) |
List | pgsql-hackers |
On Tue, Jan 16, 2024 at 10:57 PM shveta malik <shveta.malik@gmail.com> wrote: > ... > v62-006: > Separated the failover-ready validation steps into this separate > doc-patch (which were earlier present in v61-002 and v61-003). Also > addressed some of the doc comments by Peter in [1]. > Thanks Hou-San for providing this patch. > > [1]: https://www.postgresql.org/message-id/CAHut%2BPteZVNx1jQ6Hs3mEdoC%3DDNALVpJJ2mZDYim7sU-04tiaw%40mail.gmail.com > Thanks for addressing my previous review in the new patch 0006. I checked it again and below are a few more comments ====== 1. GENERAL I was wondering if some other documentation (like somewhere from chapter 27, or maybe the pgctl promote docs?) should be referring back to this new information about how to decide if the standby is ready for promotion. ====== doc/src/sgml/logical-replication.sgml 2. + + <para> + Because the slot synchronization logic copies asynchronously, it is + necessary to confirm that replication slots have been synced to the standby + server before the failover happens. Furthermore, to ensure a successful + failover, the standby server must not be lagging behind the subscriber. It + is highly recommended to use <varname>standby_slot_names</varname> to + prevent the subscriber from consuming changes faster than the hot standby. + To confirm that the standby server is indeed ready for failover, follow + these 2 steps: + </para> For easier navigation, perhaps that standby_slot_names should include a link back to where the standby_slot_names GUC is described. ~~~ 3. + <substeps> + <step performance="required"> + <para> + Firstly, on the subscriber node, use the following SQL to identify the + slot names that should be synced to the standby that we plan to promote. Minor change to wording. SUGGESTION Firstly, on the subscriber node, use the following SQL to identify which slots should be synced to the standby that we plan to promote. ~~~ 4. +<programlisting> +test_sub=# SELECT + array_agg(slotname) AS slots + FROM + (( + SELECT r.srsubid AS subid, CONCAT('pg_' || srsubid || '_sync_' || srrelid || '_' || ctl.system_identifier) AS slotname + FROM pg_control_system() ctl, pg_subscription_rel r, pg_subscription s + WHERE r.srsubstate = 'f' AND s.oid = r.srsubid AND s.subfailover + ) UNION ( + SELECT s.oid AS subid, s.subslotname as slotname + FROM pg_subscription s + WHERE s.subfailover + )); + slots +------- + {sub} +(1 row) +</programlisting></para> + </step> I think the example might be better if the result shows > 1 slot. e.g. {sub1,sub2,sub3} This would also make the next step 1.b. more clear. ~~~ 5. +</programlisting></para> + </step> + <step performance="required"> + <para> + Next, check that the logical replication slots identified above exist on + the standby server. This step can be skipped if + <varname>standby_slot_names</varname> has been correctly configured. +<programlisting> +test_standby=# SELECT bool_and(synced AND NOT temporary AND conflict_reason IS NULL) AS failover_ready + FROM pg_replication_slots + WHERE slot_name in ('sub'); + failover_ready +---------------- + t +(1 row) +</programlisting></para> 5a. (uppercase SQL keyword) /in/IN/ ~ 5b. I felt this might be easier to understand if the SQL gives a two-column result instead of one all-of-nothing T/F where you might no be sure which slot was the one giving a problem. e.g. failover_ready | slot --------------------- t | sub1 t | sub2 f | sub3 ... ~~~ 6. + <para> + Firstly, on the subscriber node check the last replayed WAL. If the + query result is NULL, it indicates that the subscriber has not yet + replayed any WAL. Therefore, the next step can be skipped, as the + standby server must be ahead of the subscriber. IMO all of that part "If the query result is NULL" does not really belong here because it describes skipping the *next* step. So, it would be better to say this in the next step. Something like: SUGGESTION (for step 2b) Next, on the standby server check that the last-received WAL location is ahead of the replayed WAL location on the subscriber identified above. If the above SQL result was NULL, it means the subscriber has not yet replayed any WAL, so the standby server must be ahead of the subscriber, and this step can be skipped. ~~~ 7. +<programlisting> +test_sub=# SELECT + MAX(remote_lsn) AS remote_lsn_on_subscriber + FROM + (( + SELECT (CASE WHEN r.srsubstate = 'f' THEN pg_replication_origin_progress(CONCAT('pg_' || r.srsubid || '_' || r.srrelid), false) + WHEN r.srsubstate IN ('s', 'r') THEN r.srsublsn END) as remote_lsn + FROM pg_subscription_rel r, pg_subscription s + WHERE r.srsubstate IN ('f', 's', 'r') AND s.oid = r.srsubid AND s.subfailover + ) UNION ( + SELECT pg_replication_origin_progress(CONCAT('pg_' || s.oid), false) AS remote_lsn + FROM pg_subscription s + WHERE subfailover + )); + remote_lsn_on_subscriber +-------------------------- + 0/3000388 +</programlisting></para> 7a. (uppercase SQL keyword) /as/AS/ ~ 7b. missing table alias /WHERE subfailover/WHERE s.subfailover/ ~~~ 8. + </step> + <step performance="required"> + <para> + Next, on the standby server check that the last-received WAL location + is ahead of the replayed WAL location on the subscriber identified above. See the review comment above (#6) which suggested adding some more info here. ====== Kind Regards, Peter Smith. Fujitsu Australia
pgsql-hackers by date: