Re: speed up a logical replica setup - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: speed up a logical replica setup |
Date | |
Msg-id | CAA4eK1KG7+jUdONhsELY2poyBrmesNnkt-m5QSgPhgyyM6cFHQ@mail.gmail.com Whole thread Raw |
In response to | Re: speed up a logical replica setup (Alexander Lakhin <exclusion@gmail.com>) |
Responses |
RE: speed up a logical replica setup
Re: speed up a logical replica setup |
List | pgsql-hackers |
On Tue, Jul 9, 2024 at 4:30 PM Alexander Lakhin <exclusion@gmail.com> wrote: > > Please look at another failure of the test [1]: > [13:28:05.647](2.460s) not ok 26 - failover slot is synced > [13:28:05.648](0.001s) # Failed test 'failover slot is synced' > # at /home/bf/bf-build/skink-master/HEAD/pgsql/src/bin/pg_basebackup/t/040_pg_createsubscriber.pl line 307. > [13:28:05.648](0.000s) # got: '' > # expected: 'failover_slot' > > with 040_pg_createsubscriber_node_s.log containing: > 2024-07-08 13:28:05.369 UTC [3985464][client backend][0/2:0] LOG: statement: SELECT pg_sync_replication_slots() > 2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] LOG: could not sync slot "failover_slot" as remote slot > precedes local slot > 2024-07-08 13:28:05.557 UTC [3985464][client backend][0/2:0] DETAIL: Remote slot has LSN 0/30047B8 and catalog xmin > 743, but local slot has LSN 0/30047B8 and catalog xmin 744. > > I could not reproduce it locally, but I've discovered that that subtest > somehow depends on pg_createsubscriber executed for the > 'primary contains unmet conditions on node P' check. For example with this > test modification: > @@ -249,7 +249,7 @@ command_fails( > $node_p->connstr($db1), '--socket-directory', > $node_s->host, '--subscriber-port', > $node_s->port, '--database', > - $db1, '--database', > + 'XXX', '--database', > $db2 > ], > 'primary contains unmet conditions on node P'); > > I see the same failure: > 2024-07-09 10:19:43.284 UTC [938890] 040_pg_createsubscriber.pl LOG: statement: SELECT pg_sync_replication_slots() > 2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl LOG: could not sync slot "failover_slot" as remote slot > precedes local slot > 2024-07-09 10:19:43.292 UTC [938890] 040_pg_createsubscriber.pl DETAIL: Remote slot has LSN 0/3004780 and catalog xmin > 743, but local slot has LSN 0/3004780 and catalog xmin 744. > > Thus maybe even a normal pg_createsubscriber run can affect the primary > server (it's catalog xmin) differently? > Yes, pg_createsubscriber can affect the primary server's catalog xmin because it starts the standby server that can send HSFeedback (See XLogWalRcvSendHSFeedback()), which can advance the physical slot's xmin corresponding the following Insert in the test: # Insert another row on node P and wait node S to catch up $node_p->safe_psql($db1, "INSERT INTO tbl1 VALUES('second row')"); $node_p->wait_for_replay_catchup($node_s); In the success case, pg_createsubscriber is able to send HSFeedback and in the failure case, it won't. We can see the following logs in 040_pg_createsubscriber_node_p.log: 2024-07-08 13:28:00.872 UTC [3982331][walsender][:0] FATAL: the database system is starting up 2024-07-08 13:28:00.875 UTC [3982328][startup][:0] LOG: database system was shut down at 2024-07-08 13:28:00 UTC 2024-07-08 13:28:01.105 UTC [3981996][postmaster][:0] LOG: database system is ready to accept connections This shows that when the test 'primary contains unmet conditions on node P' starts the standby server the corresponding primary node was not ready because we just restarted node_p before that test and didn't ensure that the node_p is up and ready to accept connections before starting the pg_createsubscriber test. Even in the successful cases where the standby is able to connect to primary for test 'primary contains unmet conditions on node P', there is no guarantee that xmin of the physical slot will be updated at least, we don't have anything in the test to ensure the same. Now as before creating logical replication, we didn't ensure that the physical slot's xmin has been caught up to the latest value, the test can lead to failure like: "Remote slot has LSN 0/3004780 and catalog xmin 743, but local slot has LSN 0/3004780 and catalog xmin 744". The xmin on standby could have been advanced due to the following Insert: # Insert another row on node P and wait node S to catch up $node_p->safe_psql($db1, "INSERT INTO tbl1 VALUES('second row')"); $node_p->wait_for_replay_catchup($node_s); We don't wait for the xmin to catch up corresponding to this insert and I don't know if there is a way to do that. So, we should move this Insert to after the call to pg_sync_replication_slots(). It won't impact the general test of pg_createsubscriber. Thanks to Hou-San for helping me in the analysis of this BF failure. -- With Regards, Amit Kapila.
pgsql-hackers by date: