RE: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | Zhijie Hou (Fujitsu) |
---|---|
Subject | RE: Synchronizing slots from primary to standby |
Date | |
Msg-id | OS0PR01MB57160A281D09334EB0B6D602944E2@OS0PR01MB5716.jpnprd01.prod.outlook.com |
In response to | Re: Synchronizing slots from primary to standby (Amit Kapila <amit.kapila16@gmail.com>) |
Responses | Re: Synchronizing slots from primary to standby |
List | pgsql-hackers |
On Wednesday, February 14, 2024 6:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Wed, Feb 14, 2024 at 2:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Here is V87 patch that adds test for the suggested cases.
> > >
> >
> > I have pushed this patch and it leads to a BF failure:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-14%2004%3A43%3A37
> >
> > The test failures are:
> > # Failed test 'logical decoding is not allowed on synced slot'
> > # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 272.
> > # Failed test 'synced slot on standby cannot be altered'
> > # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 281.
> > # Failed test 'synced slot on standby cannot be dropped'
> > # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 287.
> >
> > The reason is that in LOGs, we see a different ERROR message than what
> > is expected:
> > 2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR: replication slot "lsub1_slot" is active for PID 1760871
> >
> > Now, we see the slot still active because a test before these tests (#
> > Test that if the synchronized slot is invalidated while the remote
> > slot is still valid, ....) is not able to successfully persist the
> > slot and the synced temporary slot remains active.
> >
> > The reason is clear by referring to below standby LOGS:
> >
> > LOG: connection authorized: user=bf database=postgres application_name=040_standby_failover_slots_sync.pl
> > LOG: statement: SELECT pg_sync_replication_slots();
> > LOG: dropped replication slot "lsub1_slot" of dbid 5
> > STATEMENT: SELECT pg_sync_replication_slots(); ...
> > SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots
> > WHERE slot_name = 'lsub1_slot';
> >
> > In the above LOGs, we should ideally see: "newly created slot
> > "lsub1_slot" is sync-ready now" after the "LOG: dropped replication
> > slot "lsub1_slot" of dbid 5" but lack of that means the test didn't
> > accomplish what it was supposed to. Ideally, the same test should have
> > failed but the pass criteria for the test failed to check whether the
> > slot is persisted or not.
> >
> > The probable reason for failure is that remote_slot's restart_lsn lags
> > behind the oldest WAL segment on standby. Now, in the test, we do
> > ensure that the publisher and subscriber are caught up by following
> > steps:
> > # Enable the subscription to let it catch up to the latest wal position
> > $subscriber1->safe_psql('postgres',
> >     "ALTER SUBSCRIPTION regress_mysub1 ENABLE");
> >
> > $primary->wait_for_catchup('regress_mysub1');
> >
> > However, this doesn't guarantee that restart_lsn is moved to a
> > position new enough that standby has a WAL corresponding to it.
> >
>
> To ensure that restart_lsn has been moved to a recent position, we need to log
> XLOG_RUNNING_XACTS and make sure the same is processed as well by
> walsender. The attached patch does the required change.
>
> Hou-San can reproduce this problem by adding additional checkpoints in the
> test and after applying the attached it fixes the problem. Now, this patch is
> mostly based on the theory we formed based on LOGs on BF and a reproducer
> by Hou-San, so still, there is some chance that this doesn't fix the BF failures in
> which case I'll again look into those.

I have verified that the patch fixes the issue on my machine (after adding a
few more checkpoints before the slot invalidation test). I also added one more
check in the test to confirm that the synced slot is not a temporary slot.

Here is the v2 patch.

Best Regards,
Hou zj
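
For readers without the attachment, the two test changes discussed in this message can be sketched in SQL roughly as follows. This is an illustration, not the patch itself. It assumes a server build from the time of this thread, where pg_replication_slots exposes the conflict_reason and synced columns, and it assumes that pg_log_standby_snapshot() is an acceptable way to log XLOG_RUNNING_XACTS; the attached patch may do either step differently.

```sql
-- On the primary: write an XLOG_RUNNING_XACTS record to WAL immediately,
-- rather than waiting for the bgwriter or checkpointer to log one.
-- Assumption: the attached patch may log this record by other means.
SELECT pg_log_standby_snapshot();

-- On the standby, after SELECT pg_sync_replication_slots():
-- the check quoted above, extended with "NOT temporary" so that a synced
-- slot that was never persisted no longer satisfies the pass criteria.
SELECT conflict_reason IS NULL AND synced AND NOT temporary
FROM pg_replication_slots
WHERE slot_name = 'lsub1_slot';
```

In the test, the first statement would need to be followed by the existing $primary->wait_for_catchup('regress_mysub1') call, so that the walsender has processed the new record before slot synchronization is attempted.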
Attachment