Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | Bertrand Drouvot |
---|---|
Subject | Re: Synchronizing slots from primary to standby |
Date | |
Msg-id | ZczGf7tZaD0p8tNk@ip-10-97-1-34.eu-west-3.compute.internal |
In response to | RE: Synchronizing slots from primary to standby ("Zhijie Hou (Fujitsu)" <houzj.fnst@fujitsu.com>) |
Responses | Re: Synchronizing slots from primary to standby |
List | pgsql-hackers |
Hi,

On Wed, Feb 14, 2024 at 10:40:11AM +0000, Zhijie Hou (Fujitsu) wrote:
> On Wednesday, February 14, 2024 6:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Feb 14, 2024 at 2:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu)
> > > <houzj.fnst@fujitsu.com> wrote:
> > > >
> > > > Here is V87 patch that adds test for the suggested cases.
> > > >
> > >
> > > I have pushed this patch and it leads to a BF failure:
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-14%2004%3A43%3A37
> > >
> > > The test failures are:
> > > # Failed test 'logical decoding is not allowed on synced slot'
> > > # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 272.
> > > # Failed test 'synced slot on standby cannot be altered'
> > > # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 281.
> > > # Failed test 'synced slot on standby cannot be dropped'
> > > # at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 287.
> > >
> > > The reason is that in LOGs, we see a different ERROR message than what
> > > is expected:
> > > 2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR:
> > > replication slot "lsub1_slot" is active for PID 1760871
> > >
> > > Now, we see the slot still active because a test before these tests (#
> > > Test that if the synchronized slot is invalidated while the remote
> > > slot is still valid, ....) is not able to successfully persist the
> > > slot and the synced temporary slot remains active.
> > >
> > > The reason is clear by referring to below standby LOGS:
> > >
> > > LOG: connection authorized: user=bf database=postgres
> > > application_name=040_standby_failover_slots_sync.pl
> > > LOG: statement: SELECT pg_sync_replication_slots();
> > > LOG: dropped replication slot "lsub1_slot" of dbid 5
> > > STATEMENT: SELECT pg_sync_replication_slots();
> > > ...
> > > SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots
> > > WHERE slot_name = 'lsub1_slot';
> > >
> > > In the above LOGs, we should ideally see: "newly created slot
> > > "lsub1_slot" is sync-ready now" after the "LOG: dropped replication
> > > slot "lsub1_slot" of dbid 5" but lack of that means the test didn't
> > > accomplish what it was supposed to. Ideally, the same test should have
> > > failed but the pass criteria for the test failed to check whether the
> > > slot is persisted or not.
> > >
> > > The probable reason for failure is that remote_slot's restart_lsn lags
> > > behind the oldest WAL segment on standby. Now, in the test, we do
> > > ensure that the publisher and subscriber are caught up by following
> > > steps:
> > > # Enable the subscription to let it catch up to the latest wal position
> > > $subscriber1->safe_psql('postgres',
> > >     "ALTER SUBSCRIPTION regress_mysub1 ENABLE");
> > >
> > > $primary->wait_for_catchup('regress_mysub1');
> > >
> > > However, this doesn't guarantee that restart_lsn is moved to a
> > > position new enough that standby has a WAL corresponding to it.
> > >
> >
> > To ensure that restart_lsn has been moved to a recent position, we need to log
> > XLOG_RUNNING_XACTS and make sure the same is processed as well by
> > walsender. The attached patch does the required change.
> >
> > Hou-San can reproduce this problem by adding additional checkpoints in the
> > test and after applying the attached it fixes the problem. Now, this patch is
> > mostly based on the theory we formed based on LOGs on BF and a reproducer
> > by Hou-San, so still, there is some chance that this doesn't fix the BF failures in
> > which case I'll again look into those.
>
> I have verified that the patch can fix the issue on my machine(after adding few
> more checkpoints before slot invalidation test.) I also added one more check in
> the test to confirm the synced slot is not temp slot. Here is the v2 patch.

Thanks!

+# To ensure that restart_lsn has moved to a recent WAL position, we need
+# to log XLOG_RUNNING_XACTS and make sure the same is processed as well
+$primary->psql('postgres', "CHECKPOINT");

Instead of "CHECKPOINT" wouldn't a less heavy "SELECT pg_log_standby_snapshot();"
be enough?

Not a big deal but maybe we could do the change while modifying
040_standby_failover_slots_sync.pl in the next patch "Add a new slotsync worker".

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com