RE: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Synchronizing slots from primary to standby
Date
Msg-id OS0PR01MB57160A281D09334EB0B6D602944E2@OS0PR01MB5716.jpnprd01.prod.outlook.com
In response to Re: Synchronizing slots from primary to standby  (Amit Kapila <amit.kapila16@gmail.com>)
Responses Re: Synchronizing slots from primary to standby
List pgsql-hackers
On Wednesday, February 14, 2024 6:05 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> 
> On Wed, Feb 14, 2024 at 2:14 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Wed, Feb 14, 2024 at 9:34 AM Zhijie Hou (Fujitsu)
> > <houzj.fnst@fujitsu.com> wrote:
> > >
> > > Here is V87 patch that adds test for the suggested cases.
> > >
> >
> > I have pushed this patch and it leads to a BF failure:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-02-14%2004%3A43%3A37
> >
> > The test failures are:
> > #   Failed test 'logical decoding is not allowed on synced slot'
> > #   at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 272.
> > #   Failed test 'synced slot on standby cannot be altered'
> > #   at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 281.
> > #   Failed test 'synced slot on standby cannot be dropped'
> > #   at /home/bf/bf-build/flaviventris/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl line 287.
> >
> > The reason is that in LOGs, we see a different ERROR message than what
> > is expected:
> > 2024-02-14 04:52:32.916 UTC [1767765][client backend][3/4:0] ERROR:
> > replication slot "lsub1_slot" is active for PID 1760871
> >
> > Now, we see the slot still active because a test before these tests (#
> > Test that if the synchronized slot is invalidated while the remote
> > slot is still valid, ....) is not able to successfully persist the
> > slot and the synced temporary slot remains active.
> >
> > The reason is clear by referring to below standby LOGS:
> >
> > LOG:  connection authorized: user=bf database=postgres
> > application_name=040_standby_failover_slots_sync.pl
> > LOG:  statement: SELECT pg_sync_replication_slots();
> > LOG:  dropped replication slot "lsub1_slot" of dbid 5
> > STATEMENT:  SELECT pg_sync_replication_slots(); ...
> > SELECT conflict_reason IS NULL AND synced FROM pg_replication_slots
> > WHERE slot_name = 'lsub1_slot';
> >
> > In the above LOGs, we should ideally see: "newly created slot
> > "lsub1_slot" is sync-ready now" after the "LOG:  dropped replication
> > slot "lsub1_slot" of dbid 5" but lack of that means the test didn't
> > accomplish what it was supposed to. Ideally, the same test should have
> > failed but the pass criteria for the test failed to check whether the
> > slot is persisted or not.
> >
> > The probable reason for failure is that remote_slot's restart_lsn lags
> > behind the oldest WAL segment on standby. Now, in the test, we do
> > ensure that the publisher and subscriber are caught up by following
> > steps:
> > # Enable the subscription to let it catch up to the latest wal position
> > $subscriber1->safe_psql('postgres',
> >     "ALTER SUBSCRIPTION regress_mysub1 ENABLE");
> >
> > $primary->wait_for_catchup('regress_mysub1');
> >
> > However, this doesn't guarantee that restart_lsn is moved to a
> > position new enough that standby has a WAL corresponding to it.
> >
> 
> To ensure that restart_lsn has been moved to a recent position, we need to log
> XLOG_RUNNING_XACTS and make sure the same is processed as well by
> walsender. The attached patch does the required change.
> 
> Hou-San can reproduce this problem by adding additional checkpoints in the
> test, and applying the attached patch fixes it. This patch is mostly based on
> the theory we formed from the LOGs on the BF and a reproducer by Hou-San, so
> there is still some chance that it doesn't fix the BF failures, in which case
> I'll look into those again.

I have verified that the patch fixes the issue on my machine (after adding a
few more checkpoints before the slot invalidation test). I also added one more
check in the test to confirm that the synced slot is not a temporary slot. Here
is the v2 patch.
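
For illustration only, a minimal sketch of the idea in plain SQL; this is not
the attached patch. It assumes that pg_log_standby_snapshot() is the way
XLOG_RUNNING_XACTS gets logged on the primary (the patch may do this
differently), and it uses the column names from the query quoted above
(conflict_reason, synced) plus the temporary column of pg_replication_slots.
The subscriber still has to catch up between the two steps, as
wait_for_catchup does in the test.

```sql
-- On the primary: log an XLOG_RUNNING_XACTS record so that, once the
-- walsender has processed it, the slot's restart_lsn can move forward.
-- (Assumption: the patch may trigger this in another way.)
SELECT pg_log_standby_snapshot();

-- On the standby, after the subscriber has caught up: sync the slots again.
SELECT pg_sync_replication_slots();

-- On the standby: the slot should be valid, synced, and persisted,
-- i.e. no longer a temporary slot.
SELECT conflict_reason IS NULL AND synced AND NOT temporary
FROM pg_replication_slots
WHERE slot_name = 'lsub1_slot';
```

If the last query returns false only because of the temporary column, the slot
was created by the sync but never persisted, which is the case the original
pass criteria missed.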

Best Regards,
Hou zj
