Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | shveta malik |
---|---|
Subject | Re: Synchronizing slots from primary to standby |
Date | |
Msg-id | CAJpy0uCSS5zmdyUXhvw41HSdTbRqX1hbYqkOfHNj7qQ+2zn0AQ@mail.gmail.com Whole thread Raw |
In response to | Re: Synchronizing slots from primary to standby (Amit Kapila <amit.kapila16@gmail.com>) |
Responses |
Re: Synchronizing slots from primary to standby
Re: Synchronizing slots from primary to standby |
List | pgsql-hackers |
On Wed, Apr 3, 2024 at 3:36 PM Amit Kapila <amit.kapila16@gmail.com> wrote: > > On Wed, Apr 3, 2024 at 11:13 AM Amit Kapila <amit.kapila16@gmail.com> wrote: > > > > On Wed, Apr 3, 2024 at 9:36 AM Bharath Rupireddy > > <bharath.rupireddyforpostgres@gmail.com> wrote: > > > > > I quickly looked at v8, and have a nit, rest all looks good. > > > > > > + if (DecodingContextReady(ctx) && found_consistent_snapshot) > > > + *found_consistent_snapshot = true; > > > > > > Can the found_consistent_snapshot be checked first to help avoid the > > > function call DecodingContextReady() for pg_replication_slot_advance > > > callers? > > > > > > > Okay, changed. Additionally, I have updated the comments and commit > > message. I'll push this patch after some more testing. > > > > Pushed! There is an intermittent BF failure observed at [1] after this commit (2ec005b). Analysis: We see in BF logs, that during the first call of the sync function, restart_lsn of the synced slot is advanced to a lsn which is > than remote slot's restart_lsn. And when sync call is done second time without any further change on primary, it hits the error: ERROR: cannot synchronize local slot "lsub1_slot" LSN(0/3000060) to remote slot's LSN(0/3000028) as synchronization would move it backwards Relevant BF logs are given at [2]. This reproduces intermittently depending upon if bgwriter logs running txn record when the test is running. We were able to mimic the test case to reproduce the failure. Please see attached bf-test.txt for steps. Issue: Issue is that we are having a wrong sanity check based on 'restart_lsn' in synchronize_one_slot(): if (remote_slot->restart_lsn < slot->data.restart_lsn) elog(ERROR, ...); Prior to commit 2ec005b, this check was okay, as we did not expect restart_lsn of the synced slot to be ahead of remote since we were directly copying the lsns. But now when we use 'advance' to do logical decoding on standby, there is a possibility that restart lsn of the synced slot is ahead of remote slot, if there are running txns records found after reaching consistent-point while consuming WALs from restart_lsn till confirmed_lsn. In such a case, slot-sync's advance may end up serializing snapshots and setting restart_lsn to the serialized snapshot point, ahead of remote one. Fix: The sanity check needs to be corrected. Attached a patch to address the issue. a) The sanity check is corrected to compare confirmed_lsn rather than restart_lsn. Additional changes: b) A log has been added after LogicalSlotAdvanceAndCheckSnapState() to log the case when the local and remote slots' confirmed-lsn were not found to be the same after sync (if at all). c) Now we attempt to sync in update_local_synced_slot() if one of confirmed_lsn, restart_lsn, and catalog_xmin for remote slot is ahead of local slot instead of them just being unequal. [1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-04-03%2017%3A57%3A28 [2]: 2024-04-03 18:00:41.896 UTC [3239617][client backend][0/2:0] LOG: statement: SELECT pg_sync_replication_slots(); LOG: starting logical decoding for slot "lsub1_slot" DETAIL: Streaming transactions committing after 0/0, reading WAL from 0/3000028. LOG: logical decoding found consistent point at 0/3000028 DEBUG: serializing snapshot to pg_logical/snapshots/0-3000060.snap DEBUG: got new restart lsn 0/3000060 at 0/3000060 LOG: newly created slot "lsub1_slot" is sync-ready now 2024-04-03 18:00:45.218 UTC [3243487][client backend][2/2:0] LOG: statement: SELECT pg_sync_replication_slots(); ERROR: cannot synchronize local slot "lsub1_slot" LSN(0/3000060) to remote slot's LSN(0/3000028) as synchronization would move it backwards thanks Shveta
Attachment
pgsql-hackers by date: