RE: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From Zhijie Hou (Fujitsu)
Subject RE: Synchronizing slots from primary to standby
Date
Msg-id OS0PR01MB5716CE0729CEB3B5994A954194B3A@OS0PR01MB5716.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Synchronizing slots from primary to standby  ("Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com>)
Responses Re: Synchronizing slots from primary to standby
Re: Synchronizing slots from primary to standby
List pgsql-hackers
On Friday, November 10, 2023 4:16 PM Drouvot, Bertrand <bertranddrouvot.pg@gmail.com> wrote:
> 
> 
> >>> Also, the current coding doesn't ensure we will always give WARNING.
> >>> If we see the below code that deals with this WARNING,
> >>>
> >>> +  /* User created slot with the same name exists, emit WARNING. */
> >>> +  else if (found && s->data.sync_state == SYNCSLOT_STATE_NONE)
> >>> +  {
> >>> +    ereport(WARNING,
> >>> +        errmsg("not synchronizing slot %s; it is a user created slot",
> >>> +             remote_slot->name));
> >>> +  }
> >>> +  /* Otherwise create the slot first. */
> >>> +  else
> >>> +  {
> >>> +    TransactionId xmin_horizon = InvalidTransactionId;
> >>> +    ReplicationSlot *slot;
> >>> +
> >>> +    ReplicationSlotCreate(remote_slot->name, true, RS_EPHEMERAL,
> >>> +                remote_slot->two_phase, false);
> >>>
> >>> I think this is not a solid check to ensure that the slot existed
> >>> before. Because it could be created as soon as the slot sync worker
> >>> invokes ReplicationSlotCreate() here.
> >>
> >> Agree.
> >>
> >
> > So, having a concrete check to give WARNING would require some more
> > logic which I don't think is a good idea to handle this boundary case.
> >
> 
> Yeah good point, agree to just error out in all the case then (if we discard the
> sync_ reserved wording proposal, which seems to be the case as probably not
> worth the extra work).

Thanks for the discussion!

Here is the V33 patch set which includes the following changes:

1) Drop slots with state 'i' in promotion flow after we shut down WalReceiver.

2) Raise error if user slot with same name already exists on standby.

3) Hold spinlock when updating the porperty of the replication slot or when
   reading the slot's info without acuqiring it.

4) Fixed the bug:
- if slot is invalidated on standby but standby is restarted immediately after
  that, then it fails to recreate that slot and instead end up syncing it
  again. It is fixed by checking the invalidated flag after acquiring the slot
  and skip syncing for invalidated slots.

5) Fixed the bugs reported by Ajin[1][2].

6) Removed some unused variables.

Thanks Shveta for working on 1) 2) 4) and 5).
Thanks Ajin for testing 5).

---
TODO
There are few pending comments and bugs have not been addressed, I will work on
them in next version:

1) Comments posted by me[3]:
2) Shutdown the slotsync worker on promotion and probably let slotsync do necessary cleanups[4]
3) Consider documenting the hazard that create slot on standby may cause ERRORs
   in the log and consider making it visible in the view.
4) One bug found internally: If we give non-existing dbname in primary_conninfo
   on standby, it should be handled gracefully, launcher should skip
   slots-synchronization. 
5) One bug found internally: when wait_for_primary_slot_catchup is doing
   WaitLatch and I stop the standby using: ./pg_ctl -D ../../standbydb/ -l
   standby.log stop it does not come out of wait and shutdown hangs.

[1] https://www.postgresql.org/message-id/CAFPTHDZNV_HFAXULkaJOv_MMtLukCzDEgTaixxBwjEO_0Jg-kg%40mail.gmail.com
[2] https://www.postgresql.org/message-id/CAFPTHDa5C_vHQbeqemToyucWySB0kEFbdS2WOA0PB%2BGSei2v7A%40mail.gmail.com
[3]
https://www.postgresql.org/message-id/OS0PR01MB571652CCD42F1D08D5BD69D494B3A%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[4] https://www.postgresql.org/message-id/64056e35-1916-461c-a816-26e40ffde3a0%40gmail.com

Best Regards,
Hou zj

Attachment

pgsql-hackers by date:

Previous
From: Daniel Gustafsson
Date:
Subject: Re: proposal: possibility to read dumped table's name from file
Next
From: Bharath Rupireddy
Date:
Subject: Re: add log messages when replication slots become active and inactive (was Re: Is it worth adding ReplicationSlot active_pid to ReplicationSlotPersistentData?)