Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers

From shveta malik
Subject Re: Synchronizing slots from primary to standby
Date
Msg-id CAJpy0uCBHRX-GSKCeVza44kFEC=uTMD_6uzuXXnbUX32Vt-g8Q@mail.gmail.com
In response to Re: Synchronizing slots from primary to standby  (Nisha Moond <nisha.moond412@gmail.com>)
Responses Re: Synchronizing slots from primary to standby
Re: Synchronizing slots from primary to standby
List pgsql-hackers
On Fri, Dec 1, 2023 at 5:40 PM Nisha Moond <nisha.moond412@gmail.com> wrote:
>
> Review for v41 patch.

Thanks for the feedback.

>
> 1.
> ======
> src/backend/utils/misc/postgresql.conf.sample
>
> +#enable_syncslot = on # enables slot synchronization on the physical
> standby from the primary
>
> enable_syncslot is disabled by default, so, it should be 'off' here.
>

Sure, I will change it.

> ~~~
> 2.
> IIUC, the slotsyncworker's connection to the primary is to execute a
> query. Its aim is not walsender type connection, but at primary when
> queried, the 'backend_type' is set to 'walsender'.
> Snippet from primary db-
>
> datname  |   usename   | application_name | wait_event_type | backend_type
> ---------+-------------+------------------+-----------------+--------------
> postgres | replication | slotsyncworker   | Client          | walsender
>
> Is it okay?
>

The slot sync worker connects using 'libpqrcv_connect', which sends the
'replication'-'database' key-value pair as one of the connection
options. On the primary side, 'ProcessStartupPacket' sets the process
up as a walsender (am_walsender = true) on the basis of this key-value
pair, and that is why it shows up as backend_type='walsender' in
pg_stat_activity. I do not see any harm in this backend_type for the
slot sync worker currently. This is along the same lines as the
connections used for logical replication. And since the slot sync
worker also deals with WAL positions (LSNs), it is okay to keep
backend_type as walsender unless you (or others) see a potential issue
with that. So let me know.
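
To make the mechanism concrete, here is a minimal standalone sketch of
that decision. It is not the backend code: the struct and the helper
names (StartupFlags, apply_startup_option, backend_type_for) are
hypothetical, only the 'replication' key is modelled, and the
am_db_walsender flag is included from my recollection of the backend
rather than from anything in the patch.

```c
#include <stdbool.h>
#include <string.h>

/* Simplified stand-ins for the backend's per-process flags. */
typedef struct StartupFlags
{
    bool am_walsender;    /* shows up as backend_type 'walsender' */
    bool am_db_walsender; /* walsender that is also bound to a database */
} StartupFlags;

/*
 * Hypothetical helper: apply one startup-packet option to the flags.
 * Only the "replication" key is handled; every other key is ignored.
 * The real code accepts more boolean spellings and errors out on
 * invalid values, which this sketch does not attempt.
 */
static void
apply_startup_option(StartupFlags *flags, const char *key, const char *value)
{
    if (strcmp(key, "replication") != 0)
        return;

    if (strcmp(value, "database") == 0)
    {
        /* replication=database: what the slot sync worker sends */
        flags->am_walsender = true;
        flags->am_db_walsender = true;
    }
    else if (strcmp(value, "true") == 0 || strcmp(value, "on") == 0)
    {
        /* plain physical replication connection */
        flags->am_walsender = true;
    }
}

/* Hypothetical mapping from the flags to the reported backend_type. */
static const char *
backend_type_for(const StartupFlags *flags)
{
    return flags->am_walsender ? "walsender" : "client backend";
}
```

With this model, a connection that passes replication=database is
reported as 'walsender' regardless of its application_name, which
matches the pg_stat_activity row you pasted.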

> ~~~
> 3.
> As per current logic, If there are slots on primary with disabled
> subscriptions, then, when standby is created it replicates these slots
> but can't make them sync-ready until any activity happens on the
> slots.
> So, such slots stay in 'i' sync-state and get dropped when failover
> happens. Now, if the subscriber tries to enable their existing
> subscription after failover, it gives an error that the slot does not
> exist.
>

Yes, this is expected, as Amit explained in [1]. But let me review
whether we need to document this case for disabled subscriptions, i.e.
a disabled subscription might not work if it is enabled after
promotion.
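
For anyone following along, the behaviour in question can be modelled
roughly as below. This is only an illustration of the rule "slots that
never became sync-ready are dropped at promotion": the struct, the
function name and the 'r' value for the ready state are assumptions
made for the sketch (only the 'i' state is mentioned in this thread).

```c
#include <stdbool.h>
#include <stddef.h>

/* 'i' is the state discussed above; 'r' for "ready" is an assumption. */
#define SYNC_STATE_INITIATED 'i'
#define SYNC_STATE_READY     'r'

/* Hypothetical, much-reduced view of a synced slot on the standby. */
typedef struct SyncedSlot
{
    const char *name;
    char        sync_state;
    bool        dropped;
} SyncedSlot;

/*
 * Hypothetical promotion step: drop every slot that has not reached
 * the sync-ready state. Returns the number of slots dropped.
 */
static int
drop_unready_slots_on_promotion(SyncedSlot *slots, size_t nslots)
{
    int ndropped = 0;

    for (size_t i = 0; i < nslots; i++)
    {
        if (!slots[i].dropped && slots[i].sync_state != SYNC_STATE_READY)
        {
            slots[i].dropped = true;
            ndropped++;
        }
    }
    return ndropped;
}
```

A slot belonging to a disabled subscription sees no activity, so in
this model it stays in 'i', is dropped at promotion, and a later
attempt to enable the subscription finds no slot.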

> ~~~
> 4. primary_slot_name GUC value test:
>
> When standby is started with a non-existing primary_slot_name, the
> wal-receiver gives an error but the slot-sync worker does not raise
> any error/warning. It is no-op though as it has a check 'if
> (XLogRecPtrIsInvalid(WalRcv->latestWalEnd)) do nothing'.   Is this
> okay or shall the slot-sync worker too raise an error and exit?
>
> In another case, when standby is started with valid primary_slot_name,
> but it is changed to some invalid value in runtime, then walreceiver
> starts giving error but the slot-sync worker keeps on running. In this
> case, unlike the previous case, it even did not go to no-op mode (as
> it sees valid WalRcv->latestWalEnd from the earlier run) and keep
> pinging primary repeatedly for slots.  Shall here it should error out
> or at least be no-op until we give a valid primary_slot_name?
>

I reviewed it. There is no way to test the existence/validity of
'primary_slot_name' on the standby without making a connection to the
primary. If primary_slot_name is invalid from the start, the slot sync
worker will be a no-op (as you tested) because WalRcv->latestWalEnd
will be invalid. If 'primary_slot_name' is changed to an invalid value
at runtime, the slot sync worker will keep on pinging the primary. But
that should be okay (in fact needed), as it needs to sync at least the
slots' previous positions (in case it was delayed in doing so for some
reason earlier). And once the slots are up to date on the standby,
even if the worker pings the primary, it will not see any change in
the slots' LSNs and will thus go for a longer nap. I think it is not
worth the effort to introduce the complexity of checking the validity
of 'primary_slot_name' on the primary from the standby for this rare
scenario.
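
A rough standalone sketch of the per-cycle decision described above,
in case it helps the discussion. The XLogRecPtr definitions mirror the
backend's; everything else (CycleAction, CycleResult, slotsync_cycle,
the nap-time values, and napping for the default time in the no-op
case) is hypothetical and not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr      ((XLogRecPtr) 0)
#define XLogRecPtrIsInvalid(r) ((r) == InvalidXLogRecPtr)

/* Illustrative nap times only, not the values used by the patch. */
#define NAPTIME_DEFAULT_MS  10L
#define NAPTIME_INACTIVE_MS 10000L

typedef enum CycleAction
{
    CYCLE_NOOP,  /* walreceiver never got WAL: do not query the primary */
    CYCLE_SYNCED /* queried the primary and synced the slots */
} CycleAction;

typedef struct CycleResult
{
    CycleAction action;
    long        naptime_ms;
} CycleResult;

/*
 * Hypothetical model of one slot sync worker cycle.
 * latest_wal_end stands for WalRcv->latestWalEnd; slots_changed tells
 * whether any slot LSN on the primary moved since the previous cycle.
 */
static CycleResult
slotsync_cycle(XLogRecPtr latest_wal_end, bool slots_changed)
{
    CycleResult res;

    if (XLogRecPtrIsInvalid(latest_wal_end))
    {
        /* e.g. primary_slot_name was invalid from the start */
        res.action = CYCLE_NOOP;
        res.naptime_ms = NAPTIME_DEFAULT_MS;
        return res;
    }

    /* A stale but valid latestWalEnd still lets the worker sync. */
    res.action = CYCLE_SYNCED;
    res.naptime_ms = slots_changed ? NAPTIME_DEFAULT_MS : NAPTIME_INACTIVE_MS;
    return res;
}
```

The second case from the review (primary_slot_name changed to an
invalid value at runtime) corresponds to a valid latest_wal_end with
slots_changed = false once the standby has caught up, so the worker
just settles into the longer nap.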

It would be good to know the thoughts of others on the above 3 points.

thanks
Shveta


