Re: Synchronizing slots from primary to standby - Mailing list pgsql-hackers
From | Amit Kapila |
---|---|
Subject | Re: Synchronizing slots from primary to standby |
Date | |
Msg-id | CAA4eK1LwP-h6XSYg=Dgqos9k-t2vrwWJwXRbCk2QkeHSpoYEjQ@mail.gmail.com Whole thread Raw |
In response to | Re: Synchronizing slots from primary to standby (Masahiko Sawada <sawada.mshk@gmail.com>) |
Responses |
Re: Synchronizing slots from primary to standby
|
List | pgsql-hackers |
On Wed, Jan 31, 2024 at 2:02 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > Thank you for updating the patches. As for the slotsync worker patch, > is there any reason why 0001, 0002, and 0004 patches are still > separated? > No specific reason, it could be easier to review those parts. > > Beside, here are some comments on v74 0001, 0002, and 0004 patches: > > --- > +static char * > +wait_for_valid_params_and_get_dbname(void) > +{ > + char *dbname; > + int rc; > + > + /* Sanity check. */ > + Assert(enable_syncslot); > + > + for (;;) > + { > + if (validate_parameters_and_get_dbname(&dbname)) > + break; > + ereport(LOG, errmsg("skipping slot synchronization")); > + > + ProcessSlotSyncInterrupts(NULL); > > When reading this function, I expected that the slotsync worker would > resume working once the parameters became valid, but it was not > correct. For example, if I changed hot_standby_feedback from off to > on, the slotsync worker reads the config file, exits, and then > restarts. Given that the slotsync worker ends up exiting on parameter > changes anyway, why do we want to have it wait for parameters to > become valid? > Right, the reason for waiting is to avoid repeated re-start of slotsync worker if the required parameter is not changed. To follow that, I think we should simply continue when the required parameter is changed and is valid. But, I think during actual slotsync, if connection_info is changed then there is no option but to restart. > > --- > +bool > +SlotSyncWorkerCanRestart(void) > +{ > +#define SLOTSYNC_RESTART_INTERVAL_SEC 10 > + > > IIUC depending on how busy the postmaster is and the timing, the user > could wait for 1 min to re-launch the slotsync worker. But I think the > user might want to re-launch the slotsync worker more quickly for > example when the slotsync worker restarts due to parameter changes. > IIUC SloSyncWorkerCanRestart() doesn't consider the fact that the > slotsync worker previously exited with 0 or 1. > Considering my previous where we don't want to restart for a required parameter change, isn't it better to avoid repeated restart (say when the user gave an invalid dbname)? BTW, I think this restart interval is added based on your previous complaint [1]. > > --- > When I dropped a database on the primary that has a failover slot, I > got the following logs on the standby: > > 2024-01-31 17:25:21.750 JST [1103933] FATAL: replication slot "s" is > active for PID 1103935 > 2024-01-31 17:25:21.750 JST [1103933] CONTEXT: WAL redo at 0/3020D20 > for Database/DROP: dir 1663/16384 > 2024-01-31 17:25:21.751 JST [1103930] LOG: startup process (PID > 1103933) exited with exit code 1 > > It seems that because the slotsync worker created the slot on the > standby, the slot's active_pid is still valid. > But we release the slot after sync. And we do take a shared lock on the database to make the startup process wait for slotsync. There is one gap which is that we don't reset active_pid for temp slots in ReplicationSlotRelease(), so for temp slots such an error can occur but OTOH, we immediately make the slot persistent after sync. As per my understanding, it is only possible to get this error if the initial sync doesn't happen and the slot remains temporary. Is that your case? How did reproduce this? That is why the startup > process could not drop the slot. > [1] - https://www.postgresql.org/message-id/CAD21AoApGoTZu7D_7%3DbVYQqKnj%2BPZ2Rz%2Bnc8Ky1HPQMS_XL6%2BA%40mail.gmail.com -- With Regards, Amit Kapila.
pgsql-hackers by date: