On Thu, Jan 18, 2024 at 10:31 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> I have one question about the new code in v63-0002.
>
> ======
> src/backend/replication/logical/slotsync.c
>
> 1. ReplSlotSyncWorkerMain
>
> + Assert(SlotSyncWorker->pid == InvalidPid);
> +
> + /*
> + * Startup process signaled the slot sync worker to stop, so if meanwhile
> + * postmaster ended up starting the worker again, exit.
> + */
> + if (SlotSyncWorker->stopSignaled)
> + {
> + SpinLockRelease(&SlotSyncWorker->mutex);
> + proc_exit(0);
> + }
>
> Can we be sure a worker crash can't occur (in ShutDownSlotSync?) in
> such a way that SlotSyncWorker->stopSignaled was already assigned
> true, but SlotSyncWorker->pid was not yet reset to InvalidPid;
>
> e.g. Is the Assert above still OK?
We are good with the Assert here. I tried below cases:
1) When slotsync worker is say killed using 'kill', it is considered
as SIGTERM; slot sync worker invokes 'slotsync_worker_onexit()' before
going down and thus sets SlotSyncWorker->pid = InvalidPid. This means
when it is restarted (considering we have put the breakpoints in such
a way that postmaster had already reached do_start_bgworker() before
promotion finished), it is able to see stopSignaled set but pid is
InvalidPid and thus we are good.
2) Another case is when we kill slot sync worker using 'kill -9' (or
say we make it crash), in such a case, postmaster signals each sibling
process to quit (including startup process) and cleans up the shared
memory used by each (including SlotSyncWorker). In such a case
promotion fails. And if slot sync worker is started again, it will
find pid as InvalidPid. So we are good.
thanks
Shveta