On Sat, Mar 16, 2024 at 10:29:03PM +0800, feichanghong wrote:
> A process utilizing replication slots (usually walsender) calls callback
> functions in the order of RemoveProcFromArray->ProcKill upon abnormal exit.
> Within RemoveProcFromArray, MyProc is already removed from the ProcArray.
> ProcKill then attempts to set ProcGlobal->statusFlags[MyProc->pgxactoff] again
> via ReplicationSlotRelease. By this time, the flag may already be assigned to
> another process.
Oops.
> To replicate the issue, execute the following steps:
> 1. Apply the attached v1-0000-v14-invalidate-pgxactoff-after-remove-pgproc.patch,
> where pgxactoff is set to an invalid value in ProcArrayRemove, and some
> checks are added.
> 2. Use the SQL below to terminate the walsender process.
> ```
> select pg_terminate_backend(pid) from pg_stat_activity where backend_type = 'walsender';
> ```
> # Fix
>
> To fix the issue, I have provided some patches in the attachment:
> 1. Backpatching 2f6501f into the PG14 version will fix the problem.
> 2. In PG14-head, ProcArrayRemove needs to reset pgxactoff, and some assert
> checks should be done when setting ProcGlobal->statusFlags.
Yeah, that's something that we had better fix in all stable branches.
The asserts would offer some protection moving on, but I would take
the safer move of only adding a protection like what you are
suggestion on HEAD and not in stable branches, just in case we're
missing something around them.
--
Michael