Re: ReplicationSlotRelease may set the statusFlags of other processes in PG14 - Mailing list pgsql-bugs

From Michael Paquier
Subject Re: ReplicationSlotRelease may set the statusFlags of other processes in PG14
Msg-id ZfkNP1OdgBSPPTsR@paquier.xyz
In response to ReplicationSlotRelease may set the statusFlags of other processes in PG14  ("feichanghong" <feichanghong@qq.com>)
Responses Re: ReplicationSlotRelease may set the statusFlags of other processes in PG14
List pgsql-bugs
On Sat, Mar 16, 2024 at 10:29:03PM +0800, feichanghong wrote:
> A process utilizing replication slots (usually walsender) calls callback
> functions in the order of RemoveProcFromArray->ProcKill upon abnormal exit.
> Within RemoveProcFromArray, MyProc is already removed from the ProcArray.
> ProcKill then attempts to set ProcGlobal->statusFlags[MyProc->pgxactoff] again
> via ReplicationSlotRelease. By this time, the flag may already be assigned to
> another process.

Oops.
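
To make the ordering problem concrete, here is a toy C model of a dense
array indexed by a per-process offset. It is only a sketch: ToyProc, the
toy_* functions and the flag bits are hypothetical stand-ins, not the real
PGPROC, ProcArrayRemove or ReplicationSlotRelease code. It assumes that
removal closes the gap in the array and leaves the removed entry's offset
untouched, which is the situation described in the report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PROCS    4
#define TOY_IN_DECODING  0x01   /* hypothetical flag bits */
#define TOY_IN_VACUUM    0x02

/* Toy stand-in for a backend's proc entry; not the real PGPROC. */
typedef struct ToyProc
{
    int     pgxactoff;      /* index into the dense arrays below */
    uint8_t statusFlags;    /* backend-local copy of its flags */
} ToyProc;

static ToyProc *toy_procs[TOY_MAX_PROCS];        /* dense array of live entries */
static uint8_t  toy_status_flags[TOY_MAX_PROCS]; /* mirrored flags, same index */
static int      toy_num_procs = 0;

static void
toy_array_add(ToyProc *proc)
{
    assert(toy_num_procs < TOY_MAX_PROCS);
    proc->pgxactoff = toy_num_procs;
    toy_procs[toy_num_procs] = proc;
    toy_status_flags[toy_num_procs] = proc->statusFlags;
    toy_num_procs++;
}

/* Remove an entry and close the gap; the removed entry keeps its old offset. */
static void
toy_array_remove(ToyProc *proc)
{
    for (int i = proc->pgxactoff; i < toy_num_procs - 1; i++)
    {
        toy_procs[i] = toy_procs[i + 1];
        toy_status_flags[i] = toy_status_flags[i + 1];
        toy_procs[i]->pgxactoff = i;
    }
    toy_num_procs--;
    toy_procs[toy_num_procs] = NULL;
    toy_status_flags[toy_num_procs] = 0;
}

/* Late flag update through the (possibly stale) offset. */
static void
toy_release_slot(ToyProc *proc)
{
    proc->statusFlags &= (uint8_t) ~TOY_IN_DECODING;
    toy_status_flags[proc->pgxactoff] = proc->statusFlags;
}

/* Returns 1 if the survivor's shared flags were clobbered, 0 otherwise. */
static int
toy_demo_stale_write(void)
{
    ToyProc walsender = {0, TOY_IN_DECODING};
    ToyProc other = {0, TOY_IN_VACUUM};

    toy_num_procs = 0;
    toy_array_add(&walsender);      /* offset 0 */
    toy_array_add(&other);          /* offset 1 */

    toy_array_remove(&walsender);   /* other moves down to offset 0 */
    toy_release_slot(&walsender);   /* writes slot 0, now owned by other */

    return toy_status_flags[other.pgxactoff] != other.statusFlags;
}
```

Calling toy_demo_stale_write() from a small main() returns 1 in this
model: the exiting entry's late write lands on the slot that the surviving
entry has just moved into.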

> To replicate the issue, execute the following steps:
> 1. Apply the attached v1-0000-v14-invalidate-pgxactoff-after-remove-pgproc.patch,
> where pgxactoff is set to an invalid value in ProcArrayRemove, and some
> checks are added.
> 2. Use the SQL below to terminate the walsender process.
> ```
> select pg_terminate_backend(pid) from pg_stat_activity where backend_type = 'walsender';
> ```
> # Fix
>
> To fix the issue, I have provided some patches in the attachment:
> 1. Backpatching 2f6501f into the PG14 version will fix the problem.
> 2. In PG14-head, ProcArrayRemove needs to reset pgxactoff, and some assert
> checks should be done when setting ProcGlobal->statusFlags.

Yeah, that's something that we had better fix in all stable branches.
The asserts would offer some protection going forward, but I would take
the safer route and add a protection like the one you are suggesting
only on HEAD, not in the stable branches, just in case we're missing
something around them.
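
For illustration only, the kind of protection being discussed (reset the
offset on removal, check it before touching the shared flags) could look
like the toy C sketch below. All names here (ToyProc, TOY_INVALID_OFF,
toy_set_shared_flags) are hypothetical and this is not the proposed patch.
The check is written as a boolean return so that the sketch can be
exercised without aborting; an assertion-style check would abort instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_PROCS    4
#define TOY_INVALID_OFF  (-1)   /* hypothetical "not in the array" marker */
#define TOY_IN_DECODING  0x01   /* hypothetical flag bits */
#define TOY_IN_VACUUM    0x02

/* Toy stand-in for a backend's proc entry; not the real PGPROC. */
typedef struct ToyProc
{
    int     pgxactoff;      /* index into the dense arrays, or TOY_INVALID_OFF */
    uint8_t statusFlags;    /* backend-local copy of its flags */
} ToyProc;

static ToyProc *toy_procs[TOY_MAX_PROCS];
static uint8_t  toy_status_flags[TOY_MAX_PROCS];
static int      toy_num_procs = 0;

static void
toy_array_add(ToyProc *proc)
{
    assert(toy_num_procs < TOY_MAX_PROCS);
    proc->pgxactoff = toy_num_procs;
    toy_procs[toy_num_procs] = proc;
    toy_status_flags[toy_num_procs] = proc->statusFlags;
    toy_num_procs++;
}

/* Remove an entry, close the gap, and invalidate the removed entry's offset. */
static void
toy_array_remove(ToyProc *proc)
{
    for (int i = proc->pgxactoff; i < toy_num_procs - 1; i++)
    {
        toy_procs[i] = toy_procs[i + 1];
        toy_status_flags[i] = toy_status_flags[i + 1];
        toy_procs[i]->pgxactoff = i;
    }
    toy_num_procs--;
    toy_procs[toy_num_procs] = NULL;
    toy_status_flags[toy_num_procs] = 0;
    proc->pgxactoff = TOY_INVALID_OFF;
}

/* Copy local flags to the shared array only while the entry is still listed. */
static bool
toy_set_shared_flags(ToyProc *proc)
{
    if (proc->pgxactoff == TOY_INVALID_OFF)
        return false;           /* already removed: skip the shared write */
    toy_status_flags[proc->pgxactoff] = proc->statusFlags;
    return true;
}

/* Returns 1 if the late write was skipped and the survivor's flags are intact. */
static int
toy_demo_guarded_exit(void)
{
    ToyProc walsender = {0, TOY_IN_DECODING};
    ToyProc other = {0, TOY_IN_VACUUM};

    toy_num_procs = 0;
    toy_array_add(&walsender);
    toy_array_add(&other);
    toy_array_remove(&walsender);

    walsender.statusFlags &= (uint8_t) ~TOY_IN_DECODING;
    bool wrote = toy_set_shared_flags(&walsender);

    return !wrote && toy_status_flags[other.pgxactoff] == TOY_IN_VACUUM;
}
```

In this model toy_demo_guarded_exit() returns 1: the removed entry's
late update is refused and the surviving entry keeps its flags.
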
--
Michael
