Re: Weird test mixup - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Weird test mixup
Date
Msg-id ZjNAUN5Hek-MyVWv@paquier.xyz
Whole thread Raw
In response to Re: Weird test mixup  (Noah Misch <noah@leadboat.com>)
Responses Re: Weird test mixup
Re: Weird test mixup
List pgsql-hackers
On Wed, May 01, 2024 at 04:12:14PM -0700, Noah Misch wrote:
> While writing an injection point test, I encountered a variant of the race
> condition that f4083c4 fixed.  It had three sessions and this sequence of
> events:
>
> s1: local-attach to POINT
> s2: enter InjectionPointRun(POINT), yield CPU just before injection_callback()
> s3: detach POINT, deleting the InjectionPointCondition record
> s2: wake up and run POINT as though it had been non-local

Fun.  One thing I would ask is why it makes sense to be able to detach
a local point from a different session than the one who defined it as
local.  Shouldn't the operation of s3 be restricted rather than
authorized as a safety measure, instead?

> On Sat, Mar 16, 2024 at 08:40:21AM +0900, Michael Paquier wrote:
>> On Fri, Mar 15, 2024 at 11:23:31AM +0200, Heikki Linnakangas wrote:
>>> Wrt. the spinlock and shared memory handling, I think this would be simpler
>>> if you could pass some payload in the InjectionPointAttach() call, which
>>> would be passed back to the callback function:
>>>
>>> In this case, the payload would be the "slot index" in shared memory.
>>>
>>> Or perhaps always allocate, say, 1024 bytes of working area for every
>>> attached injection point that the test module can use any way it wants. Like
>>> for storing extra conditions, or for the wakeup counter stuff in
>>> injection_wait(). A fixed size working area is a little crude, but would be
>>> very handy in practice.
>
> That would be one good way to solve it.  (Storing a slot index has the same
> race condition, but it fixes the race to store a struct containing the PID.)
>
> The best alternative I see is to keep an InjectionPointCondition forever after
> creating it.  Give it a "bool valid" field that we set on detach.  I don't see
> a major reason to prefer one of these over the other.  One puts a negligible
> amount of memory pressure on the main segment, but it simplifies the module
> code.  I lean toward the "1024 bytes of working area" idea.  Other ideas or
> opinions?

If more state data is needed, the fixed area injection_point.c would
be better.  Still, I am not sure that this is required here, either.

> Separately, injection_points_cleanup() breaks the rules by calling
> InjectionPointDetach() while holding a spinlock.  The latter has an
> elog(ERROR), and reaching that elog(ERROR) leaves a stuck spinlock.  I haven't
> given as much thought to solutions for this one.

Indeed.  That's a brain fade.  This one could be fixed by collecting
the point names when cleaning up the conditions and detach after
releasing the spinlock.  This opens a race condition between the
moment when the spinlock is released and the detach, where another
backend could come in and detach a point before the shmem_exit
callback has the time to do its cleanup, even if detach() is
restricted for local points.  So we could do the callback cleanup in
three steps in the shmem exit callback:
- Collect the names of the points to detach, while holding the
spinlock.
- Do the Detach.
- Take again the spinlock, clean up the conditions.

Please see the attached.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Frédéric Yhuel
Date:
Subject: Re: New GUC autovacuum_max_threshold ?
Next
From: Yuya Watari
Date:
Subject: Re: [PoC] Reducing planning time when tables have many partitions