Re: Fix race condition in InvalidatePossiblyObsoleteSlot() - Mailing list pgsql-hackers

From Michael Paquier
Subject Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
Date
Msg-id ZdLxr2CPpPVNh9Zi@paquier.xyz
Whole thread Raw
In response to Re: Fix race condition in InvalidatePossiblyObsoleteSlot()  (Bertrand Drouvot <bertranddrouvot.pg@gmail.com>)
Responses Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
List pgsql-hackers
On Thu, Feb 15, 2024 at 11:24:59AM +0000, Bertrand Drouvot wrote:
> Thanks for looking at it!
>
> Agree, using an assertion instead in v3 attached.

And you did not forget the PG_USED_FOR_ASSERTS_ONLY.

> > It seems important to document why we are saving this data here; while
> > we hold the LWLock for repslots, before we perform any termination,
> > and we  want to protect conflict reports with incorrect data if the
> > slot data got changed while the lwlock is temporarily released while
> > there's a termination.
>
> Yeah, comments added in v3.

The contents look rather OK, I may do some word-smithing for both.

>> For example just after the kill()?  It seems to me
>> that we should stuck the checkpointer, perform a forced upgrade of the
>> xmins, release the checkpointer and see the effects of the change in
>> the second loop.  Now, modules/injection_points/ does not allow that,
>> yet, but I've posted a patch among these lines that can stop a process
>> and release it using a condition variable (should be 0006 on [1]).  I
>> was planning to start a new thread with a patch posted for the next CF
>> to add this kind of facility with a regression test for an old bug,
>> the patch needs a few hours of love, first.  I should be able to post
>> that next week.
>
> Great, that looks like a good idea!

I've been finally able to spend some time on what I had in mind and
posted it here for the next CF:
https://www.postgresql.org/message-id/ZdLuxBk5hGpol91B@paquier.xyz

You should be able to reuse that the same way I do in 0002 posted on
the thread, where I'm causing the checkpointer to wait, then wake it
up.

> Are we going to fix this on master and 16 stable first and then later on add a
> test on master with the injection points?

Still, the other patch is likely going to take a couple of weeks
before getting into the tree, so I have no objection to fix the bug
first and backpatch, then introduce a test.  Things have proved that
failures could show up in the buildfarm in this area, so more time
running this code across two branches is not a bad concept, either.
--
Michael

Attachment

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: A new message seems missing a punctuation
Next
From: Amul Sul
Date:
Subject: Re: Add system identifier to backup manifest