Re: Fix race condition in InvalidatePossiblyObsoleteSlot() - Mailing list pgsql-hackers
From | Bertrand Drouvot |
---|---|
Subject | Re: Fix race condition in InvalidatePossiblyObsoleteSlot() |
Date | |
Msg-id | Zc30i8fB7VYrTVw+@ip-10-97-1-34.eu-west-3.compute.internal Whole thread Raw |
In response to | Re: Fix race condition in InvalidatePossiblyObsoleteSlot() (Michael Paquier <michael@paquier.xyz>) |
Responses |
Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
|
List | pgsql-hackers |
Hi, On Thu, Feb 15, 2024 at 02:09:45PM +0900, Michael Paquier wrote: > On Thu, Jan 18, 2024 at 02:20:28PM +0000, Bertrand Drouvot wrote: > > On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote: > >> I'm not sure if it > >> can happen at all, but I think we can rely on previous conflict reason > >> instead of previous effective_xmin/effective_catalog_xmin/restart_lsn. > > > > I'm not sure what you mean here. I think we should still keep the "initial" LSN > > so that the next check done with it still makes sense. The previous conflict > > reason as you're proposing also makes sense to me but for another reason: PANIC > > in case the issue still happen (for cases we did not think about, means not > > covered by what the added previous LSNs are covering). > > Using a PANIC seems like an overreaction to me for this path. Sure, > we should not record twice a conflict reason, but wouldn't an > assertion be more adapted enough to check that a termination is not > logged twice for this code path run in the checkpointer? Thanks for looking at it! Agree, using an assertion instead in v3 attached. > > + if (!terminated) > + { > + initial_restart_lsn = s->data.restart_lsn; > + initial_effective_xmin = s->effective_xmin; > + initial_catalog_effective_xmin = s->effective_catalog_xmin; > + } > > It seems important to document why we are saving this data here; while > we hold the LWLock for repslots, before we perform any termination, > and we want to protect conflict reports with incorrect data if the > slot data got changed while the lwlock is temporarily released while > there's a termination. Yeah, comments added in v3. > >> 3. Is there a way to reproduce this racy issue, may be by adding some > >> sleeps or something? If yes, can you share it here, just for the > >> records and verify the whatever fix provided actually works? > > > > Alexander was able to reproduce it on a slow machine and the issue was not there > > anymore with v1 in place. I think it's tricky to reproduce as it would need the > > slot to advance between the 2 checks. > > I'd really want a test for that in the long term. And an injection > point stuck between the point where ReplicationSlotControlLock is > released then re-acquired when there is an active PID should be > enough, isn't it? Yeah, that should be enough. > For example just after the kill()? It seems to me > that we should stuck the checkpointer, perform a forced upgrade of the > xmins, release the checkpointer and see the effects of the change in > the second loop. Now, modules/injection_points/ does not allow that, > yet, but I've posted a patch among these lines that can stop a process > and release it using a condition variable (should be 0006 on [1]). I > was planning to start a new thread with a patch posted for the next CF > to add this kind of facility with a regression test for an old bug, > the patch needs a few hours of love, first. I should be able to post > that next week. Great, that looks like a good idea! Are we going to fix this on master and 16 stable first and then later on add a test on master with the injection points? Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Attachment
pgsql-hackers by date: