Re: Fix race condition in InvalidatePossiblyObsoleteSlot() - Mailing list pgsql-hackers

From Bertrand Drouvot
Subject Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
Date
Msg-id ZakzrE6Oib9N5Nar@ip-10-97-1-34.eu-west-3.compute.internal
Whole thread Raw
In response to Re: Fix race condition in InvalidatePossiblyObsoleteSlot()  (Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>)
Responses Re: Fix race condition in InvalidatePossiblyObsoleteSlot()
List pgsql-hackers
Hi,

On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote:
> IIUC, the issue is that we terminate the process holding the
> replication slot, and the conflict cause that we recorded before
> terminating the process changes in the next iteration due to the
> advancement in effective_xmin and/or effective_catalog_xmin.

Thanks for looking at it!

Yeah, and that could lead to no conflict detected anymore (like in the
case [2] reported up-thread).

> FWIW, a test code something like [1], can help detect above race issues, right?

I think so and I've added it in v2 attached (except that it uses the new
"terminated" variable, see below), thanks!

> Some comments on the patch:
> 
> 1.
>                  last_signaled_pid = active_pid;
> +                terminated = true;
>              }
> 
> Why is a separate variable needed? Can't last_signaled_pid != 0 enough
> to determine that a process was terminated earlier?

Yeah probably, I thought about it but preferred to add a new variable for this 
purpose for clarity and avoid race conditions (in case futur patches "touch" the
last_signaled_pid anyhow). I was thinking that this part of the code is already
not that easy.

> 2. If my understanding of the racy behavior is right, can the same
> issue happen due to the advancement in restart_lsn?

I'm not sure as I never saw it but it should not hurt to also consider this
"potential" case so it's done in v2 attached.

> I'm not sure if it
> can happen at all, but I think we can rely on previous conflict reason
> instead of previous effective_xmin/effective_catalog_xmin/restart_lsn.

I'm not sure what you mean here. I think we should still keep the "initial" LSN
so that the next check done with it still makes sense. The previous conflict
reason as you're proposing also makes sense to me but for another reason: PANIC
in case the issue still happen (for cases we did not think about, means not
covered by what the added previous LSNs are covering).

> 3. Is there a way to reproduce this racy issue, may be by adding some
> sleeps or something? If yes, can you share it here, just for the
> records and verify the whatever fix provided actually works?

Alexander was able to reproduce it on a slow machine and the issue was not there
anymore with v1 in place. I think it's tricky to reproduce as it would need the
slot to advance between the 2 checks.

Regards,

-- 
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

pgsql-hackers by date:

Previous
From: Andrei Lepikhov
Date:
Subject: Re: POC: GROUP BY optimization
Next
From: Aleksander Alekseev
Date:
Subject: Re: UUID v7