Hi,
On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote:
> IIUC, the issue is that we terminate the process holding the
> replication slot, and the conflict cause that we recorded before
> terminating the process changes in the next iteration due to the
> advancement in effective_xmin and/or effective_catalog_xmin.
Thanks for looking at it!
Yeah, and that could lead to no conflict detected anymore (like in the
case [2] reported up-thread).
> FWIW, a test code something like [1], can help detect above race issues, right?
I think so and I've added it in v2 attached (except that it uses the new
"terminated" variable, see below), thanks!
> Some comments on the patch:
>
> 1.
> last_signaled_pid = active_pid;
> + terminated = true;
> }
>
> Why is a separate variable needed? Can't last_signaled_pid != 0 enough
> to determine that a process was terminated earlier?
Yeah probably, I thought about it but preferred to add a new variable for this
purpose for clarity and avoid race conditions (in case futur patches "touch" the
last_signaled_pid anyhow). I was thinking that this part of the code is already
not that easy.
> 2. If my understanding of the racy behavior is right, can the same
> issue happen due to the advancement in restart_lsn?
I'm not sure as I never saw it but it should not hurt to also consider this
"potential" case so it's done in v2 attached.
> I'm not sure if it
> can happen at all, but I think we can rely on previous conflict reason
> instead of previous effective_xmin/effective_catalog_xmin/restart_lsn.
I'm not sure what you mean here. I think we should still keep the "initial" LSN
so that the next check done with it still makes sense. The previous conflict
reason as you're proposing also makes sense to me but for another reason: PANIC
in case the issue still happen (for cases we did not think about, means not
covered by what the added previous LSNs are covering).
> 3. Is there a way to reproduce this racy issue, may be by adding some
> sleeps or something? If yes, can you share it here, just for the
> records and verify the whatever fix provided actually works?
Alexander was able to reproduce it on a slow machine and the issue was not there
anymore with v1 in place. I think it's tricky to reproduce as it would need the
slot to advance between the 2 checks.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com