Thread: Fix race condition in InvalidatePossiblyObsoleteSlot()
Hi hackers,

While working on [1], we discovered (thanks Alexander for the testing) that a
conflicting active logical slot on a standby could be "terminated" without
leading to an "obsolete" message (see [2]).

Indeed, in case of an active slot we proceed in 2 steps in
InvalidatePossiblyObsoleteSlot():

- terminate the backend holding the slot
- report the slot as obsolete

This is racy because between the two steps we release the mutex on the slot, which
means that the slot's effective_xmin and effective_catalog_xmin could advance
during that time (leading us to exit the loop).

I think that holding the mutex longer is not an option (given what we'd need to do
while holding it), so the attached proposal is to instead record the
effective_xmin and effective_catalog_xmin that were used during the backend
termination.

[1]: https://www.postgresql.org/message-id/flat/bf67e076-b163-9ba3-4ade-b9fc51a3a8f6%40gmail.com
[2]: https://www.postgresql.org/message-id/7c025095-5763-17a6-8c80-899b35bd0459%40gmail.com

Looking forward to your feedback,

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
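To make the race concrete, here is a minimal, single-threaded sketch of the
scenario described above. It is not the actual InvalidatePossiblyObsoleteSlot()
code: the FakeSlot struct, the conflicts() helper and the transaction-id values
are invented for illustration, and PostgreSQL's wraparound-aware xid comparison
is simplified to a plain integer test.

/*
 * race_sketch.c
 *
 * Hypothetical, single-threaded model of the race described above; the
 * FakeSlot struct, conflicts() helper and transaction-id values are made
 * up for illustration, and the wraparound-aware xid comparison used by
 * PostgreSQL is simplified to a plain integer comparison.
 *
 * Build and run: cc -o race_sketch race_sketch.c && ./race_sketch
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

typedef struct FakeSlot
{
	TransactionId effective_catalog_xmin;	/* protected by a mutex in PG */
} FakeSlot;

/* The slot conflicts when its catalog_xmin is not newer than the horizon. */
static bool
conflicts(TransactionId catalog_xmin, TransactionId horizon)
{
	return catalog_xmin <= horizon;
}

int
main(void)
{
	FakeSlot	slot = {.effective_catalog_xmin = 740};
	TransactionId horizon = 770;

	/* First pass: the slot conflicts, so the backend holding it is terminated. */
	TransactionId saved_catalog_xmin = slot.effective_catalog_xmin;

	printf("first pass: conflict=%d -> terminate backend\n",
		   conflicts(slot.effective_catalog_xmin, horizon));

	/*
	 * The mutex is released while waiting for the terminated backend to go
	 * away; meanwhile the slot's catalog_xmin advances past the horizon.
	 */
	slot.effective_catalog_xmin = 780;

	/* Second pass, pre-fix: re-reading the live value loses the conflict. */
	printf("second pass, live value : conflict=%d (termination never reported)\n",
		   conflicts(slot.effective_catalog_xmin, horizon));

	/* Second pass, with the fix: reuse the value saved before termination. */
	printf("second pass, saved value: conflict=%d (slot reported obsolete)\n",
		   conflicts(saved_catalog_xmin, horizon));

	return 0;
}

With the live value, the second evaluation no longer sees a conflict, matching
the "terminated but never reported obsolete" symptom; with the value saved
before the termination, the conflict is still reported.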
On Mon, Jan 15, 2024 at 1:18 PM Bertrand Drouvot
<bertranddrouvot.pg@gmail.com> wrote:
>
> Hi hackers,
>
> While working on [1], we discovered (thanks Alexander for the testing) that a
> conflicting active logical slot on a standby could be "terminated" without
> leading to an "obsolete" message (see [2]).
>
> Indeed, in case of an active slot we proceed in 2 steps in
> InvalidatePossiblyObsoleteSlot():
>
> - terminate the backend holding the slot
> - report the slot as obsolete
>
> This is racy because between the two steps we release the mutex on the slot, which
> means that the slot's effective_xmin and effective_catalog_xmin could advance
> during that time (leading us to exit the loop).
>
> I think that holding the mutex longer is not an option (given what we'd need to do
> while holding it), so the attached proposal is to instead record the
> effective_xmin and effective_catalog_xmin that were used during the backend
> termination.
>
> [1]: https://www.postgresql.org/message-id/flat/bf67e076-b163-9ba3-4ade-b9fc51a3a8f6%40gmail.com
> [2]: https://www.postgresql.org/message-id/7c025095-5763-17a6-8c80-899b35bd0459%40gmail.com
>
> Looking forward to your feedback,

IIUC, the issue is that we terminate the process holding the
replication slot, and the conflict cause that we recorded before
terminating the process changes in the next iteration due to the
advancement in effective_xmin and/or effective_catalog_xmin.

FWIW, test code something like [1] can help detect such race issues, right?

Some comments on the patch:

1.
     last_signaled_pid = active_pid;
+    terminated = true;
 }

Why is a separate variable needed? Isn't last_signaled_pid != 0 enough
to determine that a process was terminated earlier?

2. If my understanding of the racy behavior is right, can the same
issue happen due to the advancement in restart_lsn? I'm not sure if it
can happen at all, but I think we can rely on the previous conflict reason
instead of the previous effective_xmin/effective_catalog_xmin/restart_lsn.

3. Is there a way to reproduce this racy issue, maybe by adding some
sleeps or something? If yes, can you share it here, just for the
records, and to verify that whatever fix is provided actually works?

[1]
diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 52da694c79..d020b038bc 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -1352,6 +1352,7 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 {
 	int			last_signaled_pid = 0;
 	bool		released_lock = false;
+	ReplicationSlotInvalidationCause conflict_prev = RS_INVAL_NONE;

 	for (;;)
 	{
@@ -1417,6 +1418,18 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 			}
 		}

+		/*
+		 * Check if the conflict cause recorded previously before we terminate
+		 * the process changed now for any reason.
+		 */
+		if (conflict_prev != RS_INVAL_NONE &&
+			last_signaled_pid != 0 &&
+			conflict_prev != conflict)
+			elog(PANIC, "conflict cause recorded before terminating process %d has been changed; previous cause: %d, current cause: %d",
+				 last_signaled_pid,
+				 conflict_prev,
+				 conflict);
+
 		/* if there's no conflict, we're done */
 		if (conflict == RS_INVAL_NONE)
 		{
@@ -1499,6 +1512,7 @@ InvalidatePossiblyObsoleteSlot(ReplicationSlotInvalidationCause cause,
 			(void) kill(active_pid, SIGTERM);

 			last_signaled_pid = active_pid;
+			conflict_prev = conflict;
 		}

 		/* Wait until the slot is released. */

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote:
> IIUC, the issue is that we terminate the process holding the
> replication slot, and the conflict cause that we recorded before
> terminating the process changes in the next iteration due to the
> advancement in effective_xmin and/or effective_catalog_xmin.

Thanks for looking at it!

Yeah, and that could lead to no conflict being detected anymore (as in the case
[2] reported up-thread).

> FWIW, test code something like [1] can help detect such race issues, right?

I think so, and I've added it in v2 attached (except that it uses the new
"terminated" variable, see below), thanks!

> Some comments on the patch:
>
> 1.
>      last_signaled_pid = active_pid;
> +    terminated = true;
>  }
>
> Why is a separate variable needed? Isn't last_signaled_pid != 0 enough
> to determine that a process was terminated earlier?

Yeah, probably. I thought about it but preferred to add a new variable for
this purpose for clarity and to avoid race conditions (in case future patches
"touch" last_signaled_pid somehow). I was also thinking that this part of the
code is already not that easy.

> 2. If my understanding of the racy behavior is right, can the same
> issue happen due to the advancement in restart_lsn?

I'm not sure, as I never saw it, but it should not hurt to also consider this
"potential" case, so it's done in v2 attached.

> I'm not sure if it
> can happen at all, but I think we can rely on the previous conflict reason
> instead of the previous effective_xmin/effective_catalog_xmin/restart_lsn.

I'm not sure what you mean here. I think we should still keep the "initial" LSN
so that the next check done with it still makes sense. The previous conflict
reason as you're proposing also makes sense to me, but for another reason: PANIC
in case the issue still happens (for cases we did not think about, i.e. not
covered by what the added previous LSNs are covering).

> 3. Is there a way to reproduce this racy issue, maybe by adding some
> sleeps or something? If yes, can you share it here, just for the
> records, and to verify that whatever fix is provided actually works?

Alexander was able to reproduce it on a slow machine, and the issue was not
there anymore with v1 in place. I think it's tricky to reproduce as it would
need the slot to advance between the 2 checks.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 18, 2024 at 02:20:28PM +0000, Bertrand Drouvot wrote:
> On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote:
>> I'm not sure if it
>> can happen at all, but I think we can rely on the previous conflict reason
>> instead of the previous effective_xmin/effective_catalog_xmin/restart_lsn.
>
> I'm not sure what you mean here. I think we should still keep the "initial" LSN
> so that the next check done with it still makes sense. The previous conflict
> reason as you're proposing also makes sense to me, but for another reason: PANIC
> in case the issue still happens (for cases we did not think about, i.e. not
> covered by what the added previous LSNs are covering).

Using a PANIC seems like an overreaction to me for this path. Sure,
we should not record a conflict reason twice, but wouldn't an
assertion be sufficient to check that a termination is not
logged twice for this code path run in the checkpointer?

+       if (!terminated)
+       {
+               initial_restart_lsn = s->data.restart_lsn;
+               initial_effective_xmin = s->effective_xmin;
+               initial_catalog_effective_xmin = s->effective_catalog_xmin;
+       }

It seems important to document why we are saving this data here: while
we hold the LWLock for repslots, before we perform any termination,
we want to protect conflict reports from incorrect data in case the
slot data got changed while the LWLock is temporarily released during
a termination. No need to mention the pesky standby snapshot records,
I guess, as there could be different reasons why these xmins advance.

>> 3. Is there a way to reproduce this racy issue, maybe by adding some
>> sleeps or something? If yes, can you share it here, just for the
>> records, and to verify that whatever fix is provided actually works?
>
> Alexander was able to reproduce it on a slow machine, and the issue was not
> there anymore with v1 in place. I think it's tricky to reproduce as it would
> need the slot to advance between the 2 checks.

I'd really want a test for that in the long term. And an injection
point stuck between the point where ReplicationSlotControlLock is
released and then re-acquired when there is an active PID should be
enough, shouldn't it? For example just after the kill()? It seems to me
that we should make the checkpointer stuck, perform a forced advancement of the
xmins, release the checkpointer and see the effects of the change in
the second loop. Now, modules/injection_points/ does not allow that
yet, but I've posted a patch along these lines that can stop a process
and release it using a condition variable (it should be 0006 on [1]). I
was planning to start a new thread with a patch posted for the next CF
to add this kind of facility with a regression test for an old bug;
the patch needs a few hours of love first. I should be able to post
that next week.

[1]: https://www.postgresql.org/message-id/CAExHW5uwP9RmCt9bA91bUfKNQeUrosAWtMens4Ah2PuYZv2NRA@mail.gmail.com
--
Michael
Hi,

On Thu, Feb 15, 2024 at 02:09:45PM +0900, Michael Paquier wrote:
> On Thu, Jan 18, 2024 at 02:20:28PM +0000, Bertrand Drouvot wrote:
> > On Thu, Jan 18, 2024 at 04:59:39PM +0530, Bharath Rupireddy wrote:
> >> I'm not sure if it
> >> can happen at all, but I think we can rely on the previous conflict reason
> >> instead of the previous effective_xmin/effective_catalog_xmin/restart_lsn.
> >
> > I'm not sure what you mean here. I think we should still keep the "initial" LSN
> > so that the next check done with it still makes sense. The previous conflict
> > reason as you're proposing also makes sense to me, but for another reason: PANIC
> > in case the issue still happens (for cases we did not think about, i.e. not
> > covered by what the added previous LSNs are covering).
>
> Using a PANIC seems like an overreaction to me for this path. Sure,
> we should not record a conflict reason twice, but wouldn't an
> assertion be sufficient to check that a termination is not
> logged twice for this code path run in the checkpointer?

Thanks for looking at it!

Agree, using an assertion instead in v3 attached.

> +       if (!terminated)
> +       {
> +               initial_restart_lsn = s->data.restart_lsn;
> +               initial_effective_xmin = s->effective_xmin;
> +               initial_catalog_effective_xmin = s->effective_catalog_xmin;
> +       }
>
> It seems important to document why we are saving this data here: while
> we hold the LWLock for repslots, before we perform any termination,
> we want to protect conflict reports from incorrect data in case the
> slot data got changed while the LWLock is temporarily released during
> a termination.

Yeah, comments added in v3.

> >> 3. Is there a way to reproduce this racy issue, maybe by adding some
> >> sleeps or something? If yes, can you share it here, just for the
> >> records, and to verify that whatever fix is provided actually works?
> >
> > Alexander was able to reproduce it on a slow machine, and the issue was not
> > there anymore with v1 in place. I think it's tricky to reproduce as it would
> > need the slot to advance between the 2 checks.
>
> I'd really want a test for that in the long term. And an injection
> point stuck between the point where ReplicationSlotControlLock is
> released and then re-acquired when there is an active PID should be
> enough, shouldn't it?

Yeah, that should be enough.

> For example just after the kill()? It seems to me
> that we should make the checkpointer stuck, perform a forced advancement of the
> xmins, release the checkpointer and see the effects of the change in
> the second loop. Now, modules/injection_points/ does not allow that
> yet, but I've posted a patch along these lines that can stop a process
> and release it using a condition variable (it should be 0006 on [1]). I
> was planning to start a new thread with a patch posted for the next CF
> to add this kind of facility with a regression test for an old bug;
> the patch needs a few hours of love first. I should be able to post
> that next week.

Great, that looks like a good idea!

Are we going to fix this on master and 16 stable first and then later on add a
test on master with the injection points?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Feb 15, 2024 at 11:24:59AM +0000, Bertrand Drouvot wrote: > Thanks for looking at it! > > Agree, using an assertion instead in v3 attached. And you did not forget the PG_USED_FOR_ASSERTS_ONLY. > > It seems important to document why we are saving this data here; while > > we hold the LWLock for repslots, before we perform any termination, > > and we want to protect conflict reports with incorrect data if the > > slot data got changed while the lwlock is temporarily released while > > there's a termination. > > Yeah, comments added in v3. The contents look rather OK, I may do some word-smithing for both. >> For example just after the kill()? It seems to me >> that we should stuck the checkpointer, perform a forced upgrade of the >> xmins, release the checkpointer and see the effects of the change in >> the second loop. Now, modules/injection_points/ does not allow that, >> yet, but I've posted a patch among these lines that can stop a process >> and release it using a condition variable (should be 0006 on [1]). I >> was planning to start a new thread with a patch posted for the next CF >> to add this kind of facility with a regression test for an old bug, >> the patch needs a few hours of love, first. I should be able to post >> that next week. > > Great, that looks like a good idea! I've been finally able to spend some time on what I had in mind and posted it here for the next CF: https://www.postgresql.org/message-id/ZdLuxBk5hGpol91B@paquier.xyz You should be able to reuse that the same way I do in 0002 posted on the thread, where I'm causing the checkpointer to wait, then wake it up. > Are we going to fix this on master and 16 stable first and then later on add a > test on master with the injection points? Still, the other patch is likely going to take a couple of weeks before getting into the tree, so I have no objection to fix the bug first and backpatch, then introduce a test. Things have proved that failures could show up in the buildfarm in this area, so more time running this code across two branches is not a bad concept, either. -- Michael
Hi, On Mon, Feb 19, 2024 at 03:14:07PM +0900, Michael Paquier wrote: > On Thu, Feb 15, 2024 at 11:24:59AM +0000, Bertrand Drouvot wrote: > > Thanks for looking at it! > > > > Agree, using an assertion instead in v3 attached. > > And you did not forget the PG_USED_FOR_ASSERTS_ONLY. Thanks to the "CompilerWarnings" cirrus test ;-) > > > > It seems important to document why we are saving this data here; while > > > we hold the LWLock for repslots, before we perform any termination, > > > and we want to protect conflict reports with incorrect data if the > > > slot data got changed while the lwlock is temporarily released while > > > there's a termination. > > > > Yeah, comments added in v3. > > The contents look rather OK, I may do some word-smithing for both. Thanks! > >> For example just after the kill()? It seems to me > >> that we should stuck the checkpointer, perform a forced upgrade of the > >> xmins, release the checkpointer and see the effects of the change in > >> the second loop. Now, modules/injection_points/ does not allow that, > >> yet, but I've posted a patch among these lines that can stop a process > >> and release it using a condition variable (should be 0006 on [1]). I > >> was planning to start a new thread with a patch posted for the next CF > >> to add this kind of facility with a regression test for an old bug, > >> the patch needs a few hours of love, first. I should be able to post > >> that next week. > > > > Great, that looks like a good idea! > > I've been finally able to spend some time on what I had in mind and > posted it here for the next CF: > https://www.postgresql.org/message-id/ZdLuxBk5hGpol91B@paquier.xyz > > You should be able to reuse that the same way I do in 0002 posted on > the thread, where I'm causing the checkpointer to wait, then wake it > up. Thanks! I'll look at it. > > Are we going to fix this on master and 16 stable first and then later on add a > > test on master with the injection points? > > Still, the other patch is likely going to take a couple of weeks > before getting into the tree, so I have no objection to fix the bug > first and backpatch, then introduce a test. Things have proved that > failures could show up in the buildfarm in this area, so more time > running this code across two branches is not a bad concept, either. Fully agree. Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Mon, Feb 19, 2024 at 11:44 AM Michael Paquier <michael@paquier.xyz> wrote:
>
> > Yeah, comments added in v3.
>
> The contents look rather OK, I may do some word-smithing for both.

Here are some comments on v3:

1.
+       XLogRecPtr      initial_effective_xmin = InvalidXLogRecPtr;
+       XLogRecPtr      initial_catalog_effective_xmin = InvalidXLogRecPtr;
+       XLogRecPtr      initial_restart_lsn = InvalidXLogRecPtr;

The prefix 'initial_' makes the variable names a bit longer; I think we
can just use effective_xmin, catalog_effective_xmin and restart_lsn.
The code updating them only when "if (!terminated)" already tells one that they
aren't updated every time.

2.
+               /*
+                * We'll release the slot's mutex soon, so it's possible that
+                * those values change since the process holding the slot has been
+                * terminated (if any), so record them here to ensure we would
+                * report the slot as obsolete correctly.
+                */

This needs a bit more info as to why and how effective_xmin,
catalog_effective_xmin and restart_lsn can move ahead after signaling
a backend and before the signalled backend reports back.

3.
+               /*
+                * Assert that the conflict cause recorded previously before we
+                * terminate the process did not change now for any reason.
+                */
+               Assert(!(conflict_prev != RS_INVAL_NONE && terminated &&
+                                conflict_prev != conflict));

It took a while for me to understand the above condition; can we
simplify it like below, using De Morgan's laws, for better readability?

    Assert((conflict_prev == RS_INVAL_NONE) || !terminated ||
           (conflict_prev == conflict));

> > Are we going to fix this on master and 16 stable first and then later on add a
> > test on master with the injection points?
>
> Still, the other patch is likely going to take a couple of weeks
> before getting into the tree, so I have no objection to fix the bug
> first and backpatch, then introduce a test. Things have proved that
> failures could show up in the buildfarm in this area, so more time
> running this code across two branches is not a bad concept, either.

While I couldn't agree more on getting this fix in, it's worth pulling
in the required injection points patch here and writing the test to
reproduce this race issue.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On Mon, Feb 19, 2024 at 01:45:16PM +0530, Bharath Rupireddy wrote:
> On Mon, Feb 19, 2024 at 11:44 AM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > > Yeah, comments added in v3.
> >
> > The contents look rather OK, I may do some word-smithing for both.
>
> Here are some comments on v3:

Thanks for looking at it!

> 1.
> +       XLogRecPtr      initial_effective_xmin = InvalidXLogRecPtr;
> +       XLogRecPtr      initial_catalog_effective_xmin = InvalidXLogRecPtr;
> +       XLogRecPtr      initial_restart_lsn = InvalidXLogRecPtr;
>
> The prefix 'initial_' makes the variable names a bit longer; I think we
> can just use effective_xmin, catalog_effective_xmin and restart_lsn.
> The code updating them only when "if (!terminated)" already tells one that they
> aren't updated every time.

I'm not sure about that. I prefer to see meaningful names instead of having
to read the code where they are used.

> 2.
> +               /*
> +                * We'll release the slot's mutex soon, so it's possible that
> +                * those values change since the process holding the slot has been
> +                * terminated (if any), so record them here to ensure we would
> +                * report the slot as obsolete correctly.
> +                */
>
> This needs a bit more info as to why and how effective_xmin,
> catalog_effective_xmin and restart_lsn can move ahead after signaling
> a backend and before the signalled backend reports back.

I'm not sure about the added value of such extra details in this comment, or
whether the comment would be easy to maintain. I have the feeling that knowing
it's possible is enough here. Happy to hear what others think about it too.

> 3.
> +               /*
> +                * Assert that the conflict cause recorded previously before we
> +                * terminate the process did not change now for any reason.
> +                */
> +               Assert(!(conflict_prev != RS_INVAL_NONE && terminated &&
> +                                conflict_prev != conflict));
>
> It took a while for me to understand the above condition; can we
> simplify it like below, using De Morgan's laws, for better readability?
>
>     Assert((conflict_prev == RS_INVAL_NONE) || !terminated ||
>            (conflict_prev == conflict));

I don't have a strong opinion on this; it looks like a matter of taste.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Feb 19, 2024 at 09:49:24AM +0000, Bertrand Drouvot wrote:
> On Mon, Feb 19, 2024 at 01:45:16PM +0530, Bharath Rupireddy wrote:
>> The prefix 'initial_' makes the variable names a bit longer; I think we
>> can just use effective_xmin, catalog_effective_xmin and restart_lsn.
>> The code updating them only when "if (!terminated)" already tells one that they
>> aren't updated every time.
>
> I'm not sure about that. I prefer to see meaningful names instead of having
> to read the code where they are used.

Prefixing these with "initial_" is fine, IMO. That shows the
intention that these come from the slot's data before doing the
termination. So I'm OK with what's been proposed in v3.

>> This needs a bit more info as to why and how effective_xmin,
>> catalog_effective_xmin and restart_lsn can move ahead after signaling
>> a backend and before the signalled backend reports back.
>
> I'm not sure about the added value of such extra details in this comment, or
> whether the comment would be easy to maintain. I have the feeling that knowing
> it's possible is enough here. Happy to hear what others think about it too.

Documenting that the risk exists, rather than all the possible reasons why
this could happen, is surely more maintainable. In short, I'm OK with what
the patch does, just telling that it is possible.

>> +               Assert(!(conflict_prev != RS_INVAL_NONE && terminated &&
>> +                                conflict_prev != conflict));
>>
>> It took a while for me to understand the above condition; can we
>> simplify it like below, using De Morgan's laws, for better readability?
>>
>>     Assert((conflict_prev == RS_INVAL_NONE) || !terminated ||
>>            (conflict_prev == conflict));
>
> I don't have a strong opinion on this; it looks like a matter of taste.

Both are the same to me, so I have no extra opinion to offer. ;)
--
Michael
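As a side note, the equivalence of the two assertion forms discussed above can
be checked by brute force. The sketch below is illustrative only: the Cause
enum is a stand-in for ReplicationSlotInvalidationCause, and the program simply
enumerates every combination of inputs.

/*
 * assert_equiv.c
 *
 * Brute-force check that the v3 assertion and the De Morgan rewrite
 * proposed above are logically equivalent. The enum below is a stand-in
 * for ReplicationSlotInvalidationCause, not the real declaration.
 *
 * Build and run: cc -o assert_equiv assert_equiv.c && ./assert_equiv
 */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

typedef enum
{
	RS_INVAL_NONE,
	RS_INVAL_WAL_REMOVED,
	RS_INVAL_HORIZON,
	RS_INVAL_WAL_LEVEL
} Cause;

int
main(void)
{
	int			checked = 0;

	for (int p = RS_INVAL_NONE; p <= RS_INVAL_WAL_LEVEL; p++)
		for (int c = RS_INVAL_NONE; c <= RS_INVAL_WAL_LEVEL; c++)
			for (int terminated = 0; terminated <= 1; terminated++)
			{
				Cause		conflict_prev = (Cause) p;
				Cause		conflict = (Cause) c;

				/* Form used in v3. */
				bool		v3 = !(conflict_prev != RS_INVAL_NONE && terminated &&
								   conflict_prev != conflict);

				/* De Morgan rewrite. */
				bool		rewritten = (conflict_prev == RS_INVAL_NONE) ||
					!terminated ||
					(conflict_prev == conflict);

				assert(v3 == rewritten);
				checked++;
			}

	printf("both assertion forms agree on all %d combinations\n", checked);
	return 0;
}

Both forms fail exactly when a process was terminated for one conflict cause
and the recomputed cause differs, so picking between them really is a matter
of readability.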
On Tue, Feb 20, 2024 at 08:51:17AM +0900, Michael Paquier wrote: > Prefixing these with "initial_" is fine, IMO. That shows the > intention that these come from the slot's data before doing the > termination. So I'm OK with what's been proposed in v3. I was looking at that a second time, and just concluded that this is OK, so I've applied that down to 16, wordsmithing a bit the comments. > Both are the same to me, so I have no extra opinion to offer. ;) I've kept this one as-is, though. -- Michael
Hi, On Tue, Feb 20, 2024 at 02:33:44PM +0900, Michael Paquier wrote: > On Tue, Feb 20, 2024 at 08:51:17AM +0900, Michael Paquier wrote: > > Prefixing these with "initial_" is fine, IMO. That shows the > > intention that these come from the slot's data before doing the > > termination. So I'm OK with what's been proposed in v3. > > I was looking at that a second time, and just concluded that this is > OK, so I've applied that down to 16, wordsmithing a bit the comments. Thanks! FWIW, I've started to write a POC regarding the test we mentioned up-thread. The POC test is based on what has been submitted by Michael in [1]. The POC test seems to work fine and it seems that nothing more is needed in [1] (at some point I thought I would need the ability to wake up multiple "wait" injection points in sequence but that was not necessary). I'll polish and propose my POC test once [1] is pushed (unless you're curious about it before). [1]: https://www.postgresql.org/message-id/flat/ZdLuxBk5hGpol91B%40paquier.xyz Regards, -- Bertrand Drouvot PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi,

On Tue, Feb 20, 2024 at 04:03:53PM +0000, Bertrand Drouvot wrote:
> Hi,
>
> On Tue, Feb 20, 2024 at 02:33:44PM +0900, Michael Paquier wrote:
> > On Tue, Feb 20, 2024 at 08:51:17AM +0900, Michael Paquier wrote:
> > > Prefixing these with "initial_" is fine, IMO. That shows the
> > > intention that these come from the slot's data before doing the
> > > termination. So I'm OK with what's been proposed in v3.
> >
> > I was looking at that a second time, and just concluded that this is
> > OK, so I've applied that down to 16, wordsmithing a bit the comments.
>
> Thanks!
> FWIW, I've started to write a POC regarding the test we mentioned up-thread.
>
> The POC test is based on what has been submitted by Michael in [1]. The POC test
> seems to work fine and it seems that nothing more is needed in [1] (at some point
> I thought I would need the ability to wake up multiple "wait" injection points
> in sequence but that was not necessary).
>
> I'll polish and propose my POC test once [1] is pushed (unless you're curious
> about it before).

Though [1] mentioned up-thread is not pushed yet, I'm sharing the POC patch now
(see the attached).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Mon, Feb 26, 2024 at 02:01:45PM +0000, Bertrand Drouvot wrote:
> Though [1] mentioned up-thread is not pushed yet, I'm sharing the POC patch now
> (see the attached).

I have looked at what you have here.

First, in a build where 818fefd8fd is included, this makes the test
script a lot slower. Most of the logic is quick, but we're spending
10s or so checking that catalog_xmin has advanced. Could it be
possible to make that faster?

A second issue is the failure mode when 818fefd8fd is reverted. The
test is getting stuck when we are waiting on the standby to catch up,
until a timeout decides to kick in to fail the test, and all the
previous tests pass. Could it be possible to make that more
responsive? I assume that in the failure mode we would get an
incorrect conflict_reason for injection_inactiveslot, succeeding in
checking the failure.

+  my $terminated = 0;
+  for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+  {
+      if ($node_standby->log_contains(
+          'terminating process .* to release replication slot \"injection_activeslot\"', $logstart))
+      {
+          $terminated = 1;
+          last;
+      }
+      usleep(100_000);
+  }
+  ok($terminated, 'terminating process holding the active slot is logged with injection point');

The LOG exists when we are sure that the startup process is waiting
in the injection point, so this loop could be replaced with something
like:

+  $node_standby->wait_for_event('startup', 'TerminateProcessHoldingSlot');
+  ok( $node_standby->log_contains('terminating process .* .. ', 'termin .. ';)

Nit: the name of the injection point should be
terminate-process-holding-slot rather than
TerminateProcessHoldingSlot, to be consistent with the other ones.
--
Michael
Hi,

On Tue, Mar 05, 2024 at 09:42:20AM +0900, Michael Paquier wrote:
> On Mon, Feb 26, 2024 at 02:01:45PM +0000, Bertrand Drouvot wrote:
> > Though [1] mentioned up-thread is not pushed yet, I'm sharing the POC patch now
> > (see the attached).
>
> I have looked at what you have here.

Thanks!

> First, in a build where 818fefd8fd is included, this makes the test
> script a lot slower. Most of the logic is quick, but we're spending
> 10s or so checking that catalog_xmin has advanced. Could it be
> possible to make that faster?

Yeah, v2 attached changes this. It moves the injection point after the process
has been killed so that another process can decode at will (without the need
to wait for a walsender timeout) to reach LogicalConfirmReceivedLocation().

> A second issue is the failure mode when 818fefd8fd is reverted. The
> test is getting stuck when we are waiting on the standby to catch up,
> until a timeout decides to kick in to fail the test, and all the
> previous tests pass. Could it be possible to make that more
> responsive? I assume that in the failure mode we would get an
> incorrect conflict_reason for injection_inactiveslot, succeeding in
> checking the failure.

I tried to simulate a revert of 818fefd8fd (replacing "!terminated" by "1 == 1"
before the initial_* assignments). The issue is that then the new assertion is
triggered, leading to the standby shutdown. So, I'm not sure how to improve this
case.

> +  my $terminated = 0;
> +  for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
> +  {
> +      if ($node_standby->log_contains(
> +          'terminating process .* to release replication slot \"injection_activeslot\"', $logstart))
> +      {
> +          $terminated = 1;
> +          last;
> +      }
> +      usleep(100_000);
> +  }
> +  ok($terminated, 'terminating process holding the active slot is logged with injection point');
>
> The LOG exists when we are sure that the startup process is waiting
> in the injection point, so this loop could be replaced with something
> like:
> +  $node_standby->wait_for_event('startup', 'TerminateProcessHoldingSlot');
> +  ok( $node_standby->log_contains('terminating process .* .. ', 'termin .. ';)

Yeah, now that wait_for_event() is there, let's use it: done in v2.

> Nit: the name of the injection point should be
> terminate-process-holding-slot rather than
> TerminateProcessHoldingSlot, to be consistent with the other ones.

Done in v2.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Tue, Mar 05, 2024 at 10:17:03AM +0000, Bertrand Drouvot wrote:
> On Tue, Mar 05, 2024 at 09:42:20AM +0900, Michael Paquier wrote:
>> First, in a build where 818fefd8fd is included, this makes the test
>> script a lot slower. Most of the logic is quick, but we're spending
>> 10s or so checking that catalog_xmin has advanced. Could it be
>> possible to make that faster?
>
> Yeah, v2 attached changes this. It moves the injection point after the process
> has been killed so that another process can decode at will (without the need
> to wait for a walsender timeout) to reach LogicalConfirmReceivedLocation().

Ah, OK. Indeed that's much faster this way.

> I tried to simulate a revert of 818fefd8fd (replacing "!terminated" by "1 == 1"
> before the initial_* assignments).

Yeah. I can see how this messes up the calculation of the conditions,
which is enough from my perspective, even if we don't have any sanity
checks in 818fefd8fd originally.

> The issue is that then the new assertion is
> triggered, leading to the standby shutdown. So, I'm not sure how to improve this
> case.

It's been mentioned recently that we are not good at detecting crashes
in the TAP tests. I am wondering if we should check the status of the
node in the most popular routines of Cluster.pm and die hard, as one
way to make the tests more responsive. A topic for a different
thread, for sure.

> Done in v2.

Reworded a few things and applied this version.
--
Michael
Hi,

On Wed, Mar 06, 2024 at 02:47:55PM +0900, Michael Paquier wrote:
> On Tue, Mar 05, 2024 at 10:17:03AM +0000, Bertrand Drouvot wrote:
> > On Tue, Mar 05, 2024 at 09:42:20AM +0900, Michael Paquier wrote:
> > The issue is that then the new assertion is
> > triggered, leading to the standby shutdown. So, I'm not sure how to improve this
> > case.
>
> It's been mentioned recently that we are not good at detecting crashes
> in the TAP tests. I am wondering if we should check the status of the
> node in the most popular routines of Cluster.pm and die hard, as one
> way to make the tests more responsive. A topic for a different
> thread, for sure.

Right, somehow out of context here.

> > Done in v2.
>
> Reworded a few things and applied this version.

Thanks!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Mar 06, 2024 at 09:17:58AM +0000, Bertrand Drouvot wrote:
> Right, somehow out of context here.

We're not in the green yet; one of my animals has complained:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2024-03-06%2010%3A10%3A03

This is telling us that the xmin horizon is unchanged, and the test
cannot move on with the injection point wake-up that would trigger the
following logs:
2024-03-06 20:12:59.039 JST [21143] LOG:  invalidating obsolete replication slot "injection_activeslot"
2024-03-06 20:12:59.039 JST [21143] DETAIL:  The slot conflicted with xid horizon 770.

Not sure what to think about that yet.
--
Michael
On Wed, Mar 6, 2024 at 4:51 PM Michael Paquier <michael@paquier.xyz> wrote:
>
> On Wed, Mar 06, 2024 at 09:17:58AM +0000, Bertrand Drouvot wrote:
> > Right, somehow out of context here.
>
> We're not in the green yet; one of my animals has complained:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2024-03-06%2010%3A10%3A03
>
> This is telling us that the xmin horizon is unchanged, and the test
> cannot move on with the injection point wake-up that would trigger the
> following logs:
> 2024-03-06 20:12:59.039 JST [21143] LOG:  invalidating obsolete replication slot "injection_activeslot"
> 2024-03-06 20:12:59.039 JST [21143] DETAIL:  The slot conflicted with xid horizon 770.
>
> Not sure what to think about that yet.

Windows - Server 2019, VS 2019 - Meson & ninja on my CI setup isn't
happy about that as well [1]. It looks like the slot's catalog_xmin on
the standby isn't moving forward.

[1] https://cirrus-ci.com/task/5132148995260416

[09:11:17.851] 285/285 postgresql:recovery / recovery/035_standby_logical_decoding ERROR 553.48s (exit status 255 or signal 127 SIGinvalid)
[09:11:17.855] >>> INITDB_TEMPLATE=C:/cirrus/build/tmp_install/initdb-template enable_injection_points=yes PG_REGRESS=C:\cirrus\build\src/test\regress\pg_regress.exe MALLOC_PERTURB_=172 REGRESS_SHLIB=C:\cirrus\build\src/test\regress\regress.dll PATH=C:/cirrus/build/tmp_install/usr/local/pgsql/bin;C:\cirrus\build\src/test\recovery;C:/cirrus/build/src/test/recovery/test;C:\BuildTools\VC\Tools\MSVC\14.29.30133\bin\HostX64\x64;C:\BuildTools\MSBuild\Current\bin\Roslyn;C:\Program Files (x86)\Windows Kits\10\bin\10.0.20348.0\x64;C:\Program Files (x86)\Windows Kits\10\bin\x64;C:\BuildTools\\MSBuild\Current\Bin;C:\Windows\Microsoft.NET\Framework64\v4.0.30319;C:\BuildTools\Common7\IDE\;C:\BuildTools\Common7\Tools\;C:\BuildTools\VC\Auxiliary\Build;C:\zstd\zstd-v1.5.2-win64;C:\zlib;C:\lz4;C:\icu;C:\winflexbison;C:\strawberry\5.26.3.1\perl\bin;C:\python\Scripts\;C:\python\;C:\Windows Kits\10\Debuggers\x64;C:\Program Files\Git\usr\bin;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\ProgramData\GooGet;C:\Program Files\Google\Compute Engine\metadata_scripts;C:\Program Files (x86)\Google\Cloud SDK\google-cloud-sdk\bin;C:\Program Files\PowerShell\7\;C:\Program Files\Google\Compute Engine\sysprep;C:\ProgramData\chocolatey\bin;C:\Program Files\Git\cmd;C:\Program Files\Git\mingw64\bin;C:\Program Files\Git\usr\bin;C:\Windows\system32\config\systemprofile\AppData\Local\Microsoft\WindowsApps C:\python\python3.EXE C:\cirrus\build\..\src/tools/testwrap --basedir C:\cirrus\build --srcdir C:\cirrus\src/test\recovery --testgroup recovery --testname 035_standby_logical_decoding -- C:\strawberry\5.26.3.1\perl\bin\perl.EXE -I C:/cirrus/src/test/perl -I C:\cirrus\src/test\recovery C:/cirrus/src/test/recovery/t/035_standby_logical_decoding.pl
[09:11:17.855] ------------------------------------- 8< -------------------------------------
[09:11:17.855] stderr:
[09:11:17.855] # poll_query_until timed out executing this query:
[09:11:17.855] # SELECT (SELECT catalog_xmin::text::int - 770 from pg_catalog.pg_replication_slots where slot_name = 'injection_activeslot') > 0
[09:11:17.855] # expecting this output:
[09:11:17.855] # t
[09:11:17.855] # last actual query output:
[09:11:17.855] # f
[09:11:17.855] # with stderr:
[09:11:17.855] # Tests were run but no plan was declared and done_testing() was not seen.
[09:11:17.855] # Looks like your test exited with 255 just after 57.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Hi,

On Wed, Mar 06, 2024 at 05:45:56PM +0530, Bharath Rupireddy wrote:
> On Wed, Mar 6, 2024 at 4:51 PM Michael Paquier <michael@paquier.xyz> wrote:
> >
> > On Wed, Mar 06, 2024 at 09:17:58AM +0000, Bertrand Drouvot wrote:
> > > Right, somehow out of context here.
> >
> > We're not in the green yet; one of my animals has complained:
> > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hachi&dt=2024-03-06%2010%3A10%3A03
> >
> > This is telling us that the xmin horizon is unchanged, and the test
> > cannot move on with the injection point wake-up that would trigger the
> > following logs:
> > 2024-03-06 20:12:59.039 JST [21143] LOG:  invalidating obsolete replication slot "injection_activeslot"
> > 2024-03-06 20:12:59.039 JST [21143] DETAIL:  The slot conflicted with xid horizon 770.
> >
> > Not sure what to think about that yet.
>
> Windows - Server 2019, VS 2019 - Meson & ninja on my CI setup isn't
> happy about that as well [1]. It looks like the slot's catalog_xmin on
> the standby isn't moving forward.

Thank you both for the report! I did a few tests manually and can see the issue
from time to time. When the issue occurs, the logical decoding was able to
go through the place where LogicalConfirmReceivedLocation() updates the
slot's catalog_xmin before being killed. In that case I can see that the
catalog_xmin is updated to the xid horizon.

That means in a failed test we have something like:

slot's catalog_xmin: 839 and "The slot conflicted with xid horizon 839."

While when the test is ok you'll see something like:

slot's catalog_xmin: 841 and "The slot conflicted with xid horizon 842."

In the failing test the call to SELECT pg_logical_slot_get_changes() does
not advance the slot's catalog_xmin anymore.

To fix this, I think we need a new transaction to decode from the primary before
executing pg_logical_slot_get_changes(). But this transaction has to be replayed
on the standby first by the startup process. Which means we need to wake up
"terminate-process-holding-slot" and that we probably need another injection
point somewhere in this test.

I'll look at it unless you have another idea?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Wed, Mar 06, 2024 at 05:45:56PM +0000, Bertrand Drouvot wrote:
> Thank you both for the report! I did a few tests manually and can see the issue
> from time to time. When the issue occurs, the logical decoding was able to
> go through the place where LogicalConfirmReceivedLocation() updates the
> slot's catalog_xmin before being killed. In that case I can see that the
> catalog_xmin is updated to the xid horizon.

Two buildfarm machines have complained here, and one of them twice in
a row. That's quite amazing, because a couple of dozen runs done in a
row on the same host as these animals all pass. The CI did not
complain either (did 2~3 runs there yesterday).

> That means in a failed test we have something like:
>
> slot's catalog_xmin: 839 and "The slot conflicted with xid horizon 839."
>
> While when the test is ok you'll see something like:
>
> slot's catalog_xmin: 841 and "The slot conflicted with xid horizon 842."

Perhaps we should also make the test report the catalog_xmin of the
slot. That may make debugging a bit easier.

> In the failing test the call to SELECT pg_logical_slot_get_changes() does
> not advance the slot's catalog_xmin anymore.

Is that something that we could enforce in the test in a stronger way,
cross-checking the xmin horizon before and after the call?

> To fix this, I think we need a new transaction to decode from the primary before
> executing pg_logical_slot_get_changes(). But this transaction has to be replayed
> on the standby first by the startup process. Which means we need to wake up
> "terminate-process-holding-slot" and that we probably need another injection
> point somewhere in this test.
>
> I'll look at it unless you have another idea?

I am wondering if there is something else lurking here, actually, so
for now I am going to revert the change, as it is annoying to get
sporadic failures in the CF bot at this time of the year and there are
a lot of patches under discussion. Let's give it more time and more
thoughts, without pressure.
--
Michael