Thread: Race condition in InvalidateObsoleteReplicationSlots()
Hi,

I was looking at InvalidateObsoleteReplicationSlots() while reviewing /
polishing the logical decoding on standby patch. That led me to notice what I
think is a race in InvalidateObsoleteReplicationSlots() (because
ResolveRecoveryConflictWithLogicalSlots has a closely related one).

void
InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno)
{
...
	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
	for (int i = 0; i < max_replication_slots; i++)
	{
...
		if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
			continue;

		LWLockRelease(ReplicationSlotControlLock);
...
		for (;;)
		{
...
			wspid = ReplicationSlotAcquireInternal(s, NULL, SAB_Inquire);
...
			SpinLockAcquire(&s->mutex);
			s->data.invalidated_at = s->data.restart_lsn;
			s->data.restart_lsn = InvalidXLogRecPtr;
			SpinLockRelease(&s->mutex);
...

As far as I can tell there's no guarantee that the slot wasn't concurrently
dropped and another replication slot created at the same offset in
ReplicationSlotCtl->replication_slots - which we would then promptly
invalidate, regardless of the new slot not actually needing to be
invalidated.

Note that this is different from the race mentioned in a comment:

			/*
			 * Signal to terminate the process that owns the slot.
			 *
			 * There is the race condition where other process may own
			 * the slot after the process using it was terminated and before
			 * this process owns it. To handle this case, we signal again
			 * if the PID of the owning process is changed than the last.
			 *
			 * XXX This logic assumes that the same PID is not reused
			 * very quickly.
			 */

It's one thing to terminate a connection erroneously - permanently breaking a
replica by invalidating the wrong slot is, imo, a different matter.

Interestingly this problem seems to have been present in both

commit c6550776394e25c1620bc8258427c8f1d448080d
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date:   2020-04-07 18:35:00 -0400

    Allow users to limit storage reserved by replication slots

and

commit f9e9704f09daf882f5a1cf1fbe3f5a3150ae2bb9
Author: Fujii Masao <fujii@postgresql.org>
Date:   2020-06-19 17:15:52 +0900

    Fix issues in invalidation of obsolete replication slots.

I think this can be solved in two different ways:

1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of
   InvalidateObsoleteReplicationSlots(). That way nobody could re-create a
   new slot in the to-be-obsoleted slot's place.

2) Atomically check whether the slot needs to be invalidated and try to
   acquire it if needed. Don't release ReplicationSlotControlLock between
   those two steps. Signal the owner to release the slot iff we couldn't
   acquire the slot. In the latter case wait and then recheck whether the
   slot still needs to be dropped.

To me 2) seems better, because we can then also be sure that the slot still
needs to be obsoleted, rather than potentially doing so unnecessarily.

It looks to me like several of the problems here stem from trying to reuse
code from ReplicationSlotAcquireInternal() (which before this was just named
ReplicationSlotAcquire()). I don't think that makes sense, because cases like
this want to check whether a condition is true, and acquire the slot only if
so.

IOW, I think this basically needs to look like ReplicationSlotsDropDBSlots(),
except that a different condition is checked, and the if (active_pid) case
needs to prepare a condition variable, signal the owner and then wait on the
condition variable, to restart after.

Greetings,

Andres Freund
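[For readers following along, a minimal sketch of what option 2 could look
like, using PostgreSQL's internal slot and condition-variable APIs of this
era (slot.h, condition_variable.h). This is illustrative only, not the
committed fix: it takes an already-computed oldestLSN for brevity, and omits
persisting the invalidated slot, logging, and error handling.]

static void
InvalidateObsoleteSlotsSketch(XLogRecPtr oldestLSN)
{
restart:
	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
	for (int i = 0; i < max_replication_slots; i++)
	{
		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
		XLogRecPtr	restart_lsn;
		pid_t		active_pid;

		if (!s->in_use)
			continue;

		/*
		 * Test the invalidation condition and claim the slot in one
		 * spinlock section, while still holding the control lock, so the
		 * slot cannot be dropped and recycled in between.
		 */
		SpinLockAcquire(&s->mutex);
		restart_lsn = s->data.restart_lsn;
		active_pid = s->active_pid;
		if (active_pid == 0 &&
			!XLogRecPtrIsInvalid(restart_lsn) &&
			restart_lsn < oldestLSN)
			s->active_pid = MyProcPid;
		SpinLockRelease(&s->mutex);

		if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
			continue;			/* slot does not need invalidating */

		if (active_pid != 0)
		{
			/*
			 * Someone else owns the slot: register for its CV before
			 * dropping the lock, signal the owner, then sleep until the
			 * slot is released, re-testing the exit condition before each
			 * sleep as the CV API requires.
			 */
			ConditionVariablePrepareToSleep(&s->active_cv);
			LWLockRelease(ReplicationSlotControlLock);
			(void) kill(active_pid, SIGTERM);
			for (;;)
			{
				bool		released;

				SpinLockAcquire(&s->mutex);
				released = (s->active_pid == 0);
				SpinLockRelease(&s->mutex);
				if (released)
					break;
				ConditionVariableSleep(&s->active_cv,
									   WAIT_EVENT_REPLICATION_SLOT_DROP);
			}
			ConditionVariableCancelSleep();
			goto restart;		/* rescan: the slot may have changed */
		}

		/* We claimed the slot above: invalidate it and wake any waiters. */
		LWLockRelease(ReplicationSlotControlLock);
		SpinLockAcquire(&s->mutex);
		s->data.invalidated_at = s->data.restart_lsn;
		s->data.restart_lsn = InvalidXLogRecPtr;
		s->active_pid = 0;
		SpinLockRelease(&s->mutex);
		ConditionVariableBroadcast(&s->active_cv);
		goto restart;
	}
	LWLockRelease(ReplicationSlotControlLock);
}

The property this sketch aims for is that the condition test and the claim of
the slot happen atomically while ReplicationSlotControlLock is still held, so
a concurrent drop-and-recreate cannot swap a different slot into the same
array offset between the check and the acquisition.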
Hi,

On 2021-04-07 17:10:37 -0700, Andres Freund wrote:
> I think this can be solved in two different ways:
>
> 1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of
>    InvalidateObsoleteReplicationSlots(). That way nobody could re-create a
>    new slot in the to-be-obsoleted slot's place.
>
> 2) Atomically check whether the slot needs to be invalidated and try to
>    acquire it if needed. Don't release ReplicationSlotControlLock between
>    those two steps. Signal the owner to release the slot iff we couldn't
>    acquire the slot. In the latter case wait and then recheck whether the
>    slot still needs to be dropped.
>
> To me 2) seems better, because we can then also be sure that the slot still
> needs to be obsoleted, rather than potentially doing so unnecessarily.
>
> It looks to me like several of the problems here stem from trying to reuse
> code from ReplicationSlotAcquireInternal() (which before this was just named
> ReplicationSlotAcquire()). I don't think that makes sense, because cases like
> this want to check whether a condition is true, and acquire the slot only if
> so.
>
> IOW, I think this basically needs to look like ReplicationSlotsDropDBSlots(),
> except that a different condition is checked, and the if (active_pid) case
> needs to prepare a condition variable, signal the owner and then wait on the
> condition variable, to restart after.

I'm also confused by the use of ConditionVariableTimedSleep(timeout =
10). Why do we need a timed sleep here in the first place? And why with
such a short sleep?

I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS() -
even though it's running in the checkpointer. I don't think CFI does
much there? If we are worried about needing to check for interrupts,
more work is needed.

Sketch for a fix attached. I did leave the odd
ConditionVariableTimedSleep(10ms) in, because I wasn't sure why it's
there...

After this I don't see a reason to have SAB_Inquire - as far as I can
tell it's practically impossible to use without race conditions? Except
for raising an error - which is "builtin"...

Greetings,

Andres Freund
Attachment
On Thu, Apr 8, 2021 at 7:39 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2021-04-07 17:10:37 -0700, Andres Freund wrote:
> > I think this can be solved in two different ways:
> >
> > 1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of
> >    InvalidateObsoleteReplicationSlots(). That way nobody could re-create a
> >    new slot in the to-be-obsoleted slot's place.
> >
> > 2) Atomically check whether the slot needs to be invalidated and try to
> >    acquire it if needed. Don't release ReplicationSlotControlLock between
> >    those two steps. Signal the owner to release the slot iff we couldn't
> >    acquire the slot. In the latter case wait and then recheck whether the
> >    slot still needs to be dropped.
> >
> > To me 2) seems better, because we can then also be sure that the slot still
> > needs to be obsoleted, rather than potentially doing so unnecessarily.
> >

+1.

> > It looks to me like several of the problems here stem from trying to reuse
> > code from ReplicationSlotAcquireInternal() (which before this was just named
> > ReplicationSlotAcquire()). I don't think that makes sense, because cases like
> > this want to check whether a condition is true, and acquire the slot only if
> > so.
> >
> > IOW, I think this basically needs to look like ReplicationSlotsDropDBSlots(),
> > except that a different condition is checked, and the if (active_pid) case
> > needs to prepare a condition variable, signal the owner and then wait on the
> > condition variable, to restart after.
>
> I'm also confused by the use of ConditionVariableTimedSleep(timeout =
> 10). Why do we need a timed sleep here in the first place? And why with
> such a short sleep?
>
> I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS() -
> even though it's running in the checkpointer. I don't think CFI does
> much there? If we are worried about needing to check for interrupts,
> more work is needed.
>
> Sketch for a fix attached. I did leave the odd
> ConditionVariableTimedSleep(10ms) in, because I wasn't sure why it's
> there...

I haven't tested the patch but I couldn't spot any problems while
reading it. A minor point: don't we need to use
ConditionVariableCancelSleep() at some point after doing
ConditionVariableTimedSleep?

-- 
With Regards,
Amit Kapila.
On 2021-Apr-07, Andres Freund wrote:

> I'm also confused by the use of ConditionVariableTimedSleep(timeout =
> 10). Why do we need a timed sleep here in the first place? And why with
> such a short sleep?

I was scared of the possibility that a process would not set the CV for
whatever reason, causing checkpointing to become stuck. Maybe that's
misguided thinking if CVs are reliable enough.

> I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS() -
> even though it's running in the checkpointer. I don't think CFI does
> much there? If we are worried about needing to check for interrupts,
> more work is needed.

Hmm .. yeah, doing CFI seems pretty useless. I think that should just be
removed. If checkpointer gets USR2 (request for shutdown) it's not going
to affect the behavior of CFI anyway.

I attach a couple of changes to your 0001. It's all cosmetic; the one
thing that looks not so cosmetic is the change of "continue" to "break"
in the helper routine: if !s->in_use, we'd loop infinitely. The other
routine already checks that before calling the helper; since you hold
ReplicationSlotControlLock at that point, it should not be possible to
drop the slot in between. Anyway, it's a trivial change to make, so it
should be correct. See the sketch after this message for why "continue"
cannot work there.

I also added a "continue" at the bottom of one block; currently that
doesn't change any behavior, but if we add code at the other block, it
might not do what's intended.

> After this I don't see a reason to have SAB_Inquire - as far as I can
> tell it's practically impossible to use without race conditions? Except
> for raising an error - which is "builtin"...

Hmm, interesting ... If not needed, yeah, let's get rid of that.

Are you getting this set pushed, or would you like me to handle it?
(There seems to be some minor conflict in 13)

-- 
Álvaro Herrera       Valdivia, Chile
"Pido que me den el Nobel por razones humanitarias"
(Nicanor Parra)
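[A minimal hedged sketch of the continue-vs-break point above; the loop shape
is illustrative, not the patch text, with field names as in slot.h.]

	/*
	 * Per-slot retry loop.  Once the slot has been dropped, nothing in
	 * this loop can ever make s->in_use true again, so "continue" would
	 * spin forever; "break" returns control to the caller, which rescans
	 * the slot array under ReplicationSlotControlLock.
	 */
	for (;;)
	{
		pid_t		active_pid;

		if (!s->in_use)
			break;				/* dropped concurrently: stop, don't retry */

		SpinLockAcquire(&s->mutex);
		active_pid = s->active_pid;
		SpinLockRelease(&s->mutex);

		if (active_pid == 0)
			break;				/* released: the caller can claim it now */

		/* Still owned: signal the owner, then wait for the release. */
		(void) kill(active_pid, SIGTERM);
		ConditionVariableSleep(&s->active_cv,
							   WAIT_EVENT_REPLICATION_SLOT_DROP);
	}
	ConditionVariableCancelSleep();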
Attachment
Hi,

On 2021-04-08 17:03:41 +0530, Amit Kapila wrote:
> I haven't tested the patch but I couldn't spot any problems while
> reading it. A minor point: don't we need to use
> ConditionVariableCancelSleep() at some point after doing
> ConditionVariableTimedSleep?

It's not really necessary - unless the CV could get deallocated as part
of dynamic shared memory or such.

Greetings,

Andres Freund
Hi,

On 2021-04-29 13:28:20 -0400, Álvaro Herrera wrote:
> On 2021-Apr-07, Andres Freund wrote:
>
> > I'm also confused by the use of ConditionVariableTimedSleep(timeout =
> > 10). Why do we need a timed sleep here in the first place? And why with
> > such a short sleep?
>
> I was scared of the possibility that a process would not set the CV for
> whatever reason, causing checkpointing to become stuck. Maybe that's
> misguided thinking if CVs are reliable enough.

They better be, or we have bigger problems. And if it's an escape hatch,
we surely ought not to use 10ms as the timeout. That's an appropriate
time for something *not* using condition variables...

> I attach a couple of changes to your 0001. It's all cosmetic; the one
> thing that looks not so cosmetic is the change of "continue" to "break"
> in the helper routine: if !s->in_use, we'd loop infinitely. The other
> routine already checks that before calling the helper; since you hold
> ReplicationSlotControlLock at that point, it should not be possible to
> drop the slot in between. Anyway, it's a trivial change to make, so it
> should be correct.
>
> I also added a "continue" at the bottom of one block; currently that
> doesn't change any behavior, but if we add code at the other block, it
> might not do what's intended.

Seems sane.

> Are you getting this set pushed, or would you like me to handle it?
> (There seems to be some minor conflict in 13)

I'd be quite happy for you to handle it...

Greetings,

Andres Freund
Here's a version that I feel is committable (0001). There was an issue
when returning from the inner loop, if in a previous iteration we had
released the lock. In that case we need to return with the lock not
held, so that the caller can acquire it again - but we weren't. This
seems pretty hard to hit in practice (I suppose somebody needs to
destroy the slot just as checkpointer kills the walsender, but before
checkpointer marks the slot as its own) ... but if it did happen, I
think checkpointer would block with no recourse. Also added some
comments and slightly restructured the code.

There are plenty of conflicts in pg13, but they're all easy to handle.

I wrote a test (0002) to cover the case of signalling a walsender, which
is currently not covered (we only deal with the case of a standby that's
not running). There are some sharp edges in this code - I had to make
it use background_psql() to send a CHECKPOINT, which hangs, because I
previously sent a SIGSTOP to the walreceiver. Maybe there's a better
way to achieve a walreceiver that remains connected but doesn't consume
input from the primary, but I don't know what it is. Anyway, the code
becomes covered with this. I would like to at least see it in master,
to gather some reactions from buildfarm.

-- 
Álvaro Herrera       Valdivia, Chile
<Schwern> It does it in a really, really complicated way
<crab> why does it need to be complicated?
<Schwern> Because it's MakeMaker.
Attachment
On 2021-Jun-10, Álvaro Herrera wrote:

> I wrote a test (0002) to cover the case of signalling a walsender, which
> is currently not covered (we only deal with the case of a standby that's
> not running). There are some sharp edges in this code - I had to make
> it use background_psql() to send a CHECKPOINT, which hangs, because I
> previously sent a SIGSTOP to the walreceiver. Maybe there's a better
> way to achieve a walreceiver that remains connected but doesn't consume
> input from the primary, but I don't know what it is. Anyway, the code
> becomes covered with this. I would like to at least see it in master,
> to gather some reactions from buildfarm.

Small fixup to the test one, so that skipping it on Windows works
correctly.

-- 
Álvaro Herrera              39°49'30"S 73°17'W
Voy a acabar con todos los humanos / con los humanos yo acabaré
voy a acabar con todos (bis) / con todos los humanos acabaré ¡acabaré! (Bender)
Attachment
On 2021-Jun-10, Álvaro Herrera wrote:

> Here's a version that I feel is committable (0001). There was an issue
> when returning from the inner loop, if in a previous iteration we had
> released the lock. In that case we need to return with the lock not
> held, so that the caller can acquire it again - but we weren't. This
> seems pretty hard to hit in practice (I suppose somebody needs to
> destroy the slot just as checkpointer kills the walsender, but before
> checkpointer marks the slot as its own) ... but if it did happen, I
> think checkpointer would block with no recourse. Also added some
> comments and slightly restructured the code.
>
> There are plenty of conflicts in pg13, but they're all easy to handle.

Pushed, with additional minor changes.

> I wrote a test (0002) to cover the case of signalling a walsender, which
> is currently not covered (we only deal with the case of a standby that's
> not running). There are some sharp edges in this code - I had to make
> it use background_psql() to send a CHECKPOINT, which hangs, because I
> previously sent a SIGSTOP to the walreceiver. Maybe there's a better
> way to achieve a walreceiver that remains connected but doesn't consume
> input from the primary, but I don't know what it is. Anyway, the code
> becomes covered with this. I would like to at least see it in master,
> to gather some reactions from buildfarm.

I tried hard to make this stable, but it just isn't: it works fine one
thousand runs, then I grab some coffee, run it once more, and that one
fails. Why? That's not clear to me. Attached is the last one I have,
in case somebody wants to make it better. Maybe there's some completely
different approach that works better, but I'm out of ideas for now.

-- 
Álvaro Herrera       Valdivia, Chile
"La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre" (Ijon Tichy)
Attachment
On 2021-Apr-07, Andres Freund wrote:

> After this I don't see a reason to have SAB_Inquire - as far as I can
> tell it's practically impossible to use without race conditions? Except
> for raising an error - which is "builtin"...

Pushed 0002. Thanks!

-- 
Álvaro Herrera              39°49'30"S 73°17'W
"La persona que no quería pecar / estaba obligada a sentarse
en duras y empinadas sillas / desprovistas, por cierto
de blandos atenuantes" (Patricio Vogel)
On 2021-06-11 15:52:21 -0400, Álvaro Herrera wrote:
> On 2021-Apr-07, Andres Freund wrote:
>
> > After this I don't see a reason to have SAB_Inquire - as far as I can
> > tell it's practically impossible to use without race conditions? Except
> > for raising an error - which is "builtin"...
>
> Pushed 0002. Thanks!

Thank you for your work on this!
On 2021-Jun-11, Álvaro Herrera wrote:

> I tried hard to make this stable, but it just isn't: it works fine one
> thousand runs, then I grab some coffee, run it once more, and that one
> fails. Why? That's not clear to me. Attached is the last one I have,
> in case somebody wants to make it better. Maybe there's some completely
> different approach that works better, but I'm out of ideas for now.

It occurred to me that this could be made better by sigstopping both
walreceiver and walsender, then letting only the latter run; AFAICS this
makes the test stable. I'll register this in the upcoming commitfest to
let cfbot run it, and if it looks good there I'll get it pushed to
master. If there's any problem I'll just remove it before beta2 is
stamped.

-- 
Álvaro Herrera       Valdivia, Chile
Attachment
Apologies, I inadvertently sent the version before I added a maximum
number of iterations in the final loop.

-- 
Álvaro Herrera       Valdivia, Chile
"La fuerza no está en los medios físicos
sino que reside en una voluntad indomable" (Gandhi)
Attachment
Álvaro Herrera <alvherre@alvh.no-ip.org> writes:
> It occurred to me that this could be made better by sigstopping both
> walreceiver and walsender, then letting only the latter run; AFAICS this
> makes the test stable. I'll register this in the upcoming commitfest to
> let cfbot run it, and if it looks good there I'll get it pushed to
> master. If there's any problem I'll just remove it before beta2 is
> stamped.

Hmm ... desmoxytes has failed this test once, out of four runs since
it went in:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04

None of the other animals that have reported in so far are unhappy.
Still, maybe that's not a track record we want to have for beta2?

I've just launched a run on gaur, which given its dinosaur status might
be the most likely animal to have an actual portability problem with
this test technique. If you want to wait a few hours to see what it
says, that'd be fine with me.

			regards, tom lane
Hah, desmoxytes failed once:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04

I'll revert it and investigate later. There have been no other failures.

-- 
Álvaro Herrera              39°49'30"S 73°17'W
"Hay que recordar que la existencia en el cosmos, y particularmente la
elaboración de civilizaciones dentro de él no son, por desgracia,
nada idílicas" (Ijon Tichy)
I wrote:
> Hmm ... desmoxytes has failed this test once, out of four runs since
> it went in:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04

I studied this failure a bit more, and I think the test itself has a
race condition. It's doing

# freeze walsender and walreceiver. Slot will still be active, but walreceiver
# won't get anything anymore.
kill 'STOP', $senderpid, $receiverpid;
$logstart = get_log_size($node_primary3);
advance_wal($node_primary3, 4);
ok(find_in_log($node_primary3, "to release replication slot", $logstart),
	"walreceiver termination logged");

The string it's looking for does show up in node_primary3's log, but not
for another second or so; we can see instances of the following
poll_query_until query before that happens. So the problem is that
there is no interlock to ensure that the walreceiver terminates before
this find_in_log check looks for it.

You should be able to fix this by adding a retry loop around the
find_in_log check (which would likely mean that you don't need to do
multiple advance_wal iterations here). However, I agree with reverting
the test for now and then trying again after beta2.

			regards, tom lane
I wrote:
> I studied this failure a bit more, and I think the test itself has
> a race condition. It's doing
>
> # freeze walsender and walreceiver. Slot will still be active, but walreceiver
> # won't get anything anymore.
> kill 'STOP', $senderpid, $receiverpid;
> $logstart = get_log_size($node_primary3);
> advance_wal($node_primary3, 4);
> ok(find_in_log($node_primary3, "to release replication slot", $logstart),
> 	"walreceiver termination logged");

Actually ... isn't there a second race, in the opposite direction?
IIUC, the point of this is that once we force some WAL to be sent to
the frozen sender/receiver, they'll be killed for failure to respond.
But the advance_wal call is not the only possible cause of that; a
background autovacuum for example could emit some WAL. So I fear it's
possible for the 'to release replication slot' message to come out
before we capture $logstart. I think you need to capture that value
before the kill, not after.

I also suggest that it wouldn't be a bad idea to make the find_in_log
check more specific, by including the expected PID and perhaps the
expected slot name in the string. The full message in primary3's log
looks like

2021-06-19 05:24:36.221 CEST [60cd636f.362648:12] LOG:  terminating process 3548959 to release replication slot "rep3"

and I don't understand why we wouldn't match on the whole message text.
(I think doing so will also reveal that what we are looking for here is
the walsender pid, not the walreceiver pid, and thus that the
description in the ok() call is backwards. Or maybe we do want to check
the walreceiver side, in which case we are searching the wrong
postmaster's log?)

			regards, tom lane
On 2021-Jun-20, Tom Lane wrote:

> Actually ... isn't there a second race, in the opposite direction?
> IIUC, the point of this is that once we force some WAL to be sent
> to the frozen sender/receiver, they'll be killed for failure to
> respond. But the advance_wal call is not the only possible cause
> of that; a background autovacuum for example could emit some WAL.
> So I fear it's possible for the 'to release replication slot'
> message to come out before we capture $logstart. I think you
> need to capture that value before the kill, not after.

I accounted for all those things and pushed again.

-- 
Álvaro Herrera       Valdivia, Chile
"I can see support will not be a problem.  10 out of 10."    (Simon Wittber)
      (http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)
On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Jun-20, Tom Lane wrote:
>
> > Actually ... isn't there a second race, in the opposite direction?
> > IIUC, the point of this is that once we force some WAL to be sent
> > to the frozen sender/receiver, they'll be killed for failure to
> > respond. But the advance_wal call is not the only possible cause
> > of that; a background autovacuum for example could emit some WAL.
> > So I fear it's possible for the 'to release replication slot'
> > message to come out before we capture $logstart. I think you
> > need to capture that value before the kill, not after.
>
> I accounted for all those things and pushed again.

I saw that this patch has been pushed. If there is no pending work left
for this, can we change the commitfest entry to Committed?

Regards,
Vignesh
On 2021-Jul-05, vignesh C wrote:

> On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > On 2021-Jun-20, Tom Lane wrote:
> >
> > > Actually ... isn't there a second race, in the opposite direction?
> > > IIUC, the point of this is that once we force some WAL to be sent
> > > to the frozen sender/receiver, they'll be killed for failure to
> > > respond. But the advance_wal call is not the only possible cause
> > > of that; a background autovacuum for example could emit some WAL.
> > > So I fear it's possible for the 'to release replication slot'
> > > message to come out before we capture $logstart. I think you
> > > need to capture that value before the kill, not after.
> >
> > I accounted for all those things and pushed again.
>
> I saw that this patch has been pushed. If there is no pending work left
> for this, can we change the commitfest entry to Committed?

There is none that I'm aware of; please mark it committed. Thanks

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)
On Mon, Jul 5, 2021 at 10:30 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Jul-05, vignesh C wrote:
>
> > On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > >
> > > On 2021-Jun-20, Tom Lane wrote:
> > >
> > > > Actually ... isn't there a second race, in the opposite direction?
> > > > IIUC, the point of this is that once we force some WAL to be sent
> > > > to the frozen sender/receiver, they'll be killed for failure to
> > > > respond. But the advance_wal call is not the only possible cause
> > > > of that; a background autovacuum for example could emit some WAL.
> > > > So I fear it's possible for the 'to release replication slot'
> > > > message to come out before we capture $logstart. I think you
> > > > need to capture that value before the kill, not after.
> > >
> > > I accounted for all those things and pushed again.
> >
> > I saw that this patch has been pushed. If there is no pending work left
> > for this, can we change the commitfest entry to Committed?
>
> There is none that I'm aware of; please mark it committed. Thanks

Thanks for confirming; I have marked it as committed.

Regards,
Vignesh
Hi,

On 2021-06-11 12:27:57 -0400, Álvaro Herrera wrote:
> On 2021-Jun-10, Álvaro Herrera wrote:
>
> > Here's a version that I feel is committable (0001). There was an issue
> > when returning from the inner loop, if in a previous iteration we had
> > released the lock. In that case we need to return with the lock not
> > held, so that the caller can acquire it again - but we weren't. This
> > seems pretty hard to hit in practice (I suppose somebody needs to
> > destroy the slot just as checkpointer kills the walsender, but before
> > checkpointer marks the slot as its own) ... but if it did happen, I
> > think checkpointer would block with no recourse. Also added some
> > comments and slightly restructured the code.
> >
> > There are plenty of conflicts in pg13, but they're all easy to handle.
>
> Pushed, with additional minor changes.

I stared at this code, due to [1], and I think I found a bug. I think
it's not the cause of the failures in that thread, but we probably
should still do something about it.

I think the minor changes might unfortunately have introduced a race?
Before the patch the code just used ConditionVariableSleep(), but now it
also has a ConditionVariablePrepareToSleep(), without re-checking the
sleep condition until

		/* Wait until the slot is released. */
		ConditionVariableSleep(&s->active_cv,
							   WAIT_EVENT_REPLICATION_SLOT_DROP);

which directly violates what ConditionVariablePrepareToSleep() documents:

 * This can optionally be called before entering a test/sleep loop.
 * Doing so is more efficient if we'll need to sleep at least once.
 * However, if the first test of the exit condition is likely to succeed,
 * it's more efficient to omit the ConditionVariablePrepareToSleep call.
 * See comments in ConditionVariableSleep for more detail.
 *
 * Caution: "before entering the loop" means you *must* test the exit
 * condition between calling ConditionVariablePrepareToSleep and calling
 * ConditionVariableSleep.  If that is inconvenient, omit calling
 * ConditionVariablePrepareToSleep.

Afaics this means we can potentially sleep forever if the prior owner of
the slot releases it before the ConditionVariablePrepareToSleep().

There's a comment mentioning this danger:

			/*
			 * Prepare the sleep on the slot's condition variable before
			 * releasing the lock, to close a possible race condition if the
			 * slot is released before the sleep below.
			 */
			ConditionVariablePrepareToSleep(&s->active_cv);

			LWLockRelease(ReplicationSlotControlLock);

but afaics that is bogus, because releasing a slot doesn't take
ReplicationSlotControlLock. That just protects against the slot being
dropped, not against it being released.

We can ConditionVariablePrepareToSleep() here, but we'd have to do it
earlier, before the checks at the start of the while loop.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20220218231415.c4plkp4i3reqcwip%40alap3.anarazel.de
Hi,

On 2022-02-22 17:48:55 -0800, Andres Freund wrote:
> I think the minor changes might unfortunately have introduced a race?
> Before the patch the code just used ConditionVariableSleep(), but now it
> also has a ConditionVariablePrepareToSleep(), without re-checking the
> sleep condition until
>
> 		/* Wait until the slot is released. */
> 		ConditionVariableSleep(&s->active_cv,
> 							   WAIT_EVENT_REPLICATION_SLOT_DROP);
>
> which directly violates what ConditionVariablePrepareToSleep() documents:
>
>  * This can optionally be called before entering a test/sleep loop.
>  * Doing so is more efficient if we'll need to sleep at least once.
>  * However, if the first test of the exit condition is likely to succeed,
>  * it's more efficient to omit the ConditionVariablePrepareToSleep call.
>  * See comments in ConditionVariableSleep for more detail.
>  *
>  * Caution: "before entering the loop" means you *must* test the exit
>  * condition between calling ConditionVariablePrepareToSleep and calling
>  * ConditionVariableSleep.  If that is inconvenient, omit calling
>  * ConditionVariablePrepareToSleep.
>
> Afaics this means we can potentially sleep forever if the prior owner of
> the slot releases it before the ConditionVariablePrepareToSleep().
>
> There's a comment mentioning this danger:
>
> 			/*
> 			 * Prepare the sleep on the slot's condition variable before
> 			 * releasing the lock, to close a possible race condition if the
> 			 * slot is released before the sleep below.
> 			 */
> 			ConditionVariablePrepareToSleep(&s->active_cv);
>
> 			LWLockRelease(ReplicationSlotControlLock);
>
> but afaics that is bogus, because releasing a slot doesn't take
> ReplicationSlotControlLock. That just protects against the slot being
> dropped, not against it being released.
>
> We can ConditionVariablePrepareToSleep() here, but we'd have to do it
> earlier, before the checks at the start of the while loop.

Not at the start of the while loop - outside of the while loop. Doing it
in the loop body doesn't make sense, even if it's at the top. Each
ConditionVariablePrepareToSleep() will unregister itself:

	/*
	 * If some other sleep is already prepared, cancel it; this is necessary
	 * because we have just one static variable tracking the prepared sleep,
	 * and also only one cvWaitLink in our PGPROC.  It's okay to do this
	 * because whenever control does return to the other test-and-sleep loop,
	 * its ConditionVariableSleep call will just re-establish that sleep as
	 * the prepared one.
	 */
	if (cv_sleep_target != NULL)
		ConditionVariableCancelSleep();

The intended use is documented in this comment:

 * This should be called in a predicate loop that tests for a specific exit
 * condition and otherwise sleeps, like so:
 *
 *	 ConditionVariablePrepareToSleep(cv);  // optional
 *	 while (condition for which we are waiting is not true)
 *		 ConditionVariableSleep(cv, wait_event_info);
 *	 ConditionVariableCancelSleep();

Greetings,

Andres Freund
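[Putting the two quoted comments together, the corrected shape would
presumably be something like the following sketch. The exit condition shown
(s->active_pid == 0) is an assumption for illustration, not the committed
code.]

	/*
	 * Register for the CV before the first test of the exit condition, so
	 * a release that happens between the test and the sleep still wakes us
	 * instead of being lost.
	 */
	ConditionVariablePrepareToSleep(&s->active_cv);
	for (;;)
	{
		bool		released;

		SpinLockAcquire(&s->mutex);
		released = (s->active_pid == 0);	/* assumed exit condition */
		SpinLockRelease(&s->mutex);

		if (released)
			break;

		ConditionVariableSleep(&s->active_cv,
							   WAIT_EVENT_REPLICATION_SLOT_DROP);
	}
	ConditionVariableCancelSleep();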