Thread: Race condition in InvalidateObsoleteReplicationSlots()
Hi,

I was looking at InvalidateObsoleteReplicationSlots() while reviewing /
polishing the logical decoding on standby patch. That led me to notice what I
think is a race in InvalidateObsoleteReplicationSlots() (because
ResolveRecoveryConflictWithLogicalSlots has a closely related one).

void
InvalidateObsoleteReplicationSlots(XLogSegNo oldestSegno)
{
...
	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
	for (int i = 0; i < max_replication_slots; i++)
	{
...
		if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
			continue;

		LWLockRelease(ReplicationSlotControlLock);
...
		for (;;)
		{
...
			wspid = ReplicationSlotAcquireInternal(s, NULL, SAB_Inquire);
...
			SpinLockAcquire(&s->mutex);
			s->data.invalidated_at = s->data.restart_lsn;
			s->data.restart_lsn = InvalidXLogRecPtr;
			SpinLockRelease(&s->mutex);
...

As far as I can tell there's no guarantee that the slot wasn't concurrently
dropped and another replication slot created at the same offset in
ReplicationSlotCtl->replication_slots - which we would then promptly
invalidate, regardless of the new slot not actually needing to be
invalidated.

Note that this is different from the race mentioned in a comment:

			/*
			 * Signal to terminate the process that owns the slot.
			 *
			 * There is the race condition where other process may own
			 * the slot after the process using it was terminated and before
			 * this process owns it. To handle this case, we signal again
			 * if the PID of the owning process is changed than the last.
			 *
			 * XXX This logic assumes that the same PID is not reused
			 * very quickly.
			 */

It's one thing to terminate a connection erroneously - permanently breaking a
replica by invalidating the wrong slot is, imo, a different matter.

Interestingly this problem seems to have been present in both

commit c6550776394e25c1620bc8258427c8f1d448080d
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date:   2020-04-07 18:35:00 -0400

    Allow users to limit storage reserved by replication slots

and

commit f9e9704f09daf882f5a1cf1fbe3f5a3150ae2bb9
Author: Fujii Masao <fujii@postgresql.org>
Date:   2020-06-19 17:15:52 +0900

    Fix issues in invalidation of obsolete replication slots.

I think this can be solved in two different ways:

1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of
   InvalidateObsoleteReplicationSlots(). That way nobody could re-create a
   new slot in the to-be-obsoleted slot's place.

2) Atomically check whether the slot needs to be invalidated and try to
   acquire it if needed. Don't release ReplicationSlotControlLock between
   those two steps. Signal the owner to release the slot iff we couldn't
   acquire the slot. In the latter case wait and then recheck whether the
   slot still needs to be dropped.

To me 2) seems better, because we can then also be sure that the slot still
needs to be obsoleted, rather than potentially doing so unnecessarily.

It looks to me like several of the problems here stem from trying to reuse
code from ReplicationSlotAcquireInternal() (which before this was just named
ReplicationSlotAcquire()). I don't think that makes sense, because cases like
this want to check whether a condition is true, and acquire the slot only if
so.

IOW, I think this basically needs to look like ReplicationSlotsDropDBSlots(),
except that a different condition is checked, and the if (active_pid) case
needs to prepare a condition variable, signal the owner and then wait on the
condition variable, to restart after.

Greetings,

Andres Freund
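[For readers following along, a minimal sketch of what option 2 could look
like, using PostgreSQL's internal slot and condition-variable APIs of this
era (slot.h, condition_variable.h). This is illustrative only, not the
committed fix: it takes an already-computed oldestLSN for brevity, and omits
persisting the invalidated slot, logging, and error handling.]

static void
InvalidateObsoleteSlotsSketch(XLogRecPtr oldestLSN)
{
restart:
	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
	for (int i = 0; i < max_replication_slots; i++)
	{
		ReplicationSlot *s = &ReplicationSlotCtl->replication_slots[i];
		XLogRecPtr	restart_lsn;
		pid_t		active_pid;

		if (!s->in_use)
			continue;

		/*
		 * Test the invalidation condition and claim the slot in one
		 * spinlock section, while still holding the control lock, so the
		 * slot cannot be dropped and recycled in between.
		 */
		SpinLockAcquire(&s->mutex);
		restart_lsn = s->data.restart_lsn;
		active_pid = s->active_pid;
		if (active_pid == 0 &&
			!XLogRecPtrIsInvalid(restart_lsn) &&
			restart_lsn < oldestLSN)
			s->active_pid = MyProcPid;
		SpinLockRelease(&s->mutex);

		if (XLogRecPtrIsInvalid(restart_lsn) || restart_lsn >= oldestLSN)
			continue;			/* slot does not need invalidating */

		if (active_pid != 0)
		{
			/*
			 * Someone else owns the slot: register for its CV before
			 * dropping the lock, signal the owner, then sleep until the
			 * slot is released, re-testing the exit condition before each
			 * sleep as the CV API requires.
			 */
			ConditionVariablePrepareToSleep(&s->active_cv);
			LWLockRelease(ReplicationSlotControlLock);
			(void) kill(active_pid, SIGTERM);
			for (;;)
			{
				bool		released;

				SpinLockAcquire(&s->mutex);
				released = (s->active_pid == 0);
				SpinLockRelease(&s->mutex);
				if (released)
					break;
				ConditionVariableSleep(&s->active_cv,
									   WAIT_EVENT_REPLICATION_SLOT_DROP);
			}
			ConditionVariableCancelSleep();
			goto restart;		/* rescan: the slot may have changed */
		}

		/* We claimed the slot above: invalidate it and wake any waiters. */
		LWLockRelease(ReplicationSlotControlLock);
		SpinLockAcquire(&s->mutex);
		s->data.invalidated_at = s->data.restart_lsn;
		s->data.restart_lsn = InvalidXLogRecPtr;
		s->active_pid = 0;
		SpinLockRelease(&s->mutex);
		ConditionVariableBroadcast(&s->active_cv);
		goto restart;
	}
	LWLockRelease(ReplicationSlotControlLock);
}

The property this sketch aims for is that the condition test and the claim of
the slot happen atomically while ReplicationSlotControlLock is still held, so
a concurrent drop-and-recreate cannot swap a different slot into the same
array offset between the check and the acquisition.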
Hi,

On 2021-04-07 17:10:37 -0700, Andres Freund wrote:
> I think this can be solved in two different ways:
>
> 1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of
>    InvalidateObsoleteReplicationSlots(). That way nobody could re-create a
>    new slot in the to-be-obsoleted slot's place.
>
> 2) Atomically check whether the slot needs to be invalidated and try to
>    acquire it if needed. Don't release ReplicationSlotControlLock between
>    those two steps. Signal the owner to release the slot iff we couldn't
>    acquire the slot. In the latter case wait and then recheck whether the
>    slot still needs to be dropped.
>
> To me 2) seems better, because we can then also be sure that the slot still
> needs to be obsoleted, rather than potentially doing so unnecessarily.
>
> It looks to me like several of the problems here stem from trying to reuse
> code from ReplicationSlotAcquireInternal() (which before this was just named
> ReplicationSlotAcquire()). I don't think that makes sense, because cases like
> this want to check whether a condition is true, and acquire the slot only if
> so.
>
> IOW, I think this basically needs to look like ReplicationSlotsDropDBSlots(),
> except that a different condition is checked, and the if (active_pid) case
> needs to prepare a condition variable, signal the owner and then wait on the
> condition variable, to restart after.

I'm also confused by the use of ConditionVariableTimedSleep(timeout =
10). Why do we need a timed sleep here in the first place? And why with
such a short sleep?

I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS() -
even though it's running in the checkpointer. I don't think CFI does
much there? If we are worried about needing to check for interrupts,
more work is needed.

Sketch for a fix attached. I did leave the odd
ConditionVariableTimedSleep(10ms) in, because I wasn't sure why it's
there...

After this I don't see a reason to have SAB_Inquire - as far as I can
tell it's practically impossible to use without race conditions? Except
for raising an error - which is "builtin"...

Greetings,

Andres Freund
Attachment
On Thu, Apr 8, 2021 at 7:39 AM Andres Freund <andres@anarazel.de> wrote:
>
> On 2021-04-07 17:10:37 -0700, Andres Freund wrote:
> > I think this can be solved in two different ways:
> >
> > 1) Hold ReplicationSlotAllocationLock with LW_SHARED across most of
> >    InvalidateObsoleteReplicationSlots(). That way nobody could re-create a
> >    new slot in the to-be-obsoleted slot's place.
> >
> > 2) Atomically check whether the slot needs to be invalidated and try to
> >    acquire it if needed. Don't release ReplicationSlotControlLock between
> >    those two steps. Signal the owner to release the slot iff we couldn't
> >    acquire the slot. In the latter case wait and then recheck whether the
> >    slot still needs to be dropped.
> >
> > To me 2) seems better, because we can then also be sure that the slot still
> > needs to be obsoleted, rather than potentially doing so unnecessarily.
> >

+1.

> > It looks to me like several of the problems here stem from trying to reuse
> > code from ReplicationSlotAcquireInternal() (which before this was just named
> > ReplicationSlotAcquire()). I don't think that makes sense, because cases like
> > this want to check whether a condition is true, and acquire the slot only if
> > so.
> >
> > IOW, I think this basically needs to look like ReplicationSlotsDropDBSlots(),
> > except that a different condition is checked, and the if (active_pid) case
> > needs to prepare a condition variable, signal the owner and then wait on the
> > condition variable, to restart after.
>
> I'm also confused by the use of ConditionVariableTimedSleep(timeout =
> 10). Why do we need a timed sleep here in the first place? And why with
> such a short sleep?
>
> I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS() -
> even though it's running in the checkpointer. I don't think CFI does
> much there? If we are worried about needing to check for interrupts,
> more work is needed.
>
> Sketch for a fix attached. I did leave the odd
> ConditionVariableTimedSleep(10ms) in, because I wasn't sure why it's
> there...

I haven't tested the patch but I couldn't spot any problems while
reading it. A minor point: don't we need to use
ConditionVariableCancelSleep() at some point after doing
ConditionVariableTimedSleep?

-- 
With Regards,
Amit Kapila.
On 2021-Apr-07, Andres Freund wrote:

> I'm also confused by the use of ConditionVariableTimedSleep(timeout =
> 10). Why do we need a timed sleep here in the first place? And why with
> such a short sleep?

I was scared of the possibility that a process would not set the CV for
whatever reason, causing checkpointing to become stuck. Maybe that's
misguided thinking if CVs are reliable enough.

> I also noticed that the code is careful to use CHECK_FOR_INTERRUPTS() -
> even though it's running in the checkpointer. I don't think CFI does
> much there? If we are worried about needing to check for interrupts,
> more work is needed.

Hmm .. yeah, doing CFI seems pretty useless. I think that should just be
removed. If checkpointer gets USR2 (request for shutdown) it's not going
to affect the behavior of CFI anyway.

I attach a couple of changes to your 0001. It's all cosmetic; the one
thing that looks not so cosmetic is the change of "continue" to "break"
in the helper routine: if !s->in_use, we'd loop infinitely. The other
routine already checks that before calling the helper; since you hold
ReplicationSlotControlLock at that point, it should not be possible to
drop the slot in between. Anyway, it's a trivial change to make, so it
should be correct. See the sketch after this message for why "continue"
cannot work there.

I also added a "continue" at the bottom of one block; currently that
doesn't change any behavior, but if we add code at the other block, it
might not do what's intended.

> After this I don't see a reason to have SAB_Inquire - as far as I can
> tell it's practically impossible to use without race conditions? Except
> for raising an error - which is "builtin"...

Hmm, interesting ... If not needed, yeah, let's get rid of that.

Are you getting this set pushed, or would you like me to handle it?
(There seems to be some minor conflict in 13)

-- 
Álvaro Herrera       Valdivia, Chile
"Pido que me den el Nobel por razones humanitarias"
(Nicanor Parra)
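[A minimal hedged sketch of the continue-vs-break point above; the loop shape
is illustrative, not the patch text, with field names as in slot.h.]

	/*
	 * Per-slot retry loop.  Once the slot has been dropped, nothing in
	 * this loop can ever make s->in_use true again, so "continue" would
	 * spin forever; "break" returns control to the caller, which rescans
	 * the slot array under ReplicationSlotControlLock.
	 */
	for (;;)
	{
		pid_t		active_pid;

		if (!s->in_use)
			break;				/* dropped concurrently: stop, don't retry */

		SpinLockAcquire(&s->mutex);
		active_pid = s->active_pid;
		SpinLockRelease(&s->mutex);

		if (active_pid == 0)
			break;				/* released: the caller can claim it now */

		/* Still owned: signal the owner, then wait for the release. */
		(void) kill(active_pid, SIGTERM);
		ConditionVariableSleep(&s->active_cv,
							   WAIT_EVENT_REPLICATION_SLOT_DROP);
	}
	ConditionVariableCancelSleep();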
Attachment
Hi,

On 2021-04-08 17:03:41 +0530, Amit Kapila wrote:
> I haven't tested the patch but I couldn't spot any problems while
> reading it. A minor point: don't we need to use
> ConditionVariableCancelSleep() at some point after doing
> ConditionVariableTimedSleep?

It's not really necessary - unless the CV could get deallocated as part
of dynamic shared memory or such.

Greetings,

Andres Freund
Hi,

On 2021-04-29 13:28:20 -0400, Álvaro Herrera wrote:
> On 2021-Apr-07, Andres Freund wrote:
>
> > I'm also confused by the use of ConditionVariableTimedSleep(timeout =
> > 10). Why do we need a timed sleep here in the first place? And why with
> > such a short sleep?
>
> I was scared of the possibility that a process would not set the CV for
> whatever reason, causing checkpointing to become stuck. Maybe that's
> misguided thinking if CVs are reliable enough.

They better be, or we have bigger problems. And if it's an escape hatch,
we surely ought not to use 10ms as the timeout. That's an appropriate
time for something *not* using condition variables...

> I attach a couple of changes to your 0001. It's all cosmetic; the one
> thing that looks not so cosmetic is the change of "continue" to "break"
> in the helper routine: if !s->in_use, we'd loop infinitely. The other
> routine already checks that before calling the helper; since you hold
> ReplicationSlotControlLock at that point, it should not be possible to
> drop the slot in between. Anyway, it's a trivial change to make, so it
> should be correct.
>
> I also added a "continue" at the bottom of one block; currently that
> doesn't change any behavior, but if we add code at the other block, it
> might not do what's intended.

Seems sane.

> Are you getting this set pushed, or would you like me to handle it?
> (There seems to be some minor conflict in 13)

I'd be quite happy for you to handle it...

Greetings,

Andres Freund
Here's a version that I feel is committable (0001). There was an issue
when returning from the inner loop, if in a previous iteration we had
released the lock. In that case we need to return with the lock not
held, so that the caller can acquire it again - but we weren't. This
seems pretty hard to hit in practice (I suppose somebody needs to
destroy the slot just as checkpointer kills the walsender, but before
checkpointer marks the slot as its own) ... but if it did happen, I
think checkpointer would block with no recourse. Also added some
comments and slightly restructured the code.

There are plenty of conflicts in pg13, but they're all easy to handle.

I wrote a test (0002) to cover the case of signalling a walsender, which
is currently not covered (we only deal with the case of a standby that's
not running). There are some sharp edges in this code - I had to make
it use background_psql() to send a CHECKPOINT, which hangs, because I
previously sent a SIGSTOP to the walreceiver. Maybe there's a better
way to achieve a walreceiver that remains connected but doesn't consume
input from the primary, but I don't know what it is. Anyway, the code
becomes covered with this. I would like to at least see it in master,
to gather some reactions from buildfarm.

-- 
Álvaro Herrera       Valdivia, Chile
<Schwern> It does it in a really, really complicated way
<crab> why does it need to be complicated?
<Schwern> Because it's MakeMaker.
Attachment
On 2021-Jun-10, Álvaro Herrera wrote:

> I wrote a test (0002) to cover the case of signalling a walsender, which
> is currently not covered (we only deal with the case of a standby that's
> not running). There are some sharp edges in this code - I had to make
> it use background_psql() to send a CHECKPOINT, which hangs, because I
> previously sent a SIGSTOP to the walreceiver. Maybe there's a better
> way to achieve a walreceiver that remains connected but doesn't consume
> input from the primary, but I don't know what it is. Anyway, the code
> becomes covered with this. I would like to at least see it in master,
> to gather some reactions from buildfarm.

Small fixup to the test one, so that skipping it on Windows works
correctly.

-- 
Álvaro Herrera              39°49'30"S 73°17'W
Voy a acabar con todos los humanos / con los humanos yo acabaré
voy a acabar con todos (bis) / con todos los humanos acabaré ¡acabaré! (Bender)
Attachment
On 2021-Jun-10, Álvaro Herrera wrote:

> Here's a version that I feel is committable (0001). There was an issue
> when returning from the inner loop, if in a previous iteration we had
> released the lock. In that case we need to return with the lock not
> held, so that the caller can acquire it again - but we weren't. This
> seems pretty hard to hit in practice (I suppose somebody needs to
> destroy the slot just as checkpointer kills the walsender, but before
> checkpointer marks the slot as its own) ... but if it did happen, I
> think checkpointer would block with no recourse. Also added some
> comments and slightly restructured the code.
>
> There are plenty of conflicts in pg13, but they're all easy to handle.

Pushed, with additional minor changes.

> I wrote a test (0002) to cover the case of signalling a walsender, which
> is currently not covered (we only deal with the case of a standby that's
> not running). There are some sharp edges in this code - I had to make
> it use background_psql() to send a CHECKPOINT, which hangs, because I
> previously sent a SIGSTOP to the walreceiver. Maybe there's a better
> way to achieve a walreceiver that remains connected but doesn't consume
> input from the primary, but I don't know what it is. Anyway, the code
> becomes covered with this. I would like to at least see it in master,
> to gather some reactions from buildfarm.

I tried hard to make this stable, but it just isn't: it works fine one
thousand runs, then I grab some coffee, run it once more, and that one
fails. Why? That's not clear to me. Attached is the last one I have,
in case somebody wants to make it better. Maybe there's some completely
different approach that works better, but I'm out of ideas for now.

-- 
Álvaro Herrera       Valdivia, Chile
"La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre" (Ijon Tichy)
Attachment
On 2021-Apr-07, Andres Freund wrote:

> After this I don't see a reason to have SAB_Inquire - as far as I can
> tell it's practically impossible to use without race conditions? Except
> for raising an error - which is "builtin"...

Pushed 0002. Thanks!

-- 
Álvaro Herrera              39°49'30"S 73°17'W
"La persona que no quería pecar / estaba obligada a sentarse
en duras y empinadas sillas / desprovistas, por cierto
de blandos atenuantes" (Patricio Vogel)
On 2021-06-11 15:52:21 -0400, Álvaro Herrera wrote:
> On 2021-Apr-07, Andres Freund wrote:
>
> > After this I don't see a reason to have SAB_Inquire - as far as I can
> > tell it's practically impossible to use without race conditions? Except
> > for raising an error - which is "builtin"...
>
> Pushed 0002. Thanks!

Thank you for your work on this!
On 2021-Jun-11, Álvaro Herrera wrote:

> I tried hard to make this stable, but it just isn't: it works fine one
> thousand runs, then I grab some coffee, run it once more, and that one
> fails. Why? That's not clear to me. Attached is the last one I have,
> in case somebody wants to make it better. Maybe there's some completely
> different approach that works better, but I'm out of ideas for now.

It occurred to me that this could be made better by sigstopping both
walreceiver and walsender, then letting only the latter run; AFAICS this
makes the test stable. I'll register this in the upcoming commitfest to
let cfbot run it, and if it looks good there I'll get it pushed to
master. If there's any problem I'll just remove it before beta2 is
stamped.

-- 
Álvaro Herrera       Valdivia, Chile
Attachment
Apologies, I inadvertently sent the version before I added a maximum
number of iterations in the final loop.

-- 
Álvaro Herrera       Valdivia, Chile
"La fuerza no está en los medios físicos
sino que reside en una voluntad indomable" (Gandhi)
Attachment
Álvaro Herrera <alvherre@alvh.no-ip.org> writes:
> It occurred to me that this could be made better by sigstopping both
> walreceiver and walsender, then letting only the latter run; AFAICS this
> makes the test stable. I'll register this in the upcoming commitfest to
> let cfbot run it, and if it looks good there I'll get it pushed to
> master. If there's any problem I'll just remove it before beta2 is
> stamped.

Hmm ... desmoxytes has failed this test once, out of four runs since
it went in:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04

None of the other animals that have reported in so far are unhappy.
Still, maybe that's not a track record we want to have for beta2?

I've just launched a run on gaur, which given its dinosaur status might
be the most likely animal to have an actual portability problem with
this test technique. If you want to wait a few hours to see what it
says, that'd be fine with me.

			regards, tom lane
Hah, desmoxytes failed once:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04

I'll revert it and investigate later. There have been no other failures.

-- 
Álvaro Herrera              39°49'30"S 73°17'W
"Hay que recordar que la existencia en el cosmos, y particularmente la
elaboración de civilizaciones dentro de él no son, por desgracia,
nada idílicas" (Ijon Tichy)
I wrote:
> Hmm ... desmoxytes has failed this test once, out of four runs since
> it went in:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04

I studied this failure a bit more, and I think the test itself has a
race condition. It's doing

# freeze walsender and walreceiver. Slot will still be active, but walreceiver
# won't get anything anymore.
kill 'STOP', $senderpid, $receiverpid;
$logstart = get_log_size($node_primary3);
advance_wal($node_primary3, 4);
ok(find_in_log($node_primary3, "to release replication slot", $logstart),
	"walreceiver termination logged");

The string it's looking for does show up in node_primary3's log, but not
for another second or so; we can see instances of the following
poll_query_until query before that happens. So the problem is that
there is no interlock to ensure that the walreceiver terminates before
this find_in_log check looks for it.

You should be able to fix this by adding a retry loop around the
find_in_log check (which would likely mean that you don't need to do
multiple advance_wal iterations here). However, I agree with reverting
the test for now and then trying again after beta2.

			regards, tom lane
I wrote:
> I studied this failure a bit more, and I think the test itself has
> a race condition. It's doing
>
> # freeze walsender and walreceiver. Slot will still be active, but walreceiver
> # won't get anything anymore.
> kill 'STOP', $senderpid, $receiverpid;
> $logstart = get_log_size($node_primary3);
> advance_wal($node_primary3, 4);
> ok(find_in_log($node_primary3, "to release replication slot", $logstart),
> 	"walreceiver termination logged");

Actually ... isn't there a second race, in the opposite direction?
IIUC, the point of this is that once we force some WAL to be sent to
the frozen sender/receiver, they'll be killed for failure to respond.
But the advance_wal call is not the only possible cause of that; a
background autovacuum for example could emit some WAL. So I fear it's
possible for the 'to release replication slot' message to come out
before we capture $logstart. I think you need to capture that value
before the kill, not after.

I also suggest that it wouldn't be a bad idea to make the find_in_log
check more specific, by including the expected PID and perhaps the
expected slot name in the string. The full message in primary3's log
looks like

2021-06-19 05:24:36.221 CEST [60cd636f.362648:12] LOG:  terminating process 3548959 to release replication slot "rep3"

and I don't understand why we wouldn't match on the whole message text.
(I think doing so will also reveal that what we are looking for here is
the walsender pid, not the walreceiver pid, and thus that the
description in the ok() call is backwards. Or maybe we do want to check
the walreceiver side, in which case we are searching the wrong
postmaster's log?)

			regards, tom lane
On 2021-Jun-20, Tom Lane wrote:

> Actually ... isn't there a second race, in the opposite direction?
> IIUC, the point of this is that once we force some WAL to be sent
> to the frozen sender/receiver, they'll be killed for failure to
> respond. But the advance_wal call is not the only possible cause
> of that; a background autovacuum for example could emit some WAL.
> So I fear it's possible for the 'to release replication slot'
> message to come out before we capture $logstart. I think you
> need to capture that value before the kill, not after.

I accounted for all those things and pushed again.

-- 
Álvaro Herrera       Valdivia, Chile
"I can see support will not be a problem.  10 out of 10."    (Simon Wittber)
      (http://archives.postgresql.org/pgsql-general/2004-12/msg00159.php)
On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Jun-20, Tom Lane wrote:
>
> > Actually ... isn't there a second race, in the opposite direction?
> > IIUC, the point of this is that once we force some WAL to be sent
> > to the frozen sender/receiver, they'll be killed for failure to
> > respond. But the advance_wal call is not the only possible cause
> > of that; a background autovacuum for example could emit some WAL.
> > So I fear it's possible for the 'to release replication slot'
> > message to come out before we capture $logstart. I think you
> > need to capture that value before the kill, not after.
>
> I accounted for all those things and pushed again.

I saw that this patch has been pushed. If there is no pending work left
for this, can we change the commitfest entry to Committed?

Regards,
Vignesh
On 2021-Jul-05, vignesh C wrote:

> On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> >
> > On 2021-Jun-20, Tom Lane wrote:
> >
> > > Actually ... isn't there a second race, in the opposite direction?
> > > IIUC, the point of this is that once we force some WAL to be sent
> > > to the frozen sender/receiver, they'll be killed for failure to
> > > respond. But the advance_wal call is not the only possible cause
> > > of that; a background autovacuum for example could emit some WAL.
> > > So I fear it's possible for the 'to release replication slot'
> > > message to come out before we capture $logstart. I think you
> > > need to capture that value before the kill, not after.
> >
> > I accounted for all those things and pushed again.
>
> I saw that this patch has been pushed. If there is no pending work left
> for this, can we change the commitfest entry to Committed?

There is none that I'm aware of; please mark it committed. Thanks

-- 
Álvaro Herrera         PostgreSQL Developer  —  https://www.EnterpriseDB.com/
"Siempre hay que alimentar a los dioses, aunque la tierra esté seca" (Orual)
On Mon, Jul 5, 2021 at 10:30 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2021-Jul-05, vignesh C wrote:
>
> > On Wed, Jun 23, 2021 at 7:32 PM Álvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> > >
> > > On 2021-Jun-20, Tom Lane wrote:
> > >
> > > > Actually ... isn't there a second race, in the opposite direction?
> > > > IIUC, the point of this is that once we force some WAL to be sent
> > > > to the frozen sender/receiver, they'll be killed for failure to
> > > > respond. But the advance_wal call is not the only possible cause
> > > > of that; a background autovacuum for example could emit some WAL.
> > > > So I fear it's possible for the 'to release replication slot'
> > > > message to come out before we capture $logstart. I think you
> > > > need to capture that value before the kill, not after.
> > >
> > > I accounted for all those things and pushed again.
> >
> > I saw that this patch has been pushed. If there is no pending work left
> > for this, can we change the commitfest entry to Committed?
>
> There is none that I'm aware of; please mark it committed. Thanks

Thanks for confirming; I have marked it as committed.

Regards,
Vignesh
Hi,

On 2021-06-11 12:27:57 -0400, Álvaro Herrera wrote:
> On 2021-Jun-10, Álvaro Herrera wrote:
>
> > Here's a version that I feel is committable (0001). There was an issue
> > when returning from the inner loop, if in a previous iteration we had
> > released the lock. In that case we need to return with the lock not
> > held, so that the caller can acquire it again - but we weren't. This
> > seems pretty hard to hit in practice (I suppose somebody needs to
> > destroy the slot just as checkpointer kills the walsender, but before
> > checkpointer marks the slot as its own) ... but if it did happen, I
> > think checkpointer would block with no recourse. Also added some
> > comments and slightly restructured the code.
> >
> > There are plenty of conflicts in pg13, but they're all easy to handle.
>
> Pushed, with additional minor changes.

I stared at this code, due to [1], and I think I found a bug. I think
it's not the cause of the failures in that thread, but we probably
should still do something about it.

I think the minor changes might unfortunately have introduced a race?
Before the patch the code just used ConditionVariableSleep(), but now it
also has a ConditionVariablePrepareToSleep(), without re-checking the
sleep condition until

		/* Wait until the slot is released. */
		ConditionVariableSleep(&s->active_cv,
							   WAIT_EVENT_REPLICATION_SLOT_DROP);

which directly violates what ConditionVariablePrepareToSleep() documents:

 * This can optionally be called before entering a test/sleep loop.
 * Doing so is more efficient if we'll need to sleep at least once.
 * However, if the first test of the exit condition is likely to succeed,
 * it's more efficient to omit the ConditionVariablePrepareToSleep call.
 * See comments in ConditionVariableSleep for more detail.
 *
 * Caution: "before entering the loop" means you *must* test the exit
 * condition between calling ConditionVariablePrepareToSleep and calling
 * ConditionVariableSleep.  If that is inconvenient, omit calling
 * ConditionVariablePrepareToSleep.

Afaics this means we can potentially sleep forever if the prior owner of
the slot releases it before the ConditionVariablePrepareToSleep().

There's a comment mentioning this danger:

			/*
			 * Prepare the sleep on the slot's condition variable before
			 * releasing the lock, to close a possible race condition if the
			 * slot is released before the sleep below.
			 */
			ConditionVariablePrepareToSleep(&s->active_cv);

			LWLockRelease(ReplicationSlotControlLock);

but afaics that is bogus, because releasing a slot doesn't take
ReplicationSlotControlLock. That just protects against the slot being
dropped, not against it being released.

We can ConditionVariablePrepareToSleep() here, but we'd have to do it
earlier, before the checks at the start of the while loop.

Greetings,

Andres Freund

[1] https://www.postgresql.org/message-id/20220218231415.c4plkp4i3reqcwip%40alap3.anarazel.de
Hi,

On 2022-02-22 17:48:55 -0800, Andres Freund wrote:
> I think the minor changes might unfortunately have introduced a race?
> Before the patch the code just used ConditionVariableSleep(), but now it
> also has a ConditionVariablePrepareToSleep(), without re-checking the
> sleep condition until
>
> 		/* Wait until the slot is released. */
> 		ConditionVariableSleep(&s->active_cv,
> 							   WAIT_EVENT_REPLICATION_SLOT_DROP);
>
> which directly violates what ConditionVariablePrepareToSleep() documents:
>
>  * This can optionally be called before entering a test/sleep loop.
>  * Doing so is more efficient if we'll need to sleep at least once.
>  * However, if the first test of the exit condition is likely to succeed,
>  * it's more efficient to omit the ConditionVariablePrepareToSleep call.
>  * See comments in ConditionVariableSleep for more detail.
>  *
>  * Caution: "before entering the loop" means you *must* test the exit
>  * condition between calling ConditionVariablePrepareToSleep and calling
>  * ConditionVariableSleep.  If that is inconvenient, omit calling
>  * ConditionVariablePrepareToSleep.
>
> Afaics this means we can potentially sleep forever if the prior owner of
> the slot releases it before the ConditionVariablePrepareToSleep().
>
> There's a comment mentioning this danger:
>
> 			/*
> 			 * Prepare the sleep on the slot's condition variable before
> 			 * releasing the lock, to close a possible race condition if the
> 			 * slot is released before the sleep below.
> 			 */
> 			ConditionVariablePrepareToSleep(&s->active_cv);
>
> 			LWLockRelease(ReplicationSlotControlLock);
>
> but afaics that is bogus, because releasing a slot doesn't take
> ReplicationSlotControlLock. That just protects against the slot being
> dropped, not against it being released.
>
> We can ConditionVariablePrepareToSleep() here, but we'd have to do it
> earlier, before the checks at the start of the while loop.

Not at the start of the while loop - outside of the while loop. Doing it
in the loop body doesn't make sense, even if it's at the top. Each
ConditionVariablePrepareToSleep() will unregister itself:

	/*
	 * If some other sleep is already prepared, cancel it; this is necessary
	 * because we have just one static variable tracking the prepared sleep,
	 * and also only one cvWaitLink in our PGPROC.  It's okay to do this
	 * because whenever control does return to the other test-and-sleep loop,
	 * its ConditionVariableSleep call will just re-establish that sleep as
	 * the prepared one.
	 */
	if (cv_sleep_target != NULL)
		ConditionVariableCancelSleep();

The intended use is documented in this comment:

 * This should be called in a predicate loop that tests for a specific exit
 * condition and otherwise sleeps, like so:
 *
 *	 ConditionVariablePrepareToSleep(cv);  // optional
 *	 while (condition for which we are waiting is not true)
 *		 ConditionVariableSleep(cv, wait_event_info);
 *	 ConditionVariableCancelSleep();

Greetings,

Andres Freund
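[Putting the two quoted comments together, the corrected shape would
presumably be something like the following sketch. The exit condition shown
(s->active_pid == 0) is an assumption for illustration, not the committed
code.]

	/*
	 * Register for the CV before the first test of the exit condition, so
	 * a release that happens between the test and the sleep still wakes us
	 * instead of being lost.
	 */
	ConditionVariablePrepareToSleep(&s->active_cv);
	for (;;)
	{
		bool		released;

		SpinLockAcquire(&s->mutex);
		released = (s->active_pid == 0);	/* assumed exit condition */
		SpinLockRelease(&s->mutex);

		if (released)
			break;

		ConditionVariableSleep(&s->active_cv,
							   WAIT_EVENT_REPLICATION_SLOT_DROP);
	}
	ConditionVariableCancelSleep();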