Re: Race condition in InvalidateObsoleteReplicationSlots() - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Race condition in InvalidateObsoleteReplicationSlots()
Date
Msg-id 1514140.1624214229@sss.pgh.pa.us
Whole thread Raw
In response to Re: Race condition in InvalidateObsoleteReplicationSlots()  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Race condition in InvalidateObsoleteReplicationSlots()  (Álvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-hackers
I wrote:
> I studied this failure a bit more, and I think the test itself has
> a race condition.  It's doing
>
> # freeze walsender and walreceiver. Slot will still be active, but walreceiver
> # won't get anything anymore.
> kill 'STOP', $senderpid, $receiverpid;
> $logstart = get_log_size($node_primary3);
> advance_wal($node_primary3, 4);
> ok(find_in_log($node_primary3, "to release replication slot", $logstart),
>     "walreceiver termination logged");

Actually ... isn't there a second race, in the opposite direction?
IIUC, the point of this is that once we force some WAL to be sent
to the frozen sender/receiver, they'll be killed for failure to
respond.  But the advance_wal call is not the only possible cause
of that; a background autovacuum for example could emit some WAL.
So I fear it's possible for the 'to release replication slot'
message to come out before we capture $logstart.  I think you
need to capture that value before the kill not after.

I also suggest that it wouldn't be a bad idea to make the
find_in_log check more specific, by including the expected PID
and perhaps the expected slot name in the string.  The full
message in primary3's log looks like

2021-06-19 05:24:36.221 CEST [60cd636f.362648:12] LOG:  terminating process 3548959 to release replication slot "rep3"

and I don't understand why we wouldn't match on the whole
message text.  (I think doing so will also reveal that what
we are looking for here is the walsender pid, not the walreceiver
pid, and thus that the description in the ok() call is backwards.
Or maybe we do want to check the walreceiver side, in which case
we are searching the wrong postmaster's log?)

            regards, tom lane



pgsql-hackers by date:

Previous
From: Andrey Borodin
Date:
Subject: Re: Different compression methods for FPI
Next
From: Tom Lane
Date:
Subject: Reducing the cycle time for CLOBBER_CACHE_ALWAYS buildfarm members