Re: Race condition in InvalidateObsoleteReplicationSlots() - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Race condition in InvalidateObsoleteReplicationSlots()
Msg-id 1510400.1624209571@sss.pgh.pa.us
In response to Re: Race condition in InvalidateObsoleteReplicationSlots()  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Race condition in InvalidateObsoleteReplicationSlots()  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
I wrote:
> Hmm ... desmoxytes has failed this test once, out of four runs since
> it went in:
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2021-06-19%2003%3A06%3A04

I studied this failure a bit more, and I think the test itself has
a race condition.  It's doing

# freeze walsender and walreceiver. Slot will still be active, but walreceiver
# won't get anything anymore.
kill 'STOP', $senderpid, $receiverpid;
$logstart = get_log_size($node_primary3);
advance_wal($node_primary3, 4);
ok(find_in_log($node_primary3, "to release replication slot", $logstart),
    "walreceiver termination logged");

The string it's looking for does show up in node_primary3's log, but
not for another second or so; we can see instances of the following
poll_query_until query before that happens.  So the problem is that
there is no interlock to ensure that the walreceiver terminates
before this find_in_log check looks for it.

You should be able to fix this by adding a retry loop around the
find_in_log check (which would likely mean that you don't need
to do multiple advance_wal iterations here).
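As a rough illustration of such a retry loop, here is a minimal
standalone sketch.  The helper name wait_for_log, its arguments, and
the 100ms polling interval are assumptions made for illustration; they
are not part of the test as posted, and the eventual fix may well look
different.

```perl
use strict;
use warnings;
use Time::HiRes qw(usleep);

# Hypothetical helper: poll $logfile for the literal string $pattern,
# looking only at data past byte offset $offset, until the string shows
# up or roughly $timeout_secs seconds have elapsed.
# Returns 1 if the string was found, 0 on timeout.
sub wait_for_log
{
	my ($logfile, $pattern, $offset, $timeout_secs) = @_;
	my $max_attempts = $timeout_secs * 10;    # one attempt per 100ms

	for (my $i = 0; $i < $max_attempts; $i++)
	{
		if (open(my $fh, '<', $logfile))
		{
			# Skip everything that was logged before the test step began.
			seek($fh, $offset, 0);
			local $/;    # slurp the rest of the file in one read
			my $contents = <$fh>;
			close($fh);
			return 1
			  if defined $contents && $contents =~ m/\Q$pattern\E/;
		}
		# Not there yet: give the server time to act, then retry.
		usleep(100_000);
	}
	return 0;
}
```

In the test, the single find_in_log check would then become a call
along the lines of wait_for_log(<path to node_primary3's log>,
"to release replication slot", $logstart, <some timeout>) wrapped in
ok(), so that the check waits for the message instead of racing
against it.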

However, I agree with reverting the test for now and then trying
again after beta2.

            regards, tom lane


