Re: Race condition in InvalidateObsoleteReplicationSlots() - Mailing list pgsql-hackers

From Álvaro Herrera
Subject Re: Race condition in InvalidateObsoleteReplicationSlots()
Date
Msg-id 202106111627.t4yrl54wtpsf@alvherre.pgsql
Whole thread Raw
In response to Re: Race condition in InvalidateObsoleteReplicationSlots()  (Álvaro Herrera <alvherre@alvh.no-ip.org>)
Responses Re: Race condition in InvalidateObsoleteReplicationSlots()  (Álvaro Herrera <alvherre@alvh.no-ip.org>)
Re: Race condition in InvalidateObsoleteReplicationSlots()  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 2021-Jun-10, Álvaro Herrera wrote:

> Here's a version that I feel is committable (0001).  There was an issue
> when returning from the inner loop, if in a previous iteration we had
> released the lock.  In that case we need to return with the lock not
> held, so that the caller can acquire it again, but weren't.  This seems
> pretty hard to hit in practice (I suppose somebody needs to destroy the
> slot just as checkpointer killed the walsender, but before checkpointer
> marks it as its own) ... but if it did happen, I think checkpointer
> would block with no recourse.  Also added some comments and slightly
> restructured the code.
> 
> There are plenty of conflicts in pg13, but it's all easy to handle.

Pushed, with additional minor changes.

> I wrote a test (0002) to cover the case of signalling a walsender, which
> is currently not covered (we only deal with the case of a standby that's
> not running).  There are some sharp edges in this code -- I had to make
> it use background_psql() to send a CHECKPOINT, which hangs, because I
> previously send a SIGSTOP to the walreceiver.  Maybe there's a better
> way to achieve a walreceiver that remains connected but doesn't consume
> input from the primary, but I don't know what it is.  Anyway, the code
> becomes covered with this.  I would like to at least see it in master,
> to gather some reactions from buildfarm.

I tried hard to make this stable, but it just isn't (it works fine one
thousand runs, then I grab some coffee and run it once more and that one
fails.  Why?  that's not clear to me).  Attached is the last one I have,
in case somebody wants to make it better.  Maybe there's some completely
different approach that works better, but I'm out of ideas for now.

-- 
Álvaro Herrera       Valdivia, Chile
"La experiencia nos dice que el hombre peló millones de veces las patatas,
pero era forzoso admitir la posibilidad de que en un caso entre millones,
las patatas pelarían al hombre" (Ijon Tichy)

Attachment

pgsql-hackers by date:

Previous
From: Fujii Masao
Date:
Subject: Re: Transactions involving multiple postgres foreign servers, take 2
Next
From: Bruce Momjian
Date:
Subject: Re: "an SQL" vs. "a SQL"