Re: [HACKERS] Restricting maximum keep segments by repslots - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: [HACKERS] Restricting maximum keep segments by repslots
Date
Msg-id 20200406221555.GA16211@alvherre.pgsql
Whole thread Raw
In response to Re: [HACKERS] Restricting maximum keep segments by repslots  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Responses Re: [HACKERS] Restricting maximum keep segments by repslots  (Alvaro Herrera <alvherre@2ndquadrant.com>)
List pgsql-hackers
On 2020-Apr-06, Alvaro Herrera wrote:

> Lastly, I noticed that we're now changing the slot's restart_lsn to
> Invalid without being the slot's owner, which goes counter to what is
> said in slot.h:
> 
>  * - Individual fields are protected by mutex where only the backend owning
>  * the slot is authorized to update the fields from its own slot.  The
>  * backend owning the slot does not need to take this lock when reading its
>  * own fields, while concurrent backends not owning this slot should take the
>  * lock when reading this slot's data.
> 
> What this means is that if the slot owner walsender updates the
> restart_lsn to a newer value just as we (checkpointer) are trying to set
> it to Invalid, the owner's value might persist and our value would be
> lost.
> 
> AFAICT if we were really stressed about getting this exactly correct,
> then we would have to kill the walsender, wait for it to die, then
> ReplicationSlotAcquire and *then* update
> MyReplicationSlot->data.restart_lsn.

So I had cold feet about the whole business of trying to write a
non-owned replication slot, so I tried to implemented the "exactly
correct" idea above.  That's v25 here.

I think there's a race condition in this: if we kill a walsender and it
restarts immediately before we (checkpoint) can acquire the slot, we
will wait for it to terminate on its own.  Fixing this requires changing
the ReplicationSlotAcquire API so that it knows not to wait but not
raise error either (so we can use an infinite loop: "acquire, if busy
send signal")

I also include a separate diff for a change that might or might not be
necessary, where xmins reserved by slots with restart_lsn=invalid are
ignored.  I'm not yet sure that we should include this, but we should
keep an eye on it.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Next
From: Tom Lane
Date:
Subject: Re: ERROR: invalid input syntax for type circle