Re: pg_replication_slot_advance to return NULL instead of 0/0 if slotnot advanced - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: pg_replication_slot_advance to return NULL instead of 0/0 if slotnot advanced
Date
Msg-id CANP8+jLyS=X-CAk59BJnsxKQfjwrmKicHQykyn52Qj-Q=9GLCw@mail.gmail.com
Whole thread Raw
In response to Re: pg_replication_slot_advance to return NULL instead of 0/0 ifslot not advanced  (Michael Paquier <michael@paquier.xyz>)
Responses Re: pg_replication_slot_advance to return NULL instead of 0/0 ifslot not advanced  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
On 28 May 2018 at 09:57, Michael Paquier <michael@paquier.xyz> wrote:
> On Fri, May 25, 2018 at 02:12:32PM +0200, Magnus Hagander wrote:
>> I agree that returning 0/0 on this is wrong.
>>
>> However, can this actually occour for any case other than exactly the case
>> of "moving the position to where the position already is"? If I look at the
>> physical slot path at least that seems to be the only case, and in that
>> case I think the correct thing to return would be the new position, and not
>> NULL. If we actually *fail* to move the position, we give an error.
>
> Yes, this only returns InvalidXLogRecPtr if the location could not be
> moved forward.  Still, there is more going on here.  For a physical
> slot, confirmed_lsn is always 0/0, hence the backward check is never
> applied for it.  What I think should be used for value assigned to
> startlsn is restart_lsn instead.  Then what do we do if the position
> cannot be moved: should we raise an error, as what my patch attached
> does, or just report the existing position the slot is at?

I don't see why an ERROR would be appropriate.

> A second error that I can see is in pg_logical_replication_slot_advance,
> which does not take the mutex lock of MyReplicationSlot, so concurrent
> callers like those of ReplicationSlotsComputeRequiredLSN (applies to
> physical slot actually) and pg_get_replication_slots() may read false
> data.
>
> On top of that, it seems to me that StartLogicalReplication is reading a
> couple of fields from a slot without taking a lock, so at least
> pg_get_replication_slots() may read incorrect data.
> ReplicationSlotReserveWal also is missing a lock..  Those are older than
> v11 though.
>
>> Actually, isn't there also a race there? That is, if we try to move it, we
>> check that we're not trying to move it backwards, and throw an error, but
>> that's checked outside the lock. Then later we actually move it, and check
>> *again* if we try to move it backwards, but if that one fails we return
>> InvalidXLogRecPtr (which can happen in the case of concurrent activity on
>> the slot, I think)? In this case, maybe we should just re-check that and
>> raise an error appropriately?
>
> Er, isn't the take here that ReplicationSlotAcquire is used to lock any
> replication slot update to happen from other backends?  It seems to me
> that what counts at the end if if a backend PID is associated to a slot
> in memory.  If you look at the code paths updating a logical or physical
> slot then those imply that the slot is owned, still a spin lock needs to
> be taken for concurrent readers.
>
>> (I haven't looked at the logical slot path, but I assume it would have
>> something similar in it)
>
> Found one.  All the things I have spotted are in the patch attached.

I think the problem here is there are no comments explaining how to
access the various fields in the structure, so there was no way to
check whether the code was good or not.

If we add corrective code we should clarify that in comments the .h
file also, as is done in XlogCtl

Your points look correct to me, well spotted. I'd like to separate the
correction of these issues from the change of behavior patch. Those
earlier issues can be backpatched, but the change of behavior only
affects PG11.

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: PostgreSQL 11 beta1 on AIX 7.2 : 2 failures in 32bit mode
Next
From: Simon Riggs
Date:
Subject: Re: SHOW ALL does not honor pg_read_all_settings membership