Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)? - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
Date
Msg-id 20180118195252.hyxgkb3krcqjzfjm@alap3.anarazel.de
In response to Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?  (Michael Paquier <michael.paquier@gmail.com>)
Responses Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?
List pgsql-hackers
Hi,

On 2018-01-18 20:58:27 +0900, Michael Paquier wrote:
> On Wed, Jan 17, 2018 at 07:38:32AM -0500, Stephen Frost wrote:
> > Thanks for looking at this patch previously.  This patch is still in
> > Needs Review, though it's not clear whether that's correct; this seems
> > to be a bug fix, so it would be great if we could make some progress on
> > it (particularly since it was reported over a year ago and goes back to
> > 9.4, according to the thread, from what I can tell).
> > 
> > Any chance one of you can provide some further thoughts on it or make it
> > clear if we are waiting for a new patch or if the existing one is still
> > reasonable and just needs to be reviewed (in which case, perhaps one of
> > you could help with that..)?
> 
> Here is my input then with a summary regarding this thread. The base
> problem here is that WAL segments are not retained on a primary when
> using a replication slot if a record begins at a segment boundary.

> 1) On the primary side, correctly track the beginning of a record so
> that when a record runs across multiple segments, the slot does not
> recycle them.

I think that's a really bad idea. There's no bug on the sender side
here: the sender allows removal of WAL based on what the receiver
acks. Fixing it there would require that all senders have a low-level
understanding of what's being sent - not a good idea, since we will, in
my opinion, want tools that serve WAL over the streaming protocol from
an archive.
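To make the failure mode concrete, here is a small sketch of the segment arithmetic involved (illustrative Python, not PostgreSQL code; the segment size is the default, and all names are invented). It shows why retention keyed only on the acked flush position can recycle a segment that still holds the start of an in-progress record:

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default 16MB WAL segments

def segment_containing(lsn: int) -> int:
    """Segment number that holds the given byte position (LSN)."""
    return lsn // WAL_SEGMENT_SIZE

# A record starts near the end of segment 4 and continues into
# segment 5.  The standby has flushed everything up to the segment
# boundary and acks exactly that position.
record_start = 5 * WAL_SEGMENT_SIZE - 8   # last bytes of segment 4
acked_flush = 5 * WAL_SEGMENT_SIZE        # exactly the boundary

# Retention keyed only on the acked LSN keeps segment 5 onward ...
oldest_kept = segment_containing(acked_flush)
# ... but re-reading the partially-received record needs segment 4.
needed = segment_containing(record_start)

print(oldest_kept, needed)  # 5 4 -> segment 4 can be recycled too early
```

If the standby later has to re-read the whole record, segment 4 is already gone on the primary, which is the retention gap the thread is about.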


> 2) On the standby side, we also need to track the beginning of the
> record so that we don't risk recycling things because of restart
> points.


> On this thread have been discussed a couple of approaches:
> a) The first patch did things on the primary side by opening an
> XLogReader which was in charge of checking whether the record we are
> looking at is at a segment boundary. On top of being costly, this
> basically achieved 1) but failed on 2).

I am very very strongly against this approach.


> b) The second patch that I would like to mention does things on the
> standby side by being able to change the WAL source when fetching a
> single record, so that it is possible to fetch the beginning of a
> cross-segment record from, say, an archive or the local xlog, and then
> request the rest from the primary. This patch can be found in
> https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp
> and the diff in WaitForWALToBecomeAvailable() scares me a lot because
> it seems to me that this actually breaks timeline switch
> consistency. The concept of switching a WAL source in the middle of a
> WAL segment is risky anyway, because we perform sanity checks while
> switching sources. The bug we are dealing with here is very rare, and
> seeing a problem like that for a timeline switch is even more rare,
> but the risk is not zero and we may end up with a corrupted instance,
> so we should not in my opinion take this approach. Also, this achieves
> point 2) above (though not completely), but not 1).

I don't agree with the sentiment about the approach, but I agree there
might be weaknesses in the implementation (which I have not reviewed). I
think it's perfectly sensible to require fetching one segment from
pg_xlog and the next one via another method. Currently the inability to
do so often leads to us constantly refetching the same segment.
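A rough sketch of what per-segment source selection could look like (purely illustrative Python, not the actual WaitForWALToBecomeAvailable() state machine; every name here is invented):

```python
from enum import Enum, auto

class WalSource(Enum):
    PG_WAL = auto()    # segments already present in the local pg_xlog
    ARCHIVE = auto()   # segments restored via restore_command
    STREAM = auto()    # segments streamed from the primary

def read_cross_segment_record(segments, fetch):
    """Assemble a record spanning several segments, letting each
    segment come from a different source instead of insisting that one
    source serve the whole record (which forces refetching the same
    segment whenever the current source lacks the next one)."""
    data = b""
    for seg in segments:
        for source in WalSource:  # try sources in preference order
            chunk = fetch(seg, source)
            if chunk is not None:
                data += chunk
                break
        else:
            raise IOError(f"segment {seg} unavailable from any source")
    return data

# Demo: segment 4 exists only locally, segment 5 only on the primary.
available = {(4, WalSource.PG_WAL): b"tail-of-record|",
             (5, WalSource.STREAM): b"rest-of-record"}
record = read_cross_segment_record([4, 5],
                                   lambda seg, src: available.get((seg, src)))
print(record)  # b'tail-of-record|rest-of-record'
```

The point of the sketch is only the control flow: the fallback happens per segment, so a record straddling a boundary does not force the reader back to square one.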


> An approach I have been thinking about to address points 1) and 2) is
> to introduce a new message type in the replication protocol which can
> be used to report the position a replication slot should hold. It
> would have been tempting to extend the existing standby feedback
> message, but we cannot afford a protocol break either. So my proposal
> is to create a new replication message which can be used in the
> context of a replication slot instead of the usual standby feedback
> message; let's call it the slot feedback message. This is similar to
> the standby message, except that it also reports the LSN the slot
> should wait for. This results in an addition of... 8 bytes to the
> message, which is a lot. I have thought about the possibility of
> reducing it to 4 bytes, but one record can span more than 4GB worth of
> segments, right? (Note that replication slot data should be updated
> with that instead of the flush LSN.)

-0.5, I think this just adds complexity instead of fixing the underlying
problem.
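For concreteness, the size argument can be checked with a sketch. The 'r' layout below follows the documented standby status update message; the 's' slot feedback message is hypothetical, invented here only to illustrate the 8-byte overhead Michael mentions:

```python
import struct

def pack_standby_feedback(write_lsn, flush_lsn, apply_lsn, now_us, reply=0):
    """Existing standby status update: 'r', write/flush/apply LSNs
    (8 bytes each), a timestamp in microseconds, and a reply flag."""
    return struct.pack("!cQQQQB", b"r", write_lsn, flush_lsn,
                       apply_lsn, now_us, reply)

def pack_slot_feedback(write_lsn, flush_lsn, apply_lsn,
                       hold_lsn, now_us, reply=0):
    """Hypothetical slot feedback message: the same fields plus the LSN
    the slot should hold (start of the oldest record still needed).
    The 's' message byte is made up for this illustration."""
    return struct.pack("!cQQQQQB", b"s", write_lsn, flush_lsn,
                       apply_lsn, hold_lsn, now_us, reply)

overhead = (len(pack_slot_feedback(0, 0, 0, 0, 0))
            - len(pack_standby_feedback(0, 0, 0, 0)))
print(overhead)  # 8 -> the extra 8 bytes per feedback message
```

A 4-byte field would cap the distance at 4GB, which is why the quoted proposal keeps the full 8 bytes: a single record can span more WAL than that.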

Greetings,

Andres Freund

