Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)? - Mailing list pgsql-hackers

| From | Andres Freund |
|---|---|
| Subject | Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)? |
| Date | |
| Msg-id | 20180118195252.hyxgkb3krcqjzfjm@alap3.anarazel.de |
| In response to | Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)? (Michael Paquier <michael.paquier@gmail.com>) |
| Responses | Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)? |
| List | pgsql-hackers |
Hi,

On 2018-01-18 20:58:27 +0900, Michael Paquier wrote:
> On Wed, Jan 17, 2018 at 07:38:32AM -0500, Stephen Frost wrote:
> > Thanks for looking at this patch previously.  This patch is still in
> > Needs Review, though it's not clear whether that's correct; this seems
> > to be a bug fix, so it would be great if we could make some progress on
> > it (particularly since it was reported over a year ago and goes back to
> > 9.4, according to the thread, from what I can tell).
> >
> > Any chance one of you can provide some further thoughts on it, or make
> > it clear whether we are waiting for a new patch or the existing one is
> > still reasonable and just needs to be reviewed (in which case, perhaps
> > one of you could help with that..)?
>
> Here is my input, then, with a summary of this thread.  The base
> problem here is that WAL segments are not retained on a primary when
> using a replication slot if a record begins at a segment boundary.
>
> 1) On the primary side, correctly track the beginning of a record, so
> that when a record runs across multiple segments, the slot does not
> recycle them.

I think that's a really bad idea.  There's no bug on the sender side
here.  The sender allows removal of WAL based on what the receiver acks.
Tracking record boundaries there would require that all senders have a
low-level understanding of what's being sent - not a good idea, as we
e.g. IMO want to have tools that serve WAL over the streaming protocol
from an archive.

> 2) On the standby side, we also need to track the beginning of the
> record, so that we don't risk recycling things because of restart
> points.
>
> A couple of approaches have been discussed on this thread:
>
> a) The first patch did things on the primary side by opening an
> XLogReader, which was in charge of checking whether the record we are
> looking at is at a segment boundary.  On top of being costly, this
> basically achieved 1) but failed on 2).

I am very, very strongly against this approach.

> b) The second patch I would like to mention does things on the standby
> side, by making it possible to change the WAL source when fetching a
> single record, so that the beginning of a cross-segment record can be
> fetched from, say, an archive or the local xlog, with the rest then
> requested from the primary.  This patch can be found in
> https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp
> and the diff in WaitForWALToBecomeAvailable() scares me a lot, because
> it seems to me that this actually breaks timeline switch consistency.
> The concept of switching a WAL source in the middle of a WAL segment is
> risky anyway, because we perform sanity checks while switching sources.
> The bug we are dealing with here is very rare, and seeing a problem
> like that during a timeline switch is even rarer, but the risk is not
> zero and we may finish with a corrupted instance, so we should not, in
> my opinion, take this approach.  Also, this actually achieves point 2)
> above (though not completely), but not 1).

I don't agree with the sentiment about the approach, but I agree there
might be weaknesses in the implementation (which I have not reviewed).
I think it's perfectly sensible to require fetching one segment from
pg_xlog and the next one via another method.  Currently the inability
to do so often leads to us constantly refetching the same segment.
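To make the boundary condition concrete, here is a minimal sketch of
the retention problem (plain C; the 16MB default segment size is
hardcoded, and the helper name lsn_to_segno is invented for this
illustration - it is not PostgreSQL's XLByteToSeg, though it computes
the same mapping for this segment size):

#include <stdint.h>
#include <stdio.h>

#define WAL_SEGMENT_SIZE ((uint64_t) 16 * 1024 * 1024)

typedef uint64_t XLogRecPtr;

/* Which segment does an LSN fall into? (cf. XLByteToSeg) */
static uint64_t
lsn_to_segno(XLogRecPtr lsn)
{
    return lsn / WAL_SEGMENT_SIZE;
}

int
main(void)
{
    /* A record whose first bytes sit at the very end of segment 2 ... */
    XLogRecPtr record_start = 3 * WAL_SEGMENT_SIZE - 8;
    /* ... and whose last byte is flushed, and acked, in segment 3. */
    XLogRecPtr flush_ptr = 3 * WAL_SEGMENT_SIZE + 128;

    printf("record starts in segment %llu, ack points into segment %llu\n",
           (unsigned long long) lsn_to_segno(record_start),
           (unsigned long long) lsn_to_segno(flush_ptr));

    /*
     * If the slot is advanced to flush_ptr, the primary considers
     * segment 2 recyclable - but after a restart the standby must
     * re-read the whole record from record_start, which is still in
     * segment 2.
     */
    return 0;
}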
> An approach I have been thinking about to address points 1) and 2) is
> to introduce a new message type in the replication protocol, which can
> be used to report the position a replication slot should hold.  It
> would have been tempting to extend the existing standby feedback
> message, but we cannot afford a protocol break either.  So my proposal
> is to create a new replication message which can be used in the
> context of a replication slot instead of the usual standby feedback
> message; let's call it the slot feedback message.  This is similar to
> the standby message, except that it also reports the LSN the slot
> should wait for.  This results in an addition of... 8 bytes in the
> message, which is a lot.  I have thought about the possibility of
> reducing it to 4 bytes, but one record can span more than 4GB worth of
> segments, right?  (Note that the replication slot data should be
> updated with that instead of the flush LSN.)

-0.5; I think this just adds complexity instead of fixing the
underlying problem.

Greetings,

Andres Freund
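For concreteness, the message described above might look roughly like
this on the wire (a sketch only: the struct, tag, and field names are
invented here to illustrate the proposal, and the actual replication
protocol sends such fields individually in network byte order rather
than as a packed struct):

#include <stdint.h>

/*
 * Hypothetical layout of the proposed "slot feedback" message.  The
 * first four fields mirror the existing standby status update ('r');
 * keep_lsn is the addition the proposal discusses.
 */
typedef struct SlotFeedbackMessage
{
    char     msgtype;      /* new tag, distinct from 'r' (standby feedback) */
    uint64_t write_lsn;    /* last WAL position written by the standby */
    uint64_t flush_lsn;    /* last WAL position flushed to disk */
    uint64_t apply_lsn;    /* last WAL position replayed */
    uint64_t keep_lsn;     /* start LSN of the oldest record still needed;
                            * these are the extra 8 bytes - 4 would not be
                            * enough, since one record can span more than
                            * 4GB worth of segments */
    int64_t  send_time;    /* client timestamp */
    char     reply_requested;
} SlotFeedbackMessage;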