Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)? - Mailing list pgsql-hackers

| From | Michael Paquier |
|---|---|
| Subject | Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)? |
| Date | |
| Msg-id | 20180118115827.GA12826@paquier.xyz |
| In response to | Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)? (Stephen Frost <sfrost@snowman.net>) |
| Responses | Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)? |
| List | pgsql-hackers |
On Wed, Jan 17, 2018 at 07:38:32AM -0500, Stephen Frost wrote:
> Thanks for looking at this patch previously. This patch is still in
> Needs Review but it's not clear if that's correct or not, but this seems
> to be a bug-fix, so it would be great if we could make some progress on
> it (particularly since it was reported over a year ago and goes back to
> 9.4, according to the thread, from what I can tell).
>
> Any chance one of you can provide some further thoughts on it or make it
> clear if we are waiting for a new patch or if the existing one is still
> reasonable and just needs to be reviewed (in which case, perhaps one of
> you could help with that..)?

Here is my input, then, with a summary regarding this thread.

The base problem here is that WAL segments are not retained on a primary when using a replication slot if a record begins at a segment boundary. When in recovery and trying to fetch such a record from a source, the standby tries to fetch the record from its beginning, but the segment holding that beginning could already have been recycled, which breaks the contract that the replication slot provides.

If we want to get things correctly addressed, I think that we want two things:
1) On the primary side, correctly track the beginning of a record, so that when a record runs across multiple segments the slot does not let them be recycled.
2) On the standby side, track the beginning of the record as well, so that we don't risk recycling things because of restart points.

A couple of approaches have been discussed on this thread:

a) The first patch did things on the primary side by opening an XLogReader in charge of checking whether the record we are looking at sits at a segment boundary. On top of being costly, this basically achieved 1) but failed on 2).

b) The second patch I would like to mention does things on the standby side by making it possible to change the WAL source while fetching a single record, so that the beginning of a cross-segment record can be fetched from, say, an archive or the local xlog, with the rest then requested from the primary. This patch can be found in https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp and the diff in WaitForWALToBecomeAvailable() scares me a lot, because it seems to me that it actually breaks timeline switch consistency. The concept of switching a WAL source in the middle of a WAL segment is risky anyway, because we perform sanity checks while switching sources. The bug we are dealing with here is very rare, and seeing a problem like that during a timeline switch is even rarer, but the risk is not zero and we may finish with a corrupted instance, so in my opinion we should not take this approach. Also, this actually achieves point 2) above (though not completely), but not 1).

An approach I have been thinking about to address points 1) and 2) is to introduce a new message type in the replication protocol which can also be used to report the position at which a replication slot should be held. It would have been tempting to extend the existing standby feedback message, but we cannot afford a protocol break either. So my proposal is to create a new replication message which can be used in the context of a replication slot instead of the usual standby feedback message; let's call it the slot feedback message. This is similar to the standby message, except that it also reports the LSN the slot should wait for. This results in an addition of... 8 bytes in the message, which is a lot.
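To make the byte count concrete, here is a minimal sketch of what such a slot feedback payload could carry, modeled on the existing standby status update but with one extra 64-bit LSN. The message byte 'z', the field order, and the helper names are assumptions for illustration only, not an actual protocol definition.

```c
/*
 * Hypothetical sketch of a "slot feedback" message: the standby status
 * update fields plus one extra LSN, the start of the oldest record the
 * slot must keep available.  Illustrative only.
 */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>          /* htonl, for big-endian wire order */

typedef uint64_t XLogRecPtr;    /* LSNs are 64-bit in PostgreSQL */

static size_t
append_uint64_be(char *buf, uint64_t val)
{
    uint32_t hi = htonl((uint32_t) (val >> 32));
    uint32_t lo = htonl((uint32_t) val);

    memcpy(buf, &hi, 4);
    memcpy(buf + 4, &lo, 4);
    return 8;
}

/*
 * Build the payload: type byte, write LSN, flush LSN, apply LSN,
 * record-start LSN (the extra 8 bytes), send timestamp, reply flag.
 * Returns the number of bytes written into buf.
 */
static size_t
build_slot_feedback(char *buf,
                    XLogRecPtr write_lsn,
                    XLogRecPtr flush_lsn,
                    XLogRecPtr apply_lsn,
                    XLogRecPtr record_start_lsn,
                    int64_t send_time,
                    uint8_t reply_requested)
{
    size_t      off = 0;

    buf[off++] = 'z';                                     /* hypothetical message type */
    off += append_uint64_be(buf + off, write_lsn);
    off += append_uint64_be(buf + off, flush_lsn);
    off += append_uint64_be(buf + off, apply_lsn);
    off += append_uint64_be(buf + off, record_start_lsn); /* the extra 8 bytes */
    off += append_uint64_be(buf + off, (uint64_t) send_time);
    buf[off++] = reply_requested;

    return off;                                           /* 42 bytes vs. 34 today */
}
```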
I have thought about the possibility of reducing it to 4 bytes, but a single record can span across more than 4GB worth of segments, right? (Note that the replication slot data should be updated with that LSN instead of the flush LSN.) Would love to hear thoughts from others. Of course, feel free to correct me as needed.
--
Michael
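A minimal sketch of the primary-side counterpart, assuming a slot feedback message like the one above: the walsender would hold the slot at the reported record-start LSN rather than at the flushed LSN when deciding what may be recycled. The structure and function names below are illustrative, not PostgreSQL internals.

```c
/*
 * Hypothetical walsender-side handling of a slot feedback message:
 * keep the slot pinned at the start of the oldest record still needed,
 * so segments containing the beginning of an in-progress record are
 * never recycled.  Illustrative names only.
 */
#include <stdint.h>

typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

typedef struct ReplicationSlotSketch
{
    XLogRecPtr  restart_lsn;    /* segments at or after this LSN are kept */
} ReplicationSlotSketch;

static void
process_slot_feedback(ReplicationSlotSketch *slot,
                      XLogRecPtr flush_lsn,
                      XLogRecPtr record_start_lsn)
{
    /*
     * Prefer the reported start of the oldest record still needed; fall
     * back to the flush LSN if the standby did not report one.  Never
     * move the slot backwards.
     */
    XLogRecPtr  hold_at;

    if (record_start_lsn != InvalidXLogRecPtr && record_start_lsn < flush_lsn)
        hold_at = record_start_lsn;
    else
        hold_at = flush_lsn;

    if (hold_at > slot->restart_lsn)
        slot->restart_lsn = hold_at;
}
```

The sketch keeps full 64-bit LSNs on both sides; a 4-byte delta relative to the flush LSN would overflow once a single record spans more than 4GB of WAL, which is the concern raised above.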