Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)? - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)? |
Date | |
Msg-id | CAB7nPqR+J1Xw+yzfsrehiQ+rh3ac+n5sEUgP7UOQ4_ymFnO9wg@mail.gmail.com Whole thread Raw |
In response to | Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)? (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>) |
Responses |
Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?
|
List | pgsql-hackers |
On Thu, Sep 7, 2017 at 12:33 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote: > At Wed, 6 Sep 2017 12:23:53 -0700, Andres Freund <andres@anarazel.de> wrote in <20170906192353.ufp2dq7wm5fd6qa7@alap3.anarazel.de> >> I'm not following. All we need to use is the beginning of the relevant >> records, that's easy enough to keep track of. We don't need to read the >> WAL or anything. > > The beginning is already tracked and nothing more to do. I have finally allocated time to look at your newly-proposed patch, sorry for the time it took. Patch 0002 forgot to include sys/stat.h to allow the debugging tool to compile :) > The first *problem* was WaitForWALToBecomeAvailable requests the > beginning of a record, which is not on the page the function has > been told to fetch. Still tliRecPtr is required to determine the > TLI to request, it should request RecPtr to be streamed. [...] > The rest to do is let XLogPageRead retry other sources > immediately. To do this I made ValidXLogPageHeader@xlogreader.c > public (and renamed to XLogReaderValidatePageHeader). > > The patch attached fixes the problem and passes recovery > tests. However, the test for this problem is not added. It needs > to go to the last page in a segment then put a record continues > to the next segment, then kill the standby after receiving the > previous segment but before receiving the whole record. +typedef struct XLogPageHeaderData *XLogPageHeader; [...] +/* Validate a page */ +extern bool XLogReaderValidatePageHeader(XLogReaderState *state, + XLogRecPtr recptr, XLogPageHeader hdr); Instead of exposing XLogPageHeaderData, I would recommend just using char* and remove this declaration. The comment on top of XLogReaderValidatePageHeader needs to make clear what caller should provide. + if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, + (XLogPageHeader) readBuf)) + goto next_record_is_invalid; + [...] - ptr = tliRecPtr; + ptr = RecPtr; tli = tliOfPointInHistory(tliRecPtr, expectedTLEs); if (curFileTLI > 0 && tli < curFileTLI) The core of the patch is here (the patch has no comment so it is hard to understand what's the point of what is being done), and if I understand that correctly, you allow the receiver to fetch the portions of a record spawned across multiple segments from different sources, and I am not sure that we'd want to break that promise. Shouldn't we instead have the receiver side track the beginning of the record and send that position for the physical slot's restart_lsn? This way the receiver would retain WAL segments from the real beginning of a record. restart_lsn for replication slots is set when processing the standby message in ProcessStandbyReplyMessage() using now the flush LSN, so a more correct value should be provided using that. Andres, what's your take on the matter? -- Michael -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
pgsql-hackers by date: