Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)? - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?
Date
Msg-id 28f5a502-8f27-94af-6d5a-ed939be87125@iki.fi
Whole thread Raw
In response to Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?  (Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>)
Responses Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?
List pgsql-hackers
On 24/04/18 13:57, Kyotaro HORIGUCHI wrote:
> At Mon, 23 Apr 2018 03:41:47 -0400, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<89e33d4f-5c75-0738-3dcb-44c4df59e920@iki.fi>
>> Looking at the patch linked above
>> (https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp):
>>
>>> --- a/src/backend/access/transam/xlog.c
>>> +++ b/src/backend/access/transam/xlog.c
>>> @@ -11693,6 +11693,10 @@ retry:
>>>       Assert(reqLen <= readLen);
>>>        *readTLI = curFileTLI;
>>> +
>>> + if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
>>> readBuf))
>>> +        goto next_record_is_invalid;
>>> +
>>>       return readLen;
>>>    next_record_is_invalid:
>>
>> What is the point of adding this XLogReaderValidatePageHeader() call?
>> The caller of this callback function, ReadPageInternal(), checks the
>> page header anyway. Earlier in this thread, you said:
> 
> Without the lines, server actually fails to start replication.
> 
> (I try to remember the details...)
> 
> The difference is whether the function can cause retry for the
> same portion of a set of continued records (without changing the
> plugin API). XLogPageRead can do that. On the other hand all
> ReadPageInternal can do is just letting the caller ReadRecords
> retry from the beginning of the set of continued records since
> the caller side handles only complete records.
> 
> In the failure case, in XLogPageRead, when read() fails, it can
> try the next source midst of a continued records. On the other
> hand if the segment was read but it was recycled one, it passes
> "success" to ReadPageInternal and leads to retring from the
> beginning of the recrod. Infinite loop.

I see. You have the same problem if you have a WAL file that's corrupt 
in some other way, but one of the sources would have a correct copy. 
ValidXLogPageHeader() only checks the page header, after all. But unlike 
a recycled WAL segment, that's not supposed to happen as part of normal 
operation, so I guess we can live with that.

> Calling XLogReaderValidatePageHeader in ReadPageInternal is
> redundant, but removing it may be interface change of xlogreader
> plugin and I am not sure that is allowed.

We should avoid doing that in back-branches, at least. But in 'master', 
I wouldn't mind redesigning the API. Dealing with all the retrying is 
pretty complicated as it is, if we can simplify that somehow, I think 
that'd be good.

- Heikki


pgsql-hackers by date:

Previous
From: Aleksandr Parfenov
Date:
Subject: Optimze usage of immutable functions as relation
Next
From: Andrew Gierth
Date:
Subject: Re: Optimze usage of immutable functions as relation