Thread: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Kyotaro HORIGUCHI

Date:

28 August 2017, 14:02:40

Hello,

This problem still occurs on the master.
I rebased this to the current master.

At Mon, 3 Apr 2017 08:38:47 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqT8dQk_Ce29YQ0CKAQ7htLDyUHNdFv6dELe4PkYr3SSjA@mail.gmail.com>
> On Mon, Apr 3, 2017 at 7:19 AM, Venkata B Nagothi <nag1010@gmail.com> wrote:
> > As we are already past the commitfest, I am not sure, what should i change
> > the patch status to ?
> 
> The commit fest finishes on the 7th of April. Even with the deadline
> passed, there is nothing preventing to work on bug fixes. So this item
> ought to be moved to the next CF with the same category.

The steps to reproduce the problem follows.

- Apply the second patch (0002-) attached and recompile. It effectively reproduces the problematic state of database.

- M(aster): initdb the master with wal_keep_segments = 0           (default), log_min_messages = debug2
- M: Create a physical repslot.
- S(tandby): Setup a standby database.
- S: Edit recovery.conf to use the replication slot above then    start it.
- S: touch /tmp/hoge
- M: Run pgbench ...
- S: After a while, the standby stops. > LOG:  #################### STOP THE SERVER

- M: Stop pgbench.
- M: Do 'checkpoint;' twice.
- S: rm /tmp/hoge
- S: Fails to catch up with the following error.
 > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000002B has already
beenremoved
 


The first patch (0001-) fixes this problem, preventing the
problematic state of WAL segments by retarding restart LSN of a
physical replication slot in a certain condition.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From

Michael Paquier

Date:

28 August 2017, 14:14:54

On Mon, Aug 28, 2017 at 8:02 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> The first patch (0001-) fixes this problem, preventing the
> problematic state of WAL segments by retarding restart LSN of a
> physical replication slot in a certain condition.

FWIW, I have this patch marked on my list of things to look at, so you
can count me as a reviewer. There are also some approaches that I
would like to test because I rely on replication slots for some
infrastructure. Still...

+            if (oldFlushPtr != InvalidXLogRecPtr &&
+                (restartLSN == InvalidXLogRecPtr ?
+                 oldFlushPtr / XLOG_SEG_SIZE != flushPtr / XLOG_SEG_SIZE :
+                 restartLSN / XLOG_BLCKSZ != flushPtr / XLOG_BLCKSZ))
I find such code patterns not readable.
-- 
Michael

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Kyotaro HORIGUCHI

Date:

04 September 2017, 09:51:51

Hello,

Thank you for reviewing this.

At Mon, 28 Aug 2017 20:14:54 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqT03+uaHXun3ft4LJWNDviKTgWSZDsXiqyNdtcCfeqcgg@mail.gmail.com>
> On Mon, Aug 28, 2017 at 8:02 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > The first patch (0001-) fixes this problem, preventing the
> > problematic state of WAL segments by retarding restart LSN of a
> > physical replication slot in a certain condition.
> 
> FWIW, I have this patch marked on my list of things to look at, so you
> can count me as a reviewer. There are also some approaches that I
> would like to test because I rely on replication slots for some
> infrastructure. Still...

This test patch modifies the code for easiness. The window for
this bug to occur is from receiving the first record of a segment
to in most cases receiving the second record or after receiving
several records. Intentionally emitting a record spanning two or
more segments would work?


> +            if (oldFlushPtr != InvalidXLogRecPtr &&
> +                (restartLSN == InvalidXLogRecPtr ?
> +                 oldFlushPtr / XLOG_SEG_SIZE != flushPtr / XLOG_SEG_SIZE :
> +                 restartLSN / XLOG_BLCKSZ != flushPtr / XLOG_BLCKSZ))
> I find such code patterns not readable.

Yeah, I agree. I rewrote there and the result in the attached
patch is far cleaner than the blob.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Andres Freund

Date:

04 September 2017, 10:04:34

Hi,

On 2017-09-04 15:51:51 +0900, Kyotaro HORIGUCHI wrote:
>          SpinLockAcquire(&walsnd->mutex);
> +        oldFlushPtr = walsnd->flush;
>          walsnd->write = writePtr;
>          walsnd->flush = flushPtr;
>          walsnd->apply = applyPtr;
> @@ -1821,7 +1836,93 @@ ProcessStandbyReplyMessage(void)
>          if (SlotIsLogical(MyReplicationSlot))
>              LogicalConfirmReceivedLocation(flushPtr);
>          else
> -            PhysicalConfirmReceivedLocation(flushPtr);
> +        {
> +            /*
> +             * Recovery on standby requires that a continuation record is
> +             * available from single WAL source. For the reason, physical
> +             * replication slot should stay in the first segment of the
> +             * multiple segments that a continued record is spanning
> +             * over. Since we look pages and don't look into individual record
> +             * here, restartLSN may stay a bit too behind but it doesn't
> +             * matter.
> +             *
> +             * Since the objective is avoiding to remove required segments,
> +             * checking at the beginning of every segment is enough. But once
> +             * restartLSN goes behind, check every page for quick restoration.
> +             *
> +             * restartLSN has a valid value only when it is behind flushPtr.
> +             */
> +            bool    check_contrecords = false;
> +
> +            if (restartLSN != InvalidXLogRecPtr)
> +            {
> +                /*
> +                 * We're retarding restartLSN, check continuation records
> +                 * at every page boundary for quick restoration.
> +                 */
> +                if (restartLSN / XLOG_BLCKSZ != flushPtr / XLOG_BLCKSZ)
> +                    check_contrecords = true;
> +            }
> +            else if (oldFlushPtr != InvalidXLogRecPtr)
> +            {
> +                /*
> +                 * We're not retarding restartLSN , check continuation records
> +                 * only at segment boundaries. No need to check if
> +                 */
> +                if (oldFlushPtr / XLOG_SEG_SIZE != flushPtr / XLOG_SEG_SIZE)
> +                    check_contrecords = true;
> +            }
> +
> +            if (check_contrecords)
> +            {
> +                XLogRecPtr rp;
> +
> +                if (restartLSN == InvalidXLogRecPtr)
> +                    restartLSN = oldFlushPtr;
> +
> +                rp = restartLSN - (restartLSN % XLOG_BLCKSZ);
> +
> +                /*
> +                 * We may have let the record at flushPtr be sent, so it's
> +                 * worth looking
> +                 */
> +                while (rp <= flushPtr)
> +                {
> +                    XLogPageHeaderData header;
> +
> +                    /*
> +                     * If the page header is not available for now, don't move
> +                     * restartLSN forward. We can read it by the next chance.
> +                     */
> +                    if(sentPtr - rp >= sizeof(XLogPageHeaderData))
> +                    {
> +                        bool found;
> +                        /*
> +                         * Fetch the page header of the next page. Move
> +                         * restartLSN forward only if it is not a continuation
> +                         * page.
> +                         */
> +                        found = XLogRead((char *)&header, rp,
> +                                             sizeof(XLogPageHeaderData), true);
> +                        if (found &&
> +                            (header.xlp_info & XLP_FIRST_IS_CONTRECORD) == 0)
> +                            restartLSN = rp;
> +                    }
> +                    rp += XLOG_BLCKSZ;
> +                }
> +
> +                /*
> +                 * If restartLSN is on the same page with flushPtr, it means
> +                 * that we are catching up.
> +                 */
> +                if (restartLSN / XLOG_BLCKSZ == flushPtr / XLOG_BLCKSZ)
> +                    restartLSN = InvalidXLogRecPtr;
> +            }
> +
> +            /* restartLSN == InvalidXLogRecPtr means catching up */
> +            PhysicalConfirmReceivedLocation(restartLSN != InvalidXLogRecPtr ?
> +                                            restartLSN : flushPtr);
> +        }

I've not read through the thread, but this seems like the wrong approach
to me. The receiving side should use a correct value, instead of putting
this complexity on the sender's side.

Don't we also use the value on the receiving side to delete old segments
and such?

- Andres

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From

Michael Paquier

Date:

04 September 2017, 11:17:19

On Mon, Sep 4, 2017 at 4:04 PM, Andres Freund <andres@anarazel.de> wrote:
> I've not read through the thread, but this seems like the wrong approach
> to me. The receiving side should use a correct value, instead of putting
> this complexity on the sender's side.

Yes I agree with that. The current patch gives me a bad feeling to be
honest with the way it does things..
-- 
Michael

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Kyotaro HORIGUCHI

Date:

06 September 2017, 11:36:02

Hi,

At Mon, 4 Sep 2017 17:17:19 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqSPf0qkq=DhSO-tAM9++LSA2aEYSVJ3oY_EdUdb=jKi1w@mail.gmail.com>
> On Mon, Sep 4, 2017 at 4:04 PM, Andres Freund <andres@anarazel.de> wrote:
> > I've not read through the thread, but this seems like the wrong approach
> > to me. The receiving side should use a correct value, instead of putting
> > this complexity on the sender's side.
> 
> Yes I agree with that. The current patch gives me a bad feeling to be
> honest with the way it does things..

The problem is that the current ReadRecord needs the first one of
a series of continuation records from the same source with the
other part, the master in the case.

A (or the) solution closed in the standby side is allowing to
read a seris of continuation records from muliple sources. In
this case the first part from the standby's pg_wal and the second
part from the master via streaming replication.  ReadRecord
needed refactoring, (seems to me) breaking the concept of
XLogReader plug-in system to accomplish this behavior.

If it is preferable for you, I'll re-try that. Or hints for other
solutions are also welcome.

Is there any suggestions?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Andres Freund

Date:

06 September 2017, 22:23:53

On 2017-09-06 17:36:02 +0900, Kyotaro HORIGUCHI wrote:
> Hi,
> 
> At Mon, 4 Sep 2017 17:17:19 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqSPf0qkq=DhSO-tAM9++LSA2aEYSVJ3oY_EdUdb=jKi1w@mail.gmail.com>
> > On Mon, Sep 4, 2017 at 4:04 PM, Andres Freund <andres@anarazel.de> wrote:
> > > I've not read through the thread, but this seems like the wrong approach
> > > to me. The receiving side should use a correct value, instead of putting
> > > this complexity on the sender's side.
> > 
> > Yes I agree with that. The current patch gives me a bad feeling to be
> > honest with the way it does things..
> 
> The problem is that the current ReadRecord needs the first one of
> a series of continuation records from the same source with the
> other part, the master in the case.

What's the problem with that?  We can easily keep track of the beginning
of a record, and only confirm the address before that.


> A (or the) solution closed in the standby side is allowing to
> read a seris of continuation records from muliple sources.

I'm not following. All we need to use is the beginning of the relevant
records, that's easy enough to keep track of. We don't need to read the
WAL or anything.

- Andres

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Kyotaro HORIGUCHI

Date:

07 September 2017, 06:33:47

Hello,

At Wed, 6 Sep 2017 12:23:53 -0700, Andres Freund <andres@anarazel.de> wrote in
<20170906192353.ufp2dq7wm5fd6qa7@alap3.anarazel.de>
> On 2017-09-06 17:36:02 +0900, Kyotaro HORIGUCHI wrote:
> > The problem is that the current ReadRecord needs the first one of
> > a series of continuation records from the same source with the
> > other part, the master in the case.
> 
> What's the problem with that?  We can easily keep track of the beginning
> of a record, and only confirm the address before that.

After failure while reading a record locally, ReadRecored tries
streaming to read from the beginning of a record, which is not on
the master, then retry locally and.. This loops forever.

> > A (or the) solution closed in the standby side is allowing to
> > read a seris of continuation records from muliple sources.
> 
> I'm not following. All we need to use is the beginning of the relevant
> records, that's easy enough to keep track of. We don't need to read the
> WAL or anything.

The beginning is already tracked and nothing more to do. 

I reconsider that way and found that it doesn't need such
destructive refactoring.

The first *problem* was WaitForWALToBecomeAvaialble requests the
beginning of a record, which is not on the page the function has
been told to fetch. Still tliRecPtr is required to determine the
TLI to request, it should request RecPtr to be streamed.

The rest to do is let XLogPageRead retry other sources
immediately. To do this I made ValidXLogPageHeader@xlogreader.c
public (and renamed to XLogReaderValidatePageHeader).

The patch attached fixes the problem and passes recovery
tests. However, the test for this problem is not added. It needs
to go to the last page in a segment then put a record continues
to the next segment, then kill the standby after receiving the
previous segment but before receiving the whole record.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From

Michael Paquier

Date:

16 October 2017, 11:58:03

On Thu, Sep 7, 2017 at 12:33 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> At Wed, 6 Sep 2017 12:23:53 -0700, Andres Freund <andres@anarazel.de> wrote in
<20170906192353.ufp2dq7wm5fd6qa7@alap3.anarazel.de>
>> I'm not following. All we need to use is the beginning of the relevant
>> records, that's easy enough to keep track of. We don't need to read the
>> WAL or anything.
>
> The beginning is already tracked and nothing more to do.

I have finally allocated time to look at your newly-proposed patch,
sorry for the time it took. Patch 0002 forgot to include sys/stat.h to
allow the debugging tool to compile :)

> The first *problem* was WaitForWALToBecomeAvailable requests the
> beginning of a record, which is not on the page the function has
> been told to fetch. Still tliRecPtr is required to determine the
> TLI to request, it should request RecPtr to be streamed.

[...]

> The rest to do is let XLogPageRead retry other sources
> immediately. To do this I made ValidXLogPageHeader@xlogreader.c
> public (and renamed to XLogReaderValidatePageHeader).
>
> The patch attached fixes the problem and passes recovery
> tests. However, the test for this problem is not added. It needs
> to go to the last page in a segment then put a record continues
> to the next segment, then kill the standby after receiving the
> previous segment but before receiving the whole record.

+typedef struct XLogPageHeaderData *XLogPageHeader;
[...]
+/* Validate a page */
+extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
+                                       XLogRecPtr recptr, XLogPageHeader hdr);
Instead of exposing XLogPageHeaderData, I would recommend just using
char* and remove this declaration. The comment on top of
XLogReaderValidatePageHeader needs to make clear what caller should
provide.

+    if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
+                                      (XLogPageHeader) readBuf))
+        goto next_record_is_invalid;
+
[...]
-                            ptr = tliRecPtr;
+                            ptr = RecPtr;                            tli = tliOfPointInHistory(tliRecPtr,
expectedTLEs);
                            if (curFileTLI > 0 && tli < curFileTLI)
The core of the patch is here (the patch has no comment so it is hard
to understand what's the point of what is being done), and if I
understand that correctly, you allow the receiver to fetch the
portions of a record spawned across multiple segments from different
sources, and I am not sure that we'd want to break that promise.
Shouldn't we instead have the receiver side track the beginning of the
record and send that position for the physical slot's restart_lsn?
This way the receiver would retain WAL segments from the real
beginning of a record. restart_lsn for replication slots is set when
processing the standby message in ProcessStandbyReplyMessage() using
now the flush LSN, so a more correct value should be provided using
that. Andres, what's your take on the matter?
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Kyotaro HORIGUCHI

Date:

26 October 2017, 13:05:51

Hello. Thank you for looking this.

At Mon, 16 Oct 2017 17:58:03 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqR+J1Xw+yzfsrehiQ+rh3ac+n5sEUgP7UOQ4_ymFnO9wg@mail.gmail.com>
> On Thu, Sep 7, 2017 at 12:33 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > At Wed, 6 Sep 2017 12:23:53 -0700, Andres Freund <andres@anarazel.de> wrote in
<20170906192353.ufp2dq7wm5fd6qa7@alap3.anarazel.de>
> >> I'm not following. All we need to use is the beginning of the relevant
> >> records, that's easy enough to keep track of. We don't need to read the
> >> WAL or anything.
> >
> > The beginning is already tracked and nothing more to do.
> 
> I have finally allocated time to look at your newly-proposed patch,
> sorry for the time it took. Patch 0002 forgot to include sys/stat.h to
> allow the debugging tool to compile :)
> 
> > The first *problem* was WaitForWALToBecomeAvailable requests the
> > beginning of a record, which is not on the page the function has
> > been told to fetch. Still tliRecPtr is required to determine the
> > TLI to request, it should request RecPtr to be streamed.
> 
> [...]
> 
> > The rest to do is let XLogPageRead retry other sources
> > immediately. To do this I made ValidXLogPageHeader@xlogreader.c
> > public (and renamed to XLogReaderValidatePageHeader).
> >
> > The patch attached fixes the problem and passes recovery
> > tests. However, the test for this problem is not added. It needs
> > to go to the last page in a segment then put a record continues
> > to the next segment, then kill the standby after receiving the
> > previous segment but before receiving the whole record.
> 
> +typedef struct XLogPageHeaderData *XLogPageHeader;
> [...]
> +/* Validate a page */
> +extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
> +                                       XLogRecPtr recptr, XLogPageHeader hdr);
> Instead of exposing XLogPageHeaderData, I would recommend just using
> char* and remove this declaration. The comment on top of
> XLogReaderValidatePageHeader needs to make clear what caller should
> provide.

Seems reasonable. Added several lines in the comment for the
function.

> +    if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
> +                                      (XLogPageHeader) readBuf))
> +        goto next_record_is_invalid;
> +
> [...]
> -                            ptr = tliRecPtr;
> +                            ptr = RecPtr;
>                              tli = tliOfPointInHistory(tliRecPtr, expectedTLEs);
> 
>                              if (curFileTLI > 0 && tli < curFileTLI)
> The core of the patch is here (the patch has no comment so it is hard
> to understand what's the point of what is being done), and if I

Hmm, sorry. Added a brief comment there.

> understand that correctly, you allow the receiver to fetch the
> portions of a record spawned across multiple segments from different
> sources, and I am not sure that we'd want to break that promise.

We are allowing consecutive records at a segment boundary from
different sources are in the same series of xlog records. A
continuation records never spans over two TLIs but I might be
missing something here. (I found that an error message shows an
incorrect record pointer. The error message seems still be
useful.)

> Shouldn't we instead have the receiver side track the beginning of the
> record and send that position for the physical slot's restart_lsn?

The largest obstacle to do that is that walreceiver is not
utterly concerned to record internals. In other words, it doesn't
know what a record is. Teaching that introduces much complexity
and the complexity slows down the walreceiver.

Addition to that, this "problem" occurs also on replication
without a slot. The latest patch also help the case.

> This way the receiver would retain WAL segments from the real
> beginning of a record. restart_lsn for replication slots is set when
> processing the standby message in ProcessStandbyReplyMessage() using
> now the flush LSN, so a more correct value should be provided using
> that. Andres, what's your take on the matter?


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From b23c1d69ad86fcbb992cb21c604f587d82441001 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horiguchi.kyotaro@lab.ntt.co.jp>
Date: Thu, 7 Sep 2017 12:14:55 +0900
Subject: [PATCH 1/2] Allow switch WAL source midst of record.

The corrent recovery machinary assumes the whole of a record is
avaiable from single source. This prevents a standby from restarting
under a certain condition. This patch allows source switching during
reading a series of continuation records.
---src/backend/access/transam/xlog.c       | 14 ++++++++++++--src/backend/access/transam/xlogreader.c | 28
++++++++++++++++++----------src/include/access/xlogreader.h        |  4 ++++3 files changed, 34 insertions(+), 12
deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dd028a1..0d639ec 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11647,6 +11647,10 @@ retry:    Assert(reqLen <= readLen);    *readTLI = curFileTLI;
+
+    if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
+        goto next_record_is_invalid;
+    return readLen;next_record_is_invalid:
@@ -11781,12 +11785,18 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,                        }
                 else                        {
 
-                            ptr = tliRecPtr;
+                            /*
+                             * Trying from the current RecPtr, not from the
+                             * beginning of the current record. The record may
+                             * be no longer available from the master.
+                             */
+                            ptr = RecPtr;                            tli = tliOfPointInHistory(tliRecPtr,
expectedTLEs);                           if (curFileTLI > 0 && tli < curFileTLI)
elog(ERROR,"according to history file, WAL location %X/%X belongs to timeline %u, but previous recovered WAL file came
fromtimeline %u",
 
-                                     (uint32) (ptr >> 32), (uint32) ptr,
+                                     (uint32) (tliRecPtr >> 32),
+                                     (uint32) tliRecPtr,                                     tli, curFileTLI);
              }                        curFileTLI = tli;
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index b1f9b90..78a721a 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -27,8 +27,6 @@static bool allocate_recordbuf(XLogReaderState *state, uint32 reclength);
-static bool ValidXLogPageHeader(XLogReaderState *state, XLogRecPtr recptr,
-                    XLogPageHeader hdr);static bool ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
                  XLogRecPtr PrevRecPtr, XLogRecord *record, bool randAccess);static bool
ValidXLogRecord(XLogReaderState*state, XLogRecord *record,
 
@@ -533,7 +531,6 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)     */    if (targetSegNo
!=state->readSegNo && targetPageOff != 0)    {
 
-        XLogPageHeader hdr;        XLogRecPtr    targetSegmentPtr = pageptr - targetPageOff;        readLen =
state->read_page(state,targetSegmentPtr, XLOG_BLCKSZ,
 
@@ -545,9 +542,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)        /* we can be sure
tohave enough WAL available, we scrolled back */        Assert(readLen == XLOG_BLCKSZ);
 
-        hdr = (XLogPageHeader) state->readBuf;
-
-        if (!ValidXLogPageHeader(state, targetSegmentPtr, hdr))
+        if (!XLogReaderValidatePageHeader(state, targetSegmentPtr,
+                                          state->readBuf))            goto err;    }
@@ -584,7 +580,7 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)    /*     * Now that we
knowwe have the full header, validate it.     */
 
-    if (!ValidXLogPageHeader(state, pageptr, hdr))
+    if (!XLogReaderValidatePageHeader(state, pageptr, (char *) hdr))        goto err;    /* update read state
information*/
 
@@ -710,14 +706,26 @@ ValidXLogRecord(XLogReaderState *state, XLogRecord *record, XLogRecPtr recptr)/* * Validate a
pageheader
 
+ *
+ * Check if phdr is valid as the XLog page header of the XLog page of the page
+ * pointed by recptr.
+ *
+ * phdr is read as XLogPageHeaderData and check if it has valid magic, has no
+ * usused flag bits set and belongs to correct page pointed by recptr. The
+ * incorrect page address is detected usually when we read a page in a
+ * recycled segment.
+ *
+ * If it is the first page in a segment or detected rewind of state,
+ * additional checks are performed. */
-static bool
-ValidXLogPageHeader(XLogReaderState *state, XLogRecPtr recptr,
-                    XLogPageHeader hdr)
+bool
+XLogReaderValidatePageHeader(XLogReaderState *state, XLogRecPtr recptr,
+                             char *phdr){    XLogRecPtr    recaddr;    XLogSegNo    segno;    int32        offset;
+    XLogPageHeader hdr = (XLogPageHeader) phdr;    Assert((recptr % XLOG_BLCKSZ) == 0);
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 3a9ebd4..758d880 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -205,6 +205,10 @@ extern void XLogReaderFree(XLogReaderState *state);extern struct XLogRecord
*XLogReadRecord(XLogReaderState*state,               XLogRecPtr recptr, char **errormsg);
 
+/* Validate a page */
+extern bool XLogReaderValidatePageHeader(XLogReaderState *state,
+                    XLogRecPtr recptr, char *phdr);
+/* Invalidate read state */extern void XLogReaderInvalReadState(XLogReaderState *state);
-- 
2.9.2


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From

Michael Paquier

Date:

26 October 2017, 21:13:55

On Thu, Oct 26, 2017 at 3:05 AM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> The largest obstacle to do that is that walreceiver is not
> utterly concerned to record internals. In other words, it doesn't
> know what a record is. Teaching that introduces much complexity
> and the complexity slows down the walreceiver.
>
> Addition to that, this "problem" occurs also on replication
> without a slot. The latest patch also help the case.

That's why replication slots have been introduced to begin with. The
WAL receiver gives no guarantee that a segment will be retained or not
based on the beginning of a record. That's sad that the WAL receiver
does not track properly restart LSN and instead just uses the flush
LSN. I am beginning to think that a new message type used to report
the restart LSN when a replication slot is in use would just be a
better and more stable solution. I haven't looked at the
implementation details yet though.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From

Michael Paquier

Date:

29 November 2017, 07:57:48

On Fri, Oct 27, 2017 at 3:13 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Thu, Oct 26, 2017 at 3:05 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> The largest obstacle to do that is that walreceiver is not
>> utterly concerned to record internals. In other words, it doesn't
>> know what a record is. Teaching that introduces much complexity
>> and the complexity slows down the walreceiver.
>>
>> Addition to that, this "problem" occurs also on replication
>> without a slot. The latest patch also help the case.
>
> That's why replication slots have been introduced to begin with. The
> WAL receiver gives no guarantee that a segment will be retained or not
> based on the beginning of a record. That's sad that the WAL receiver
> does not track properly restart LSN and instead just uses the flush
> LSN. I am beginning to think that a new message type used to report
> the restart LSN when a replication slot is in use would just be a
> better and more stable solution. I haven't looked at the
> implementation details yet though.

Yeah, I am still on track with this idea. Splitting the flush LSN and
the restart LSN properly can allow for better handling on clients. For
now I am moving this patch to the next CF.
-- 
Michael

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Stephen Frost

Date:

17 January 2018, 15:38:32

Michael, Andres,

* Michael Paquier (michael.paquier@gmail.com) wrote:
> On Fri, Oct 27, 2017 at 3:13 AM, Michael Paquier
> <michael.paquier@gmail.com> wrote:
> > On Thu, Oct 26, 2017 at 3:05 AM, Kyotaro HORIGUCHI
> > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >> The largest obstacle to do that is that walreceiver is not
> >> utterly concerned to record internals. In other words, it doesn't
> >> know what a record is. Teaching that introduces much complexity
> >> and the complexity slows down the walreceiver.
> >>
> >> Addition to that, this "problem" occurs also on replication
> >> without a slot. The latest patch also help the case.
> >
> > That's why replication slots have been introduced to begin with. The
> > WAL receiver gives no guarantee that a segment will be retained or not
> > based on the beginning of a record. That's sad that the WAL receiver
> > does not track properly restart LSN and instead just uses the flush
> > LSN. I am beginning to think that a new message type used to report
> > the restart LSN when a replication slot is in use would just be a
> > better and more stable solution. I haven't looked at the
> > implementation details yet though.
>
> Yeah, I am still on track with this idea. Splitting the flush LSN and
> the restart LSN properly can allow for better handling on clients. For
> now I am moving this patch to the next CF.

Thanks for looking at this patch previously.  This patch is still in
Needs Review but it's not clear if that's correct or not, but this seems
to be a bug-fix, so it would be great if we could make some progress on
it (particularly since it was reported over a year ago and goes back to
9.4, according to the thread, from what I can tell).

Any chance one of you can provide some further thoughts on it or make it
clear if we are waiting for a new patch or if the existing one is still
reasonable and just needs to be reviewed (in which case, perhaps one of
you could help with that..)?

Thanks again!

Stephen

Attachment

signature.asc

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Michael Paquier

Date:

18 January 2018, 14:58:27

On Wed, Jan 17, 2018 at 07:38:32AM -0500, Stephen Frost wrote:
> Thanks for looking at this patch previously.  This patch is still in
> Needs Review but it's not clear if that's correct or not, but this seems
> to be a bug-fix, so it would be great if we could make some progress on
> it (particularly since it was reported over a year ago and goes back to
> 9.4, according to the thread, from what I can tell).
>
> Any chance one of you can provide some further thoughts on it or make it
> clear if we are waiting for a new patch or if the existing one is still
> reasonable and just needs to be reviewed (in which case, perhaps one of
> you could help with that..)?

Here is my input then with a summary regarding this thread. The base
problem here is that WAL segments are not retained on a primary when
using a replication slot if a record begins at a segment boundary. When
in recovery and trying to fetch a record from a source, the standby
would try to fetch a segment from its beginning, but if segments have
been recycled, then the segment could have been recycled, which breaks
the contract that the replication slot provides. If we want to get
things correctly addressed, I think that we want two things:
1) On the primary side, track correctly the beginning of a record so as
when a record runs across multiple segments, the slot does not recycle
them.
2) On the standby side, we need to track as well the beginning of the
record so as we don't risk recycling things because of restart points.

On this thread have been discussed a couple of approaches:
a) The first patch was doing things on the primary side by opening an
XLogReader which was in charge to check if the record we are looking at
was at a segment boundary. On top of being costly, this basically
achieved 1) but failed on 2).
b) The second patch that I would like to mention is doing things on the
standby side by being able to change a WAL source when fetching a single
record so as it is possible to fetch the beginning of a cross-segment
record from say an archive or the local xlog, and then request the rest
on the primary. This patch can be found in
https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp
and the diff in WaitForWALToBecomeAvailable() scares me a lot because,
it seems to me that this actually breaks timeline switch
consistency. The concept of switching a WAL source in the middle of a
WAL segment is risky anyway, because we perform sanity checks while
switching sources. The bug we are dealing about here is very rare, and
seeing a problem like that for a timeline switch is even more rare, but
the risk is not zero and we may finish with a corrupted instance, so we
should not in my opinion take this approach. Also this actually achieves
point 2) above, not completely though, but not 1).

An approach I have been thinking about to address points 1) and 2) is to
introduce a new messages type in the replication protocol which can be
used report as well at which position a replication slot should hold its
position. It would have been tempting to extend the existing standby
feedback message but we cannot afford a protocol break either. So my
proposal is to create a new replication message which can be used in the
context of a replication slot instead of the usual standby feedback
message, let's call it slot feedback message. This is similar to the
standby message, except that it reports as well the LSN the slot should
wait for. This results in an addition of... 8 bytes in the message which
is a lot. I have thoughts about the possibility to reduce it to 4 bytes
but one record can spawn across more than 4GB worth of segments, right?
(Note that replication slot data should updated with that instead of the
flush LSN).

Would love to hear thoughts from others. Of course, feel free to correct
me as needed.
--
Michael

Attachment

signature.asc

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Andres Freund

Date:

18 January 2018, 22:52:52

Hi,

On 2018-01-18 20:58:27 +0900, Michael Paquier wrote:
> On Wed, Jan 17, 2018 at 07:38:32AM -0500, Stephen Frost wrote:
> > Thanks for looking at this patch previously.  This patch is still in
> > Needs Review but it's not clear if that's correct or not, but this seems
> > to be a bug-fix, so it would be great if we could make some progress on
> > it (particularly since it was reported over a year ago and goes back to
> > 9.4, according to the thread, from what I can tell).
> > 
> > Any chance one of you can provide some further thoughts on it or make it
> > clear if we are waiting for a new patch or if the existing one is still
> > reasonable and just needs to be reviewed (in which case, perhaps one of
> > you could help with that..)?
> 
> Here is my input then with a summary regarding this thread. The base
> problem here is that WAL segments are not retained on a primary when
> using a replication slot if a record begins at a segment boundary.

> 1) On the primary side, track correctly the beginning of a record so as
> when a record runs across multiple segments, the slot does not recycle
> them.

I think that's a really bad idea. There's no bug on the sender side
here. The sender allows removal of WAL based on what the receiver
acks. This requires that all senders have low-level understanding on
what's being sent - not a good idea, as we e.g. imo want to have tools
that serve WAL over the streaming protocol from an archive.


> 2) On the standby side, we need to track as well the beginning of the
> record so as we don't risk recycling things because of restart points.


> On this thread have been discussed a couple of approaches:
> a) The first patch was doing things on the primary side by opening an
> XLogReader which was in charge to check if the record we are looking at
> was at a segment boundary. On top of being costly, this basically
> achieved 1) but failed on 2).

I am very very strongly against this approach.


> b) The second patch that I would like to mention is doing things on the
> standby side by being able to change a WAL source when fetching a single
> record so as it is possible to fetch the beginning of a cross-segment
> record from say an archive or the local xlog, and then request the rest
> on the primary. This patch can be found in
> https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp
> and the diff in WaitForWALToBecomeAvailable() scares me a lot because,
> it seems to me that this actually breaks timeline switch
> consistency. The concept of switching a WAL source in the middle of a
> WAL segment is risky anyway, because we perform sanity checks while
> switching sources. The bug we are dealing about here is very rare, and
> seeing a problem like that for a timeline switch is even more rare, but
> the risk is not zero and we may finish with a corrupted instance, so we
> should not in my opinion take this approach. Also this actually achieves
> point 2) above, not completely though, but not 1).

I don't agree with the sentiment about the approach, but I agree there
might be weaknesses in the implementation (which I have not reviewed). I
think it's perfectly sensible to require fetching one segment from
pg_xlog and the next one via another method. Currently the inability to
do so often leads to us constantly refetching the same segment.


> An approach I have been thinking about to address points 1) and 2) is to
> introduce a new messages type in the replication protocol which can be
> used report as well at which position a replication slot should hold its
> position. It would have been tempting to extend the existing standby
> feedback message but we cannot afford a protocol break either. So my
> proposal is to create a new replication message which can be used in the
> context of a replication slot instead of the usual standby feedback
> message, let's call it slot feedback message. This is similar to the
> standby message, except that it reports as well the LSN the slot should
> wait for. This results in an addition of... 8 bytes in the message which
> is a lot. I have thoughts about the possibility to reduce it to 4 bytes
> but one record can spawn across more than 4GB worth of segments, right?
> (Note that replication slot data should updated with that instead of the
> flush LSN).

-0.5, I think this just adds complexity instead of fixing the underlying
problem.

Greetings,

Andres Freund

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Kyotaro HORIGUCHI

Date:

19 January 2018, 04:54:53

Hello,

Thank you for the complrehensive explanation, Michael.

I happened to see another instance of this failure on one of our
client site. The precise steps lead to the situation is not
available but it is reported that it had occurred without
immediate shutdown. As far as I know just starting the standby
after pg_basebackup caused that.

(The standby fetches archives of the primary server so just
 swtichg wal segment could break the loop in the case but it's
 not always the case.)

At Thu, 18 Jan 2018 11:52:52 -0800, Andres Freund <andres@anarazel.de> wrote in
<20180118195252.hyxgkb3krcqjzfjm@alap3.anarazel.de>
> Hi,
> 
> On 2018-01-18 20:58:27 +0900, Michael Paquier wrote:
> > On Wed, Jan 17, 2018 at 07:38:32AM -0500, Stephen Frost wrote:
> > > Thanks for looking at this patch previously.  This patch is still in
> > > Needs Review but it's not clear if that's correct or not, but this seems
> > > to be a bug-fix, so it would be great if we could make some progress on
> > > it (particularly since it was reported over a year ago and goes back to
> > > 9.4, according to the thread, from what I can tell).
> > > 
> > > Any chance one of you can provide some further thoughts on it or make it
> > > clear if we are waiting for a new patch or if the existing one is still
> > > reasonable and just needs to be reviewed (in which case, perhaps one of
> > > you could help with that..)?
> > 
> > Here is my input then with a summary regarding this thread. The base
> > problem here is that WAL segments are not retained on a primary when
> > using a replication slot if a record begins at a segment boundary.
> 
> > 1) On the primary side, track correctly the beginning of a record so as
> > when a record runs across multiple segments, the slot does not recycle
> > them.
> 
> I think that's a really bad idea. There's no bug on the sender side
> here. The sender allows removal of WAL based on what the receiver
> acks. This requires that all senders have low-level understanding on
> what's being sent - not a good idea, as we e.g. imo want to have tools
> that serve WAL over the streaming protocol from an archive.
> 
> > 2) On the standby side, we need to track as well the beginning of the
> > record so as we don't risk recycling things because of restart points.
> 
> 
> > On this thread have been discussed a couple of approaches:
> > a) The first patch was doing things on the primary side by opening an
> > XLogReader which was in charge to check if the record we are looking at
> > was at a segment boundary. On top of being costly, this basically
> > achieved 1) but failed on 2).
> 
> I am very very strongly against this approach.
> 
> > b) The second patch that I would like to mention is doing things on the
> > standby side by being able to change a WAL source when fetching a single
> > record so as it is possible to fetch the beginning of a cross-segment
> > record from say an archive or the local xlog, and then request the rest
> > on the primary. This patch can be found in
> > https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp
> > and the diff in WaitForWALToBecomeAvailable() scares me a lot because,
> > it seems to me that this actually breaks timeline switch
> > consistency. The concept of switching a WAL source in the middle of a
> > WAL segment is risky anyway, because we perform sanity checks while
> > switching sources. The bug we are dealing about here is very rare, and
> > seeing a problem like that for a timeline switch is even more rare, but
> > the risk is not zero and we may finish with a corrupted instance, so we
> > should not in my opinion take this approach. Also this actually achieves
> > point 2) above, not completely though, but not 1).
> 
> I don't agree with the sentiment about the approach, but I agree there
> might be weaknesses in the implementation (which I have not reviewed). I
> think it's perfectly sensible to require fetching one segment from
> pg_xlog and the next one via another method. Currently the inability to
> do so often leads to us constantly refetching the same segment.

Though I'm still not fully confident, if reading a set of
continuation records from two (or more) sources can lead to
inconsistency, two successve (complete) records are facing the
same risk.

> > An approach I have been thinking about to address points 1) and 2) is to
> > introduce a new messages type in the replication protocol which can be
> > used report as well at which position a replication slot should hold its
> > position. It would have been tempting to extend the existing standby
> > feedback message but we cannot afford a protocol break either. So my
> > proposal is to create a new replication message which can be used in the
> > context of a replication slot instead of the usual standby feedback
> > message, let's call it slot feedback message. This is similar to the
> > standby message, except that it reports as well the LSN the slot should
> > wait for. This results in an addition of... 8 bytes in the message which
> > is a lot. I have thoughts about the possibility to reduce it to 4 bytes
> > but one record can spawn across more than 4GB worth of segments, right?
> > (Note that replication slot data should updated with that instead of the
> > flush LSN).
> 
> -0.5, I think this just adds complexity instead of fixing the underlying
> problem.

On the other hand if one logical record must be read from single
source, we need any means to deterrent wal-recycling on the
primary side. Conducting that within the primary side is rejected
by consensus. (There might be smarter way to do that, though.) To
do that from the standby side, just retarding write feedbacks or
this kind of special feedback would be needed. In any way it is
necessary to introduce WAL-record awareness into WAL shipping
process and it's seemingly large complexity.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Michael Paquier

Date:

19 January 2018, 12:24:56

On Fri, Jan 19, 2018 at 10:54:53AM +0900, Kyotaro HORIGUCHI wrote:
> On the other hand if one logical record must be read from single
> source, we need any means to deterrent wal-recycling on the
> primary side. Conducting that within the primary side is rejected
> by consensus.

There is this approach of extending the message protocol as well so as
the primary can retain the segments it needs to keep around...

> (There might be smarter way to do that, though.) To
> do that from the standby side, just retarding write feedbacks or
> this kind of special feedback would be needed. In any way it is
> necessary to introduce WAL-record awareness into WAL shipping
> process and it's seemingly large complexity.

Coming to logical slots, don't you need to be aware of the beginning of
the record on the primary to perform correctly decoding of tuples
through time travel? If the record is present across segments, it seems
to me that it matters. Andres knows the answer to that for sure, so I
would expect to be counter-argued in the next 12 hours or so.
--
Michael

Attachment

signature.asc

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Kyotaro HORIGUCHI

Date:

23 January 2018, 12:50:00

At Fri, 19 Jan 2018 18:24:56 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<20180119092456.GA1169@paquier.xyz>
> On Fri, Jan 19, 2018 at 10:54:53AM +0900, Kyotaro HORIGUCHI wrote:
> > On the other hand if one logical record must be read from single
> > source, we need any means to deterrent wal-recycling on the
> > primary side. Conducting that within the primary side is rejected
> > by consensus.
> 
> There is this approach of extending the message protocol as well so as
> the primary can retain the segments it needs to keep around...

I haven't seen it in detail but it seems meaning that it is done
by notifying something from the standby to the parimary not
knowing what is a WAL record on the standby side.

> > (There might be smarter way to do that, though.) To
> > do that from the standby side, just retarding write feedbacks or
> > this kind of special feedback would be needed. In any way it is
> > necessary to introduce WAL-record awareness into WAL shipping
> > process and it's seemingly large complexity.
> 
> Coming to logical slots, don't you need to be aware of the beginning of
> the record on the primary to perform correctly decoding of tuples
> through time travel? If the record is present across segments, it seems

I'm not sure what the time travel is, but the requried segments
seems kept safely on logical replication since the publisher
moves restart_lsn not based on finished commits LSN, but on the
beginning LSN of running transactions. In other words, logical
decoding doesn't need to track each record for the purpose of
keeping segments since it already keeping track of the beginning
of a transaction. Physical replication is totally unaware of a
WAL record, it just copies blobs in a convenient unit and only
cares LSN. But the ignorance seems required to keep performance.

> to me that it matters. Andres knows the answer to that for sure, so I
> would expect to be counter-argued in the next 12 hours or so.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Kyotaro HORIGUCHI

Date:

09 April 2018, 07:26:54

Hello, I added this as Older Bugs in Open items. (I believe it's legit.)

The latest patch still applies on the HEAD with some offsets.

At Tue, 23 Jan 2018 18:50:00 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20180123.185000.232069302.horiguchi.kyotaro@lab.ntt.co.jp>
> At Fri, 19 Jan 2018 18:24:56 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<20180119092456.GA1169@paquier.xyz>
> > On Fri, Jan 19, 2018 at 10:54:53AM +0900, Kyotaro HORIGUCHI wrote:
> > > On the other hand if one logical record must be read from single
> > > source, we need any means to deterrent wal-recycling on the
> > > primary side. Conducting that within the primary side is rejected
> > > by consensus.
> > 
> > There is this approach of extending the message protocol as well so as
> > the primary can retain the segments it needs to keep around...
> 
> I haven't seen it in detail but it seems meaning that it is done
> by notifying something from the standby to the parimary not
> knowing what is a WAL record on the standby side.
> 
> > > (There might be smarter way to do that, though.) To
> > > do that from the standby side, just retarding write feedbacks or
> > > this kind of special feedback would be needed. In any way it is
> > > necessary to introduce WAL-record awareness into WAL shipping
> > > process and it's seemingly large complexity.
> > 
> > Coming to logical slots, don't you need to be aware of the beginning of
> > the record on the primary to perform correctly decoding of tuples
> > through time travel? If the record is present across segments, it seems
> 
> I'm not sure what the time travel is, but the requried segments
> seems kept safely on logical replication since the publisher
> moves restart_lsn not based on finished commits LSN, but on the
> beginning LSN of running transactions. In other words, logical
> decoding doesn't need to track each record for the purpose of
> keeping segments since it already keeping track of the beginning
> of a transaction. Physical replication is totally unaware of a
> WAL record, it just copies blobs in a convenient unit and only
> cares LSN. But the ignorance seems required to keep performance.
> 
> > to me that it matters. Andres knows the answer to that for sure, so I
> > would expect to be counter-argued in the next 12 hours or so.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Michael Paquier

Date:

09 April 2018, 07:59:45

On Mon, Apr 09, 2018 at 01:26:54PM +0900, Kyotaro HORIGUCHI wrote:
> Hello, I added this as Older Bugs in Open items. (I believe it's
> legit.)

Yes, I think that's adapted.  I am hesitating to do the same thing with
all the other patches marked as bug fixes.
--
Michael

Attachment

signature.asc

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Kyotaro HORIGUCHI

Date:

09 April 2018, 08:19:24

At Mon, 9 Apr 2018 13:59:45 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20180409045945.GB1740@paquier.xyz>
> On Mon, Apr 09, 2018 at 01:26:54PM +0900, Kyotaro HORIGUCHI wrote:
> > Hello, I added this as Older Bugs in Open items. (I believe it's
> > legit.)
> 
> Yes, I think that's adapted.  I am hesitating to do the same thing with
> all the other patches marked as bug fixes.

Thanks. I'm also not going to do so on the other "bug fix"es. The
reasons for this patch are I myself recently had a suspected case
(on a costomer site) and the size is not so large. (This doesn't
mean it is easy, of couse..)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Heikki Linnakangas

Date:

23 April 2018, 10:41:47

On 18/01/18 20:54, Kyotaro HORIGUCHI wrote:
> At Thu, 18 Jan 2018 11:52:52 -0800, Andres Freund <andres@anarazel.de> wrote in
<20180118195252.hyxgkb3krcqjzfjm@alap3.anarazel.de>
>> On 2018-01-18 20:58:27 +0900, Michael Paquier wrote:
>>> b) The second patch that I would like to mention is doing things on the
>>> standby side by being able to change a WAL source when fetching a single
>>> record so as it is possible to fetch the beginning of a cross-segment
>>> record from say an archive or the local xlog, and then request the rest
>>> on the primary. This patch can be found in
>>> https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp
>>> and the diff in WaitForWALToBecomeAvailable() scares me a lot because,
>>> it seems to me that this actually breaks timeline switch
>>> consistency. The concept of switching a WAL source in the middle of a
>>> WAL segment is risky anyway, because we perform sanity checks while
>>> switching sources. The bug we are dealing about here is very rare, and
>>> seeing a problem like that for a timeline switch is even more rare, but
>>> the risk is not zero and we may finish with a corrupted instance, so we
>>> should not in my opinion take this approach. Also this actually achieves
>>> point 2) above, not completely though, but not 1).
>>
>> I don't agree with the sentiment about the approach, but I agree there
>> might be weaknesses in the implementation (which I have not reviewed). I
>> think it's perfectly sensible to require fetching one segment from
>> pg_xlog and the next one via another method. Currently the inability to
>> do so often leads to us constantly refetching the same segment.
> 
> Though I'm still not fully confident, if reading a set of
> continuation records from two (or more) sources can lead to
> inconsistency, two successve (complete) records are facing the
> same risk.

This seems like the best approach to me as well.

Looking at the patch linked above 
(https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp):

> --- a/src/backend/access/transam/xlog.c
> +++ b/src/backend/access/transam/xlog.c
> @@ -11693,6 +11693,10 @@ retry:
>      Assert(reqLen <= readLen);
>  
>      *readTLI = curFileTLI;
> +
> +    if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr, readBuf))
> +        goto next_record_is_invalid;
> +
>      return readLen;
>  
>  next_record_is_invalid:

What is the point of adding this XLogReaderValidatePageHeader() call? 
The caller of this callback function, ReadPageInternal(), checks the 
page header anyway. Earlier in this thread, you said:

> The rest to do is let XLogPageRead retry other sources
> immediately. To do this I made ValidXLogPageHeader(at)xlogreader(dot)c
> public (and renamed to XLogReaderValidatePageHeader).

but I don't understand that.

- Heikki

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Kyotaro HORIGUCHI

Date:

24 April 2018, 13:57:12

Thank you very much for looking this!

At Mon, 23 Apr 2018 03:41:47 -0400, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<89e33d4f-5c75-0738-3dcb-44c4df59e920@iki.fi>
> On 18/01/18 20:54, Kyotaro HORIGUCHI wrote:
> > At Thu, 18 Jan 2018 11:52:52 -0800, Andres Freund <andres@anarazel.de>
> > wrote in <20180118195252.hyxgkb3krcqjzfjm@alap3.anarazel.de>
> >> On 2018-01-18 20:58:27 +0900, Michael Paquier wrote:
> >>> b) The second patch that I would like to mention is doing things on
> >>> the
> >>> standby side by being able to change a WAL source when fetching a
> >>> single
> >>> record so as it is possible to fetch the beginning of a cross-segment
> >>> record from say an archive or the local xlog, and then request the
> >>> rest
> >>> on the primary. This patch can be found in
> >>> https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp
> >>> and the diff in WaitForWALToBecomeAvailable() scares me a lot because,
> >>> it seems to me that this actually breaks timeline switch
> >>> consistency. The concept of switching a WAL source in the middle of a
> >>> WAL segment is risky anyway, because we perform sanity checks while
> >>> switching sources. The bug we are dealing about here is very rare, and
> >>> seeing a problem like that for a timeline switch is even more rare,
> >>> but
> >>> the risk is not zero and we may finish with a corrupted instance, so
> >>> we
> >>> should not in my opinion take this approach. Also this actually
> >>> achieves
> >>> point 2) above, not completely though, but not 1).
> >>
> >> I don't agree with the sentiment about the approach, but I agree there
> >> might be weaknesses in the implementation (which I have not
> >> reviewed). I
> >> think it's perfectly sensible to require fetching one segment from
> >> pg_xlog and the next one via another method. Currently the inability
> >> to
> >> do so often leads to us constantly refetching the same segment.
> > Though I'm still not fully confident, if reading a set of
> > continuation records from two (or more) sources can lead to
> > inconsistency, two successve (complete) records are facing the
> > same risk.
> 
> This seems like the best approach to me as well.
> 
> Looking at the patch linked above
> (https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp):
> 
> > --- a/src/backend/access/transam/xlog.c
> > +++ b/src/backend/access/transam/xlog.c
> > @@ -11693,6 +11693,10 @@ retry:
> >      Assert(reqLen <= readLen);
> >       *readTLI = curFileTLI;
> > +
> > + if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
> > readBuf))
> > +        goto next_record_is_invalid;
> > +
> >      return readLen;
> >   next_record_is_invalid:
> 
> What is the point of adding this XLogReaderValidatePageHeader() call?
> The caller of this callback function, ReadPageInternal(), checks the
> page header anyway. Earlier in this thread, you said:

Without the lines, server actually fails to start replication.

(I try to remember the details...)

The difference is whether the function can cause retry for the
same portion of a set of continued records (without changing the
plugin API). XLogPageRead can do that. On the other hand all
ReadPageInternal can do is just letting the caller ReadRecords
retry from the beginning of the set of continued records since
the caller side handles only complete records.

In the failure case, in XLogPageRead, when read() fails, it can
try the next source midst of a continued records. On the other
hand if the segment was read but it was recycled one, it passes
"success" to ReadPageInternal and leads to retring from the
beginning of the recrod. Infinite loop.

Calling XLogReaderValidatePageHeader in ReadPageInternal is
redundant, but removing it may be interface change of xlogreader
plugin and I am not sure that is allowed.

> > The rest to do is let XLogPageRead retry other sources
> > immediately. To do this I made ValidXLogPageHeader(at)xlogreader(dot)c
> > public (and renamed to XLogReaderValidatePageHeader).
> 
> but I don't understand that.

It is exposed so that XLogPageRead can validate the page header
using it.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Heikki Linnakangas

Date:

04 May 2018, 10:05:41

On 24/04/18 13:57, Kyotaro HORIGUCHI wrote:
> At Mon, 23 Apr 2018 03:41:47 -0400, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<89e33d4f-5c75-0738-3dcb-44c4df59e920@iki.fi>
>> Looking at the patch linked above
>> (https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp):
>>
>>> --- a/src/backend/access/transam/xlog.c
>>> +++ b/src/backend/access/transam/xlog.c
>>> @@ -11693,6 +11693,10 @@ retry:
>>>       Assert(reqLen <= readLen);
>>>        *readTLI = curFileTLI;
>>> +
>>> + if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
>>> readBuf))
>>> +        goto next_record_is_invalid;
>>> +
>>>       return readLen;
>>>    next_record_is_invalid:
>>
>> What is the point of adding this XLogReaderValidatePageHeader() call?
>> The caller of this callback function, ReadPageInternal(), checks the
>> page header anyway. Earlier in this thread, you said:
> 
> Without the lines, server actually fails to start replication.
> 
> (I try to remember the details...)
> 
> The difference is whether the function can cause retry for the
> same portion of a set of continued records (without changing the
> plugin API). XLogPageRead can do that. On the other hand all
> ReadPageInternal can do is just letting the caller ReadRecords
> retry from the beginning of the set of continued records since
> the caller side handles only complete records.
> 
> In the failure case, in XLogPageRead, when read() fails, it can
> try the next source midst of a continued records. On the other
> hand if the segment was read but it was recycled one, it passes
> "success" to ReadPageInternal and leads to retring from the
> beginning of the recrod. Infinite loop.

I see. You have the same problem if you have a WAL file that's corrupt 
in some other way, but one of the sources would have a correct copy. 
ValidXLogPageHeader() only checks the page header, after all. But unlike 
a recycled WAL segment, that's not supposed to happen as part of normal 
operation, so I guess we can live with that.

> Calling XLogReaderValidatePageHeader in ReadPageInternal is
> redundant, but removing it may be interface change of xlogreader
> plugin and I am not sure that is allowed.

We should avoid doing that in back-branches, at least. But in 'master', 
I wouldn't mind redesigning the API. Dealing with all the retrying is 
pretty complicated as it is, if we can simplify that somehow, I think 
that'd be good.

- Heikki

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Heikki Linnakangas

Date:

05 May 2018, 01:49:31

On 04/05/18 10:05, Heikki Linnakangas wrote:
> On 24/04/18 13:57, Kyotaro HORIGUCHI wrote:
>> At Mon, 23 Apr 2018 03:41:47 -0400, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<89e33d4f-5c75-0738-3dcb-44c4df59e920@iki.fi>
>>> Looking at the patch linked above
>>> (https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp):
>>>
>>>> --- a/src/backend/access/transam/xlog.c
>>>> +++ b/src/backend/access/transam/xlog.c
>>>> @@ -11693,6 +11693,10 @@ retry:
>>>>        Assert(reqLen <= readLen);
>>>>         *readTLI = curFileTLI;
>>>> +
>>>> + if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
>>>> readBuf))
>>>> +        goto next_record_is_invalid;
>>>> +
>>>>        return readLen;
>>>>     next_record_is_invalid:
>>>
>>> What is the point of adding this XLogReaderValidatePageHeader() call?
>>> The caller of this callback function, ReadPageInternal(), checks the
>>> page header anyway. Earlier in this thread, you said:
>>
>> Without the lines, server actually fails to start replication.
>>
>> (I try to remember the details...)
>>
>> The difference is whether the function can cause retry for the
>> same portion of a set of continued records (without changing the
>> plugin API). XLogPageRead can do that. On the other hand all
>> ReadPageInternal can do is just letting the caller ReadRecords
>> retry from the beginning of the set of continued records since
>> the caller side handles only complete records.
>>
>> In the failure case, in XLogPageRead, when read() fails, it can
>> try the next source midst of a continued records. On the other
>> hand if the segment was read but it was recycled one, it passes
>> "success" to ReadPageInternal and leads to retring from the
>> beginning of the recrod. Infinite loop.
> 
> I see. You have the same problem if you have a WAL file that's corrupt
> in some other way, but one of the sources would have a correct copy.
> ValidXLogPageHeader() only checks the page header, after all. But unlike
> a recycled WAL segment, that's not supposed to happen as part of normal
> operation, so I guess we can live with that.

Pushed this now, after adding some comments. Thanks!

>> Calling XLogReaderValidatePageHeader in ReadPageInternal is
>> redundant, but removing it may be interface change of xlogreader
>> plugin and I am not sure that is allowed.
> 
> We should avoid doing that in back-branches, at least. But in 'master',
> I wouldn't mind redesigning the API. Dealing with all the retrying is
> pretty complicated as it is, if we can simplify that somehow, I think
> that'd be good.

I spent some time musing on what a better API might look like. We could 
remove the ReadPage callback, and instead have XLogReadRecord return a 
special return code to mean "give me more data". I'm thinking of 
something like:

/* return code of XLogReadRecord() */
typedef enum
{
     XLREAD_SUCCESS,
     XLREAD_INVALID_RECORD,  /* a record was read, but it was corrupt */
     XLREAD_INVALID_PAGE,    /* the page that was supplied looks invalid. */
     XLREAD_NEED_DATA,       /* caller should place more data in buffer, 
and retry */
} XLogReadRecord_Result;


And the calls to XLogReadRecord() would look something like this:

for(;;)
{
     rc = XLogReadRecord(reader, startptr, errormsg);

     if (rc == XLREAD_SUCCESS)
     {
         /* great, got record */
     }
     if (rc == XLREAD_INVALID_PAGE || XLREAD_INVALID_RECORD)
     {
         elog(ERROR, "invalid record");
     }
     if (rc == XLREAD_NEED_DATA)
     {
         /*
          * Read a page from disk, and place it into reader->readBuf
          */
         XLogPageRead(reader->readPagePtr, /* page to read */
                      reader->reqLen       /* # of bytes to read */ );
         /*
          * Now that we have read the data that XLogReadRecord()
          * requested, call it again.
          */
         continue;
     }
}

So instead of having a callback, XLogReadRecord() would return 
XLREAD_NEED_DATA. The caller would then need to place that data into the 
buffer, and call it again. If a record spans multiple pages, 
XLogReadRecord() would return with XLREAD_NEED_DATA multiple times, to 
read each page.

The important difference for the bug we're discussing on this thread is 
is that if you passed an invalid page to XLogReadRecord(), it would 
return with XLREAD_INVALID_PAGE. You could then try reading the same 
page from a different source, and call XLogReadRecord() again, and it 
could continue where it was left off, even if it was in the middle of a 
continuation record.

This is clearly not backpatchable, but maybe something to think about 
for v12.

- Heikki

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least9.5)?

From

Kyotaro HORIGUCHI

Date:

14 May 2018, 09:59:13

Thank you for adding the detailed comment and commiting.

At Sat, 5 May 2018 01:49:31 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote in
<47215279-228d-f30d-35d1-16af695e53f3@iki.fi>
> On 04/05/18 10:05, Heikki Linnakangas wrote:
> > On 24/04/18 13:57, Kyotaro HORIGUCHI wrote:
> >> At Mon, 23 Apr 2018 03:41:47 -0400, Heikki Linnakangas
> >> <hlinnaka@iki.fi> wrote in
> >> <89e33d4f-5c75-0738-3dcb-44c4df59e920@iki.fi>
> >>> Looking at the patch linked above
> >>> (https://www.postgresql.org/message-id/20171026.190551.208996945.horiguchi.kyotaro%40lab.ntt.co.jp):
> >>>
> >>>> --- a/src/backend/access/transam/xlog.c
> >>>> +++ b/src/backend/access/transam/xlog.c
> >>>> @@ -11693,6 +11693,10 @@ retry:
> >>>>        Assert(reqLen <= readLen);
> >>>>         *readTLI = curFileTLI;
> >>>> +
> >>>> + if (!XLogReaderValidatePageHeader(xlogreader, targetPagePtr,
> >>>> readBuf))
> >>>> +        goto next_record_is_invalid;
> >>>> +
> >>>>        return readLen;
> >>>>     next_record_is_invalid:
> >>>
> >>> What is the point of adding this XLogReaderValidatePageHeader() call?
> >>> The caller of this callback function, ReadPageInternal(), checks the
> >>> page header anyway. Earlier in this thread, you said:
> >>
> >> Without the lines, server actually fails to start replication.
> >>
> >> (I try to remember the details...)
> >>
> >> The difference is whether the function can cause retry for the
> >> same portion of a set of continued records (without changing the
> >> plugin API). XLogPageRead can do that. On the other hand all
> >> ReadPageInternal can do is just letting the caller ReadRecords
> >> retry from the beginning of the set of continued records since
> >> the caller side handles only complete records.
> >>
> >> In the failure case, in XLogPageRead, when read() fails, it can
> >> try the next source midst of a continued records. On the other
> >> hand if the segment was read but it was recycled one, it passes
> >> "success" to ReadPageInternal and leads to retring from the
> >> beginning of the recrod. Infinite loop.
> > I see. You have the same problem if you have a WAL file that's corrupt
> > in some other way, but one of the sources would have a correct copy.
> > ValidXLogPageHeader() only checks the page header, after all. But
> > unlike
> > a recycled WAL segment, that's not supposed to happen as part of
> > normal
> > operation, so I guess we can live with that.

Anyway we read successive complete records from different sources
so I think if such curruption causes a problem we would have
faced the same problem even without this.

> Pushed this now, after adding some comments. Thanks!

Thanks!

> >> Calling XLogReaderValidatePageHeader in ReadPageInternal is
> >> redundant, but removing it may be interface change of xlogreader
> >> plugin and I am not sure that is allowed.
> > We should avoid doing that in back-branches, at least. But in
> > 'master',
> > I wouldn't mind redesigning the API. Dealing with all the retrying is
> > pretty complicated as it is, if we can simplify that somehow, I think
> > that'd be good.

Agreed.

> I spent some time musing on what a better API might look like. We
> could remove the ReadPage callback, and instead have XLogReadRecord
> return a special return code to mean "give me more data". I'm thinking
> of something like:

Sounds reasonable. That makes an additional return/call iteration
to XLogReadRecord but it would be ignorable comparing to the cost
of XLogPageRead.

> /* return code of XLogReadRecord() */
> typedef enum
> {
>     XLREAD_SUCCESS,
>     XLREAD_INVALID_RECORD,  /* a record was read, but it was corrupt */
>     XLREAD_INVALID_PAGE, /* the page that was supplied looks invalid. */
>     XLREAD_NEED_DATA, /* caller should place more data in buffer, and
>     retry */
> } XLogReadRecord_Result;
> 
> 
> And the calls to XLogReadRecord() would look something like this:
> 
> for(;;)
> {
>     rc = XLogReadRecord(reader, startptr, errormsg);
> 
>     if (rc == XLREAD_SUCCESS)
>     {
>         /* great, got record */
>     }
>     if (rc == XLREAD_INVALID_PAGE || XLREAD_INVALID_RECORD)
>     {
>         elog(ERROR, "invalid record");
>     }
>     if (rc == XLREAD_NEED_DATA)
>     {
>         /*
>          * Read a page from disk, and place it into reader->readBuf
>          */
>         XLogPageRead(reader->readPagePtr, /* page to read */
>                      reader->reqLen       /* # of bytes to read */ );
>         /*
>          * Now that we have read the data that XLogReadRecord()
>          * requested, call it again.
>          */
>         continue;
>     }
> }
> 
> So instead of having a callback, XLogReadRecord() would return
> XLREAD_NEED_DATA. The caller would then need to place that data into
> the buffer, and call it again. If a record spans multiple pages,
> XLogReadRecord() would return with XLREAD_NEED_DATA multiple times, to
> read each page.

It seems easier to understand at a glance. In short, I like it.

> The important difference for the bug we're discussing on this thread
> is is that if you passed an invalid page to XLogReadRecord(), it would
> return with XLREAD_INVALID_PAGE. You could then try reading the same
> page from a different source, and call XLogReadRecord() again, and it
> could continue where it was left off, even if it was in the middle of
> a continuation record.

We will have more control on how to continue reading WAL with
this than the current code and it seems preferable as the whole.

> This is clearly not backpatchable, but maybe something to think about
> for v12.

Are you planning to working on this shortly?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center