Thread: Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Kyotaro HORIGUCHI
Date:
Hello. I added pgsql-hackers.

This occurs also on git master and back to 9.4.

At Fri, 13 Jan 2017 08:47:06 -0600, Jonathon Nelson <jdnelson@dyn.com> wrote in
<CACJqAM1ydcZcd5DoCp+y5hkWto1ZeGW+Mj8UK7avqctbGJO8Bw@mail.gmail.com>
> On Mon, Nov 28, 2016 at 1:39 PM, Jonathon Nelson <jdnelson@dyn.com> wrote:
> > First, the postgresql configuration differs only minimally from the stock
> > config:
> >
> > Assume wal_keep_segments = 0.
> > Assume the use of physical replication slots.
> > Assume one master, one standby.
> >
> > Lastly, we have observed the behavior "in the wild" at least twice and in
> > the lab a dozen or so times.
> >
> > EXAMPLE #1 (logs are from the replica):
> >
> > user=,db=,app=,client= DEBUG:  creating and filling new WAL file
> > user=,db=,app=,client= DEBUG:  done creating and filling new WAL file
> > user=,db=,app=,client= DEBUG:  sending write 6/8B000000 flush 6/8A000000
> > apply 5/748425A0
> > user=,db=,app=,client= DEBUG:  sending write 6/8B000000 flush 6/8B000000
> > apply 5/74843020
> > <control-c here>
> > user=,db=,app=,client= DEBUG:  postmaster received signal 2
> > user=,db=,app=,client= LOG:  received fast shutdown request
> > user=,db=,app=,client= LOG:  aborting any active transactions
> >
> > And, upon restart:
> >
> > user=,db=,app=,client= LOG:  restartpoint starting: xlog
> > user=,db=,app=,client= DEBUG:  updated min recovery point to 6/67002390 on
> > timeline 1
> > user=,db=,app=,client= DEBUG:  performing replication slot checkpoint
> > user=,db=,app=,client= DEBUG:  updated min recovery point to 6/671768C0 on
> > timeline 1
> > user=,db=,app=,client= CONTEXT:  writing block 589 of relation
> > base/13294/16501
> > user=,db=,app=,client= LOG:  invalid magic number 0000 in log segment
> > 00000001000000060000008B, offset 0
> > user=,db=,app=,client= DEBUG:  switched WAL source from archive to stream
> > after failure
> > user=,db=,app=,client= LOG:  started streaming WAL from primary at
> > 6/8A000000 on timeline 1
> > user=,db=,app=,client= FATAL:  could not receive data from WAL stream:
> > ERROR:  requested WAL segment 00000001000000060000008A has already been
> > removed

I managed to reproduce this. A little tweak (the first attached patch)
makes the standby kill itself as soon as walreceiver sees a contrecord
at the beginning of a segment.

- M(aster): createdb as a master with wal_keep_segments = 0
            (default), min_log_messages = debug2
- M: Create a physical repslot.
- S(tandby): Setup a standby database.
- S: Edit recovery.conf to use the replication slot above then
     start it.
- S: touch /tmp/hoge
- M: Run pgbench ...
- S: After a while, the standby stops.
  > LOG:  #################### STOP THE SERVER

- M: Stop pgbench.
- M: Do 'checkpoint;' twice.
- S: rm /tmp/hoge
- S: Fails to catch up with the following error.

  > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000002B has already been removed


This problem occurs when only the earlier part of a continued record
has been replicated and then the segment is removed on the master. In
other words, the first half exists only on the standby, and the second
half only on the master.

I believe that a continuation record cannot span three or more
*segments* (is that right?), so keeping one spare segment would be
enough. The attached second patch does this.
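
For illustration only (not part of the patch), a minimal standalone
sketch of the segment arithmetic behind the second patch, assuming the
default 16MB segment size (wal_segment_size and lsn_to_segno below are
made-up stand-ins for XLOG_SEG_SIZE and XLByteToSeg()):

#include <stdio.h>
#include <stdint.h>

/* Assumed default WAL segment size: 16MB. */
static const uint64_t wal_segment_size = 16 * 1024 * 1024;

/* Hypothetical equivalent of XLByteToSeg(): LSN -> segment number. */
static uint64_t
lsn_to_segno(uint64_t lsn)
{
    return lsn / wal_segment_size;
}

int
main(void)
{
    /* restart_lsn just after the last transferred record, e.g. 6/8B000000 */
    uint64_t    keep = (6ULL << 32) | 0x8B000000;

    /*
     * Keeping from segno(keep) would allow segment ...8A, which still
     * holds the first half of the continued record, to be removed.
     * Using keep - 1 retains one extra segment when keep sits exactly
     * on a segment boundary.
     */
    printf("segno(keep)     = %llX\n", (unsigned long long) lsn_to_segno(keep));
    printf("segno(keep - 1) = %llX\n", (unsigned long long) lsn_to_segno(keep - 1));
    return 0;
}

(It prints 68B and 68A, i.e. one segment further back.)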


Other possible measures might be,

- Allowing switching of the WAL source while reading a continuation
  record. Currently ReadRecord assumes that a continuation record can
  be read from a single source. But this needs refactoring involving
  xlog.c, xlogreader.c and relatives.

- Delaying recycling of a segment until the last partial record on it
  completes. This seems doable page-wise (coarse resolution) but would
  cost additional reading of past xlog files (the page header of past
  pages is required).

- Delaying write/flush feedback until the current record is completed.
  walreceiver is not conscious of WAL record boundaries, and this might
  break synchronous replication.

Any thoughts?


=========================================
> > A physical analysis shows that the WAL file 00000001000000060000008B is
> > 100% zeroes (ASCII NUL).

I suppose this is on the standby, so that segment file is the one into
which the next transferred record will be written.

> > The results of querying pg_replication_slots shows a restart_lsn that
> > matches ….6/8B.

It is the beginning of the next record to be replicated, as
documented. In other words, it is just after the last transferred
record (including padding).

> > Pg_controldata shows values like:
> > Minimum recovery ending location:     6/8Axxxxxx

It is the beginning of the last applied record.

> > How can the master show a position that is greater than the minimum
> > recovery ending location?

So it is natural that the former is larger than the latter.

> > EXAMPLE #2:
> >
> > Minimum recovery ending location:     19DD/73FFFFE0
> > Log segment 00000001000019DD00000073 was not available.
> > The restart LSN was 19DD/74000000.
> > The last few lines from pg_xlogdump 00000001000019DD00000073:
> >
> >
> > rmgr: Btree       len (rec/tot):      2/    64, tx:      77257, lsn:
> > 19DD/73FFFF60, prev 19DD/73FFFF20, desc: INSERT_LEAF off 132, blkref #0:
> > rel 1663/16403/150017028 blk 1832
> > rmgr: Btree       len (rec/tot):      2/    64, tx:      77257, lsn:
> > 19DD/73FFFFA0, prev 19DD/73FFFF60, desc: INSERT_LEAF off 206, blkref #0:
> > rel 1663/16403/150017028 blk 11709
> >
> > If I'm understanding this properly, (0x73FFFFA0 - 0x73000000) is the first
> > byte of the last record in this file, and the record length is 64 bytes
> > which places the first byte of the next record at: 16777184 (0xffffe0)
> > (logical position 0x73ffffe0: this jives with pg_controldata).

Maybe right. pg_xlogdump skips partial records.

> > However, there are only 32 bytes of file left:
> > 0x73FFFFA0 - 0x73000000 + 64 -=> 16777184
> > 16777216 - 16777184 -=> 32
> >
> > Which means that the next record is in the WAL file
> > 00000001000019DD00000074.

Maybe right.
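
Just to double-check the arithmetic quoted above, a tiny standalone
sketch (assuming the default 16MB segment size):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    const uint64_t seg_size = 16 * 1024 * 1024;      /* 0x1000000 */
    uint64_t    last_rec = 0x73FFFFA0;               /* 19DD/73FFFFA0 */
    uint64_t    rec_len = 64;                        /* total record length */
    uint64_t    seg_start = 0x73000000;
    uint64_t    next_off = last_rec + rec_len - seg_start;

    printf("offset of next record: %llu\n", (unsigned long long) next_off);
    printf("bytes left in segment: %llu\n",
           (unsigned long long) (seg_size - next_off));
    return 0;
}

It prints 16777184 and 32, matching the numbers quoted above.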

> > A possible theory:
> >
> > Let us assume PG has applied 100% of the data in a given WAL file, and
> > let’s assume (as in this case) that the WAL file is
> > 00000001000019DD00000073.  When it starts up again, it uses the control
> > data to start and say “The next record is at 19DD/0x73ffffe0"  which it
> > truncates to 0x73000000.  However, PG has *also* already told the master
> > that is has fully received, written, and flushed all of the data for that
> > WAL file, so the master has 0x74000000 as the start position (and has
> > consequently removed the WAL file for 0x73). The relationship between
> > pg_controldata and pg_replication_slots.restart_lsn seem to be very
> > slightly (but importantly) at odds.
> >
> > Could it be this part of the code?

No, the code does the right thing. The problem is that a continuation
record is assumed to be readable from a single WAL source, that is,
archive/local WAL or streaming. But here the continuation record is
distributed across two sources.

> > From src/backend/replication/walreceiverfuncs.c in RequestXLogStreaming
> > (as of a0aa358ca603d8189fe4be72f614cf7cf363d81a):
> >
> > 235     /*
> > 236      * We always start at the beginning of the segment. That prevents
> > a broken
> > 237      * segment (i.e., with no records in the first half of a segment)
> > from
> > 238      * being created by XLOG streaming, which might cause trouble
> > later on if
> > 239      * the segment is e.g archived.
> > 240      */
> > 241     if (recptr % XLogSegSize != 0)
> > 242         recptr -= recptr % XLogSegSize;
> > 243
> >
> > We start up with 19DD/0x73ffffe0 (but there would not be enough room in
> > that segment for any more records, so logically we'd have to go to
> > 19DD/0x74000000). When we start WAL receiving, we truncate 0x73ffffe0 to
> > 0x73000000, which the master has already removed (and - technically - we
> > don't actually need?).

0x73ffffe0 and 0x73000000 are in the same segment. The current
recovery mechanism requires reading the record that starts at
0x73ffffe0; that record is on the standby, so it is read from there.
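
To make the truncation concrete, a minimal sketch of what the quoted
"recptr -= recptr % XLogSegSize" adjustment does with these values
(XLogSegSize assumed to be the default 16MB):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    const uint64_t XLogSegSize = 16 * 1024 * 1024;          /* assumed default */
    uint64_t    recptr = (0x19DDULL << 32) | 0x73FFFFE0;    /* 19DD/73FFFFE0 */

    /* Same adjustment as in the quoted RequestXLogStreaming() snippet */
    if (recptr % XLogSegSize != 0)
        recptr -= recptr % XLogSegSize;

    /* Prints 19DD/73000000: the start of the same segment */
    printf("%X/%08X\n",
           (unsigned) (recptr >> 32), (unsigned) (recptr & 0xFFFFFFFF));
    return 0;
}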


regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index c6b54ec..39d8ae5 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -978,6 +978,29 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
             recvFile = XLogFileInit(recvSegNo, &use_existent, true);
             recvFileTLI = ThisTimeLineID;
             recvOff = 0;
 
+
+            if ((recptr & 0xffffffL) == 0)
+            {
+                XLogPageHeader ph = (XLogPageHeader) buf;
+                Assert(nbytes >= sizeof(SizeOfXLogShortPHD));
+                
+                elog(LOG, "############# CHECK AT %lX : %d",
+                     recptr, (ph->xlp_info & XLP_FIRST_IS_CONTRECORD) != 0);
+                if (ph->xlp_info & XLP_FIRST_IS_CONTRECORD)
+                {
+                    struct stat sbuf;
+                    if (stat("/tmp/hoge", &sbuf) == 0)
+                    {
+                        elog(LOG, "#################### STOP THE SERVER");
+                        system("pg_ctl stop -m f -W");
+                        while (1)
+                        {
+                            ProcessWalRcvInterrupts();
+                            sleep(1);
+                        }
+                    }
+                }
+            }
         }
 
         /* Calculate the start offset of the received logs */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 70edafa..df386a3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9129,7 +9129,12 @@ KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo)
     {
         XLogRecPtr    slotSegNo;
 
-        XLByteToSeg(keep, slotSegNo);
+        /*
+         * If we are just at the beginning of a segment, the previous segment
+         * might required by the next replication connection. So preserve
+         * extra one segment for the case.
+         */
+        XLByteToSeg(keep - 1, slotSegNo);
 
         if (slotSegNo <= 0)
             segno = 1;

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Kyotaro HORIGUCHI
Date:
Ouch! That is wrong.

It should decrement segno, not keep.

On Tue, Jan 17, 2017 at 19:37, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello. I added pgsql-hackers.

This occurs also on git master and back to 9.4.

At Fri, 13 Jan 2017 08:47:06 -0600, Jonathon Nelson <jdnelson@dyn.com> wrote in <CACJqAM1ydcZcd5DoCp+y5hkWto1ZeGW+Mj8UK7avqctbGJO8Bw@mail.gmail.com>
> On Mon, Nov 28, 2016 at 1:39 PM, Jonathon Nelson <jdnelson@dyn.com> wrote:
> > First, the postgresql configuration differs only minimally from the stock
> > config:
> >
> > Assume wal_keep_segments = 0.
> > Assume the use of physical replication slots.
> > Assume one master, one standby.
> >
> > Lastly, we have observed the behavior "in the wild" at least twice and in
> > the lab a dozen or so times.
> >
> > EXAMPLE #1 (logs are from the replica):
> >
> > user=,db=,app=,client= DEBUG:  creating and filling new WAL file
> > user=,db=,app=,client= DEBUG:  done creating and filling new WAL file
> > user=,db=,app=,client= DEBUG:  sending write 6/8B000000 flush 6/8A000000
> > apply 5/748425A0
> > user=,db=,app=,client= DEBUG:  sending write 6/8B000000 flush 6/8B000000
> > apply 5/74843020
> > <control-c here>
> > user=,db=,app=,client= DEBUG:  postmaster received signal 2
> > user=,db=,app=,client= LOG:  received fast shutdown request
> > user=,db=,app=,client= LOG:  aborting any active transactions
> >
> > And, upon restart:
> >
> > user=,db=,app=,client= LOG:  restartpoint starting: xlog
> > user=,db=,app=,client= DEBUG:  updated min recovery point to 6/67002390 on
> > timeline 1
> > user=,db=,app=,client= DEBUG:  performing replication slot checkpoint
> > user=,db=,app=,client= DEBUG:  updated min recovery point to 6/671768C0 on
> > timeline 1
> > user=,db=,app=,client= CONTEXT:  writing block 589 of relation
> > base/13294/16501
> > user=,db=,app=,client= LOG:  invalid magic number 0000 in log segment
> > 00000001000000060000008B, offset 0
> > user=,db=,app=,client= DEBUG:  switched WAL source from archive to stream
> > after failure
> > user=,db=,app=,client= LOG:  started streaming WAL from primary at
> > 6/8A000000 on timeline 1
> > user=,db=,app=,client= FATAL:  could not receive data from WAL stream:
> > ERROR:  requested WAL segment 00000001000000060000008A has already been
> > removed

I managed to reproduce this. A little tweak as the first patch
lets the standby to suicide as soon as walreceiver sees a
contrecord at the beginning of a segment.

- M(aster): createdb as a master with wal_keep_segments = 0
            (default), min_log_messages = debug2
- M: Create a physical repslot.
- S(tandby): Setup a standby database.
- S: Edit recovery.conf to use the replication slot above then
     start it.
- S: touch /tmp/hoge
- M: Run pgbench ...
- S: After a while, the standby stops.
  > LOG:  #################### STOP THE SERVER

- M: Stop pgbench.
- M: Do 'checkpoint;' twice.
- S: rm /tmp/hoge
- S: Fails to catch up with the following error.

  > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000002B has already been removed


This problem occurs when only the earlier parts in a continued
record is replicated then the segment is removed on the
master. In other words, the first half is only on standby, and
the second half is only on the master.

I believe that a continuation record cannot be span over three or
more *segments* (is it right?), so kepping one spare segment
would be enough. The attached second patch does this.


Other possible measures might be,

- Allowing switching wal source while reading a continuation
  record. Currently ReadRecord assumes that a continuation record
  can be read from single source. But this needs refactoring
  involving xlog.c, xlogreader.c and relatives.

- Delaying recycing a segment until the last partial record on it
  completes. This seems doable in page-wise (coarse resolution)
  but would cost additional reading of past xlog files (page
  header of past pages is required).

- Delaying write/flush feedback until the current record is
  completed. walreceiver is not conscious of a WAL record and
  this might break synchronous replication.

Any thoughts?


=========================================
> > A physical analysis shows that the WAL file 00000001000000060000008B is
> > 100% zeroes (ASCII NUL).

I suppose it is on the standby so the segment file is the one
where the next transferred record will be written onto.

> > The results of querying pg_replication_slots shows a restart_lsn that
> > matches ….6/8B.

It is the beginning of the next record to be replicatd as
documentation. In other words, just after the last transferred
record (containing padding).

> > Pg_controldata shows values like:
> > Minimum recovery ending location:     6/8Axxxxxx

It is the beginning of the last applied record.

> > How can the master show a position that is greater than the minimum
> > recovery ending location?

So it is natural that the former is larger than the latter.

> > EXAMPLE #2:
> >
> > Minimum recovery ending location:     19DD/73FFFFE0
> > Log segment 00000001000019DD00000073 was not available.
> > The restart LSN was 19DD/74000000.
> > The last few lines from pg_xlogdump 00000001000019DD00000073:
> >
> >
> > rmgr: Btree       len (rec/tot):      2/    64, tx:      77257, lsn:
> > 19DD/73FFFF60, prev 19DD/73FFFF20, desc: INSERT_LEAF off 132, blkref #0:
> > rel 1663/16403/150017028 blk 1832
> > rmgr: Btree       len (rec/tot):      2/    64, tx:      77257, lsn:
> > 19DD/73FFFFA0, prev 19DD/73FFFF60, desc: INSERT_LEAF off 206, blkref #0:
> > rel 1663/16403/150017028 blk 11709
> >
> > If I'm understanding this properly, (0x73FFFFA0 - 0x73000000) is the first
> > byte of the last record in this file, and the record length is 64 bytes
> > which places the first byte of the next record at: 16777184 (0xffffe0)
> > (logical position 0x73ffffe0: this jives with pg_controldata).

Maybe right. pg_xlogdump skips partial records.

> > However, there are only 32 bytes of file left:
> > 0x73FFFFA0 - 0x73000000 + 64 -=> 16777184
> > 16777216 - 16777184 -=> 32
> >
> > Which means that the next record is in the WAL file
> > 00000001000019DD00000074.

Maybe right.

> > A possible theory:
> >
> > Let us assume PG has applied 100% of the data in a given WAL file, and
> > let’s assume (as in this case) that the WAL file is
> > 00000001000019DD00000073.  When it starts up again, it uses the control
> > data to start and say “The next record is at 19DD/0x73ffffe0"  which it
> > truncates to 0x73000000.  However, PG has *also* already told the master
> > that is has fully received, written, and flushed all of the data for that
> > WAL file, so the master has 0x74000000 as the start position (and has
> > consequently removed the WAL file for 0x73). The relationship between
> > pg_controldata and pg_replication_slots.restart_lsn seem to be very
> > slightly (but importantly) at odds.
> >
> > Could it be this part of the code?

No. the code does the right thing. The problem is that a
continuation record is assumed to be on the same wal source, that
is, archive/wal and streaming. But a continueation record is
distributed to two sources.

> > From src/backend/replication/walreceiverfuncs.c in RequestXLogStreaming
> > (as of a0aa358ca603d8189fe4be72f614cf7cf363d81a):
> >
> > 235     /*
> > 236      * We always start at the beginning of the segment. That prevents
> > a broken
> > 237      * segment (i.e., with no records in the first half of a segment)
> > from
> > 238      * being created by XLOG streaming, which might cause trouble
> > later on if
> > 239      * the segment is e.g archived.
> > 240      */
> > 241     if (recptr % XLogSegSize != 0)
> > 242         recptr -= recptr % XLogSegSize;
> > 243
> >
> > We start up with 19DD/0x73ffffe0 (but there would not be enough room in
> > that segment for any more records, so logically we'd have to go to
> > 19DD/0x74000000). When we start WAL receiving, we truncate 0x73ffffe0 to
> > 0x73000000, which the master has already removed (and - technically - we
> > don't actually need?).

0x73ffffe0 and 0x73000000 are on the same segment. Current
recovery mechanism requires reading the record starts from
0x73ffffe0 and it is on the standby and it is read.


regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center


Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Michael Paquier
Date:
On Tue, Jan 17, 2017 at 7:36 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I managed to reproduce this. A little tweak as the first patch
> lets the standby to suicide as soon as walreceiver sees a
> contrecord at the beginning of a segment.

Good idea.

> I believe that a continuation record cannot be span over three or
> more *segments* (is it right?), so keeping one spare segment
> would be enough. The attached second patch does this.

I have to admit that I did not think about this problem much yet (I
bookmarked this report weeks ago to be honest as something to look
at), but that does not look right to me. Couldn't a record span even
more segments? Take a random string longer than 64MB, or even longer,
for example.
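
For scale, a simple arithmetic sketch of how many 16MB segments a
single record of a given xl_tot_len could touch (ignoring page and
record headers; segments_touched is a made-up helper for this example):

#include <stdio.h>
#include <stdint.h>

/* Rough upper bound on segments touched by one record. */
static uint64_t
segments_touched(uint64_t rec_len, uint64_t seg_size)
{
    return rec_len / seg_size + 2;    /* worst case: starts near a boundary */
}

int
main(void)
{
    const uint64_t seg = 16 * 1024 * 1024;

    printf("8KB record:  up to %llu segments\n",
           (unsigned long long) segments_touched(8 * 1024, seg));
    printf("64MB record: up to %llu segments\n",
           (unsigned long long) segments_touched(64ULL * 1024 * 1024, seg));
    return 0;
}

So a record longer than one segment would indeed span three or more
segments, which is the concern here.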

> Other possible measures might be,
>
> - Allowing switching wal source while reading a continuation
>   record. Currently ReadRecord assumes that a continuation record
>   can be read from single source. But this needs refactoring
>   involving xlog.c, xlogreader.c and relatives.

This is scary thinking about back-branches.

> - Delaying recycling a segment until the last partial record on it
>   completes. This seems doable in page-wise (coarse resolution)
>   but would cost additional reading of past xlog files (page
>   header of past pages is required).

Hm, yes. That looks like the least invasive way to go. At least that
looks more correct than the others.

> - Delaying write/flush feedback until the current record is
>   completed. walreceiver is not conscious of a WAL record and
>   this might break synchronous replication.

Not sure about this one yet.
-- 
Michael



Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Kyotaro HORIGUCHI
Date:
Hello,

At Wed, 18 Jan 2017 12:34:51 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqQytF2giE7FD-4oJJpPVwiKJrDQPc24hLNGThX01SbSmA@mail.gmail.com>
> On Tue, Jan 17, 2017 at 7:36 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > I managed to reproduce this. A little tweak as the first patch
> > lets the standby to suicide as soon as walreceiver sees a
> > contrecord at the beginning of a segment.
> 
> Good idea.

Thanks. Fortunately(?), the problematic situation seems to happen at
almost every segment boundary.

> > I believe that a continuation record cannot be span over three or
> > more *segments* (is it right?), so keeping one spare segment
> > would be enough. The attached second patch does this.
> 
> I have to admit that I did not think about this problem much yet (I
> bookmarked this report weeks ago to be honest as something to look
> at), but that does not look right to me. Couldn't a record be spawned
> across even more segments? Take a random string longer than 64MB or
> event longer for example.

I haven't looked closely at how a modification is split into WAL
records, but a tuple cannot be that long. As a simple test, I observed
rechdr->xl_tot_len at the end of XLogRecordAssemble while inserting an
about 400KB not-so-compressible string into a text column, and I saw
only a series of many records, each shorter than several thousand
bytes.

> > Other possible measures might be,
> >
> > - Allowing switching wal source while reading a continuation
> >   record. Currently ReadRecord assumes that a continuation record
> >   can be read from single source. But this needs refactoring
> >   involving xlog.c, xlogreader.c and relatives.
> 
> This is scary thinking about back-branches.

Yes. It would no longer be a bug fix. (Or it becomes quite an ugly hack..)

> > - Delaying recycling a segment until the last partial record on it
> >   completes. This seems doable in page-wise (coarse resolution)
> >   but would cost additional reading of past xlog files (page
> >   header of past pages is required).
> 
> Hm, yes. That looks like the least invasive way to go. At least that
> looks more correct than the others.

The attached patch does that. Usually it reads page headers only at
segment boundaries, but once a continuation record is found (or the
next page header fails to be read, that is, the first record on the
first page in the next segment has not been replicated yet), the check
happens at every page boundary until a non-continuation page comes.

I left some debug output (at LOG level) in the attached file, emitted
on every state change of the keep pointer. At least for pgbench, the
cost seems negligible.
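
A minimal standalone sketch of the check (the struct and flag below are
simplified stand-ins for XLogPageHeaderData and XLP_FIRST_IS_CONTRECORD;
the attached patch does the real thing by fetching the header with
XLogRead()):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for the WAL page header. */
typedef struct
{
    uint16_t    xlp_magic;
    uint16_t    xlp_info;
} FakePageHeader;

#define FAKE_FIRST_IS_CONTRECORD 0x0001

/*
 * The keep pointer may be advanced to this page only when the page does
 * not begin with the continuation of a record started on an earlier
 * page (and hence possibly in an earlier segment).
 */
static bool
page_is_safe_restart_point(const FakePageHeader *hdr)
{
    return (hdr->xlp_info & FAKE_FIRST_IS_CONTRECORD) == 0;
}

int
main(void)
{
    FakePageHeader cont  = { 0xD093, FAKE_FIRST_IS_CONTRECORD };
    FakePageHeader fresh = { 0xD093, 0 };

    printf("continuation page safe? %d\n", page_is_safe_restart_point(&cont));
    printf("fresh page safe?        %d\n", page_is_safe_restart_point(&fresh));
    return 0;
}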

> > - Delaying write/flush feedback until the current record is
> >   completed. walreceiver is not conscious of a WAL record and
> >   this might break synchronous replication.
> 
> Not sure about this one yet.

I'm not sure either :p

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index f3082c3..6b3abc5 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -217,7 +217,7 @@ static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, Tra
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static bool XLogRead(char *buf, XLogRecPtr startptr, Size count, bool noutfoundok);
 
 /* Initialize walsender process before entering the main command loop */
 
@@ -774,7 +774,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
         count = flushptr - targetPagePtr;
 
     /* now actually read the data, we know it's there */
 
-    XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+    XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, false);
 
     return count;
 }
@@ -1551,8 +1551,9 @@ static void
 ProcessStandbyReplyMessage(void)
 {
     XLogRecPtr    writePtr,
-                flushPtr,
-                applyPtr;
+                flushPtr, oldFlushPtr,
+                applyPtr,
+                keepPtr;
     bool        replyRequested;
 
     /* the caller already consumed the msgtype byte */
@@ -1580,24 +1581,99 @@ ProcessStandbyReplyMessage(void)
         WalSnd       *walsnd = MyWalSnd;
 
         SpinLockAcquire(&walsnd->mutex);
+        keepPtr = walsnd->keep;
+        oldFlushPtr = walsnd->flush;
         walsnd->write = writePtr;
         walsnd->flush = flushPtr;
         walsnd->apply = applyPtr;
         SpinLockRelease(&walsnd->mutex);
     }
 
+    /*
+     * If we are managed by a replication slot, maintain keepPtr on the page
+     * where the first fragment of the continuation record at flushPtr. Since
+     * this doesn't look into individual record, keepPtr may stay a bit too
+     * behind.
+     */
+    if (MyReplicationSlot &&
+        flushPtr != InvalidXLogRecPtr && oldFlushPtr != InvalidXLogRecPtr)
+    {
+        /*
+         * If keepPtr is cathing up, we do nothing until the next segment
+         * comes. Otherwise check on every page boundary.
+         */
+        if (oldKeepPtr == InvalidXLogRecPtr ?
+            keepPtr / XLOG_SEG_SIZE != flushPtr / XLOG_SEG_SIZE :
+            keepPtr / XLOG_BLCKSZ != flushPtr / XLOG_BLCKSZ)
+        {
+            XLogRecPtr oldKeepPtr = keepPtr;
+            XLogRecPtr rp;
+
+            if (keepPtr == InvalidXLogRecPtr)
+                keepPtr = oldFlushPtr;
+
+            rp = keepPtr - (keepPtr % XLOG_BLCKSZ);
+
+            /* We may have the record at flushPtr, so it's worth looking */
+            while (rp <= flushPtr)
+            {
+                XLogPageHeaderData header;
+
+                /*
+                 * If we don't have enough wal data, don't move keepPtr
+                 * forward. We may read it by the next chance.
+                 */
+                if(sentPtr - rp >= sizeof(XLogPageHeaderData))
+                {
+                    bool found;
+                    /*
+                     * Fetch the page header of the next page. Move keepPtr
+                     * forward only if when it is not a continuing page.
+                     */
+                    found = XLogRead((char *)&header,
+                                     rp, sizeof(XLogPageHeaderData), true);
+                    if (found &&
+                        (header.xlp_info & XLP_FIRST_IS_CONTRECORD) == 0)
+                        keepPtr = rp;
+                }
+                rp += XLOG_BLCKSZ;
+            }
+
+            /*
+             * If keepPtr is on the same page with flushPtr, it means catching
+             * up
+             */
+            if (keepPtr / XLOG_BLCKSZ == flushPtr / XLOG_BLCKSZ)
+                keepPtr = InvalidXLogRecPtr;
+
+            if (oldKeepPtr != keepPtr)
+            {
+                WalSnd       *walsnd = MyWalSnd;
+                elog(LOG, "%lX => %lX / %lX", oldKeepPtr, keepPtr, flushPtr); 
+                SpinLockAcquire(&walsnd->mutex);
+                walsnd->keep = keepPtr;
+                SpinLockRelease(&walsnd->mutex);
+            }
+        }
+    }
+
     if (!am_cascading_walsender)
         SyncRepReleaseWaiters();
 
     /*
      * Advance our local xmin horizon when the client confirmed a flush.
      */
 
-    if (MyReplicationSlot && flushPtr != InvalidXLogRecPtr)
+    if (MyReplicationSlot)
     {
-        if (SlotIsLogical(MyReplicationSlot))
+        if (SlotIsLogical(MyReplicationSlot) && flushPtr != InvalidXLogRecPtr)
             LogicalConfirmReceivedLocation(flushPtr);
         else
-            PhysicalConfirmReceivedLocation(flushPtr);
+        {
+            /* keepPtr == InvalidXLogRecPtr means catching up */
+            if (keepPtr == InvalidXLogRecPtr)
+                keepPtr = flushPtr;
+            PhysicalConfirmReceivedLocation(keepPtr);
+        }
     }
 }
@@ -2019,6 +2095,7 @@ WalSndKill(int code, Datum arg)
 
 /*
  * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
+ * Returns false if the segment file is not found iff notfoundok is true.
  *
  * XXX probably this should be improved to suck data directly from the
  * WAL buffers when possible.
@@ -2028,8 +2105,8 @@ WalSndKill(int code, Datum arg)
  * always be one descriptor left open until the process ends, but never
  * more than one.
  */
 
-static void
-XLogRead(char *buf, XLogRecPtr startptr, Size count)
+static bool
+XLogRead(char *buf, XLogRecPtr startptr, Size count, bool notfoundok)
 {
     char       *p;
     XLogRecPtr    recptr;
@@ -2106,10 +2183,15 @@ retry:
                  * removed or recycled.
                  */
                 if (errno == ENOENT)
+                {
+                    if (notfoundok)
+                        return false;
+
                     ereport(ERROR,
                             (errcode_for_file_access(),
                              errmsg("requested WAL segment %s has already been removed",
                                     XLogFileNameP(curFileTimeLine, sendSegNo))));
+                }
                 else
                     ereport(ERROR,
                             (errcode_for_file_access(),
@@ -2189,6 +2271,8 @@ retry:
             goto retry;
         }
     }
+
+    return true;
 }
 
 /*
@@ -2393,7 +2477,7 @@ XLogSendPhysical(void)
      * calls.
      */
     enlargeStringInfo(&output_message, nbytes);
-    XLogRead(&output_message.data[output_message.len], startptr, nbytes);
+    XLogRead(&output_message.data[output_message.len], startptr, nbytes, false);
     output_message.len += nbytes;
     output_message.data[output_message.len] = '\0';
 
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 5e6ccfc..084146d 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -47,6 +47,13 @@ typedef struct WalSnd
     XLogRecPtr    flush;
     XLogRecPtr    apply;
 
+    /*
+     * Segment-spanning continuation records requires that the all related
+     * segments preserved. This holds how far we should preserve older
+     * segments only when it differs to flush location.
+     */
+    XLogRecPtr    keep;
+
     /* Protects shared variables shown above. */
     slock_t        mutex;

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Kyotaro HORIGUCHI
Date:
Hello,

At Thu, 19 Jan 2017 18:37:31 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170119.183731.223893446.horiguchi.kyotaro@lab.ntt.co.jp>
> > > - Delaying recycling a segment until the last partial record on it
> > >   completes. This seems doable in page-wise (coarse resolution)
> > >   but would cost additional reading of past xlog files (page
> > >   header of past pages is required).
> > 
> > Hm, yes. That looks like the least invasive way to go. At least that
> > looks more correct than the others.
> 
> The attached patch does that. Usually it reads page headers only
> on segment boundaries, but once continuation record found (or
> failed to read the next page header, that is, the first record on
> the first page in the next segment has not been replicated), it
> becomes to happen on every page boundary until non-continuation
> page comes.
> 
> I leave a debug info (at LOG level) in the attached file shown on
> every state change of keep pointer. At least for pgbench, the
> cost seems ignorable.

I revised it. It became neater and less invasive.

- Removed the added keep member from struct WalSnd. It is never
  referenced from other processes, so it is a static variable now.

- Restore keepPtr from the replication slot on starting.

- Moved the main part to a more appropriate position.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index f3082c3..0270474 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -185,6 +185,12 @@ static volatile sig_atomic_t replication_active = false;
 static LogicalDecodingContext *logical_decoding_ctx = NULL;
 static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
+/*
+ * Segment keep pointer for physical slots. Has a valid value only when it
+ * differs from the current flush pointer.
+ */
+static XLogRecPtr       keepPtr = InvalidXLogRecPtr;
+
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
 static void WalSndXLogSendHandler(SIGNAL_ARGS);
@@ -217,7 +223,7 @@ static void WalSndPrepareWrite(LogicalDecodingContext *ctx, XLogRecPtr lsn, Tra
 static void WalSndWriteData(LogicalDecodingContext *ctx, XLogRecPtr lsn, TransactionId xid, bool last_write);
 static XLogRecPtr WalSndWaitForWal(XLogRecPtr loc);
 
-static void XLogRead(char *buf, XLogRecPtr startptr, Size count);
+static bool XLogRead(char *buf, XLogRecPtr startptr, Size count, bool noutfoundok);
 
 /* Initialize walsender process before entering the main command loop */
 
@@ -538,6 +544,9 @@ StartReplication(StartReplicationCmd *cmd)
             ereport(ERROR,
                     (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
                     (errmsg("cannot use a logical replication slot for physical replication"))));
 
+
+        /* Restore keepPtr from replication slot */
+        keepPtr = MyReplicationSlot->data.restart_lsn;
     }
 
     /*
@@ -553,6 +562,10 @@ StartReplication(StartReplicationCmd *cmd)
     else
         FlushPtr = GetFlushRecPtr();
 
+    /* Set InvalidXLogRecPtr if catching up */
+    if (keepPtr == FlushPtr)
+        keepPtr = InvalidXLogRecPtr;
+
     if (cmd->timeline != 0)
     {
         XLogRecPtr    switchpoint;
@@ -774,7 +787,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
         count = flushptr - targetPagePtr;
 
     /* now actually read the data, we know it's there */
 
-    XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ);
+    XLogRead(cur_page, targetPagePtr, XLOG_BLCKSZ, false);
 
     return count;
 }
@@ -1551,7 +1564,7 @@ static void
 ProcessStandbyReplyMessage(void)
 {
     XLogRecPtr    writePtr,
-                flushPtr,
+                flushPtr, oldFlushPtr,
                 applyPtr;
     bool        replyRequested;
 
@@ -1580,6 +1593,7 @@ ProcessStandbyReplyMessage(void)
         WalSnd       *walsnd = MyWalSnd;
 
         SpinLockAcquire(&walsnd->mutex);
+        oldFlushPtr = walsnd->flush;
         walsnd->write = writePtr;
         walsnd->flush = flushPtr;
         walsnd->apply = applyPtr;
 
@@ -1597,7 +1611,78 @@ ProcessStandbyReplyMessage(void)
         if (SlotIsLogical(MyReplicationSlot))
             LogicalConfirmReceivedLocation(flushPtr);
         else
 
-            PhysicalConfirmReceivedLocation(flushPtr);
+        {
+            /*
+             * On recovery, a continuation reocrd must be available from
+             * single WAL source. So physical replication slot should stay in
+             * the first segment for a continuation record spanning multiple
+             * segments. Since this doesn't look into individual record,
+             * keepPtr may stay a bit too behind.
+             *
+             * Since the objective is avoding to remove required segments,
+             * checking every segment is enough. But once keepPtr goes behind,
+             * check every page for quick restoration.
+             *
+             * keepPtr has a valid value only when it is behind flushPtr.
+             */
+            if (oldFlushPtr != InvalidXLogRecPtr &&
+                (keepPtr == InvalidXLogRecPtr ?
+                 oldFlushPtr / XLOG_SEG_SIZE != flushPtr / XLOG_SEG_SIZE :
+                 keepPtr / XLOG_BLCKSZ != flushPtr / XLOG_BLCKSZ))
+            {
+                XLogRecPtr rp;
+                XLogRecPtr oldKeepPtr = keepPtr; /* for debug */
+
+                if (keepPtr == InvalidXLogRecPtr)
+                    keepPtr = oldFlushPtr;
+
+                rp = keepPtr - (keepPtr % XLOG_BLCKSZ);
+
+                /*
+                 * We may have let the record at flushPtr sent, so it's worth
+                 * looking
+                 */
+                while (rp <= flushPtr)
+                {
+                    XLogPageHeaderData header;
+
+                    /*
+                     * If the page header is not available for now, don't move
+                     * keepPtr forward. We can read it by the next chance.
+                     */
+                    if(sentPtr - rp >= sizeof(XLogPageHeaderData))
+                    {
+                        bool found;
+                        /*
+                         * Fetch the page header of the next page. Move
+                         * keepPtr forward only if when it is not a
+                         * continuation page.
+                         */
+                        found = XLogRead((char *)&header, rp,
+                                             sizeof(XLogPageHeaderData), true);
+                        if (found &&
+                            (header.xlp_info & XLP_FIRST_IS_CONTRECORD) == 0)
+                            keepPtr = rp;
+                    }
+                    rp += XLOG_BLCKSZ;
+                }
+
+                /*
+                 * If keepPtr is on the same page with flushPtr, it means that
+                 * we are catching up
+                 */
+                if (keepPtr / XLOG_BLCKSZ == flushPtr / XLOG_BLCKSZ)
+                    keepPtr = InvalidXLogRecPtr;
+
+                if (oldKeepPtr != keepPtr)
+                    elog(LOG, "%lX => %lX / %lX",
+                         oldKeepPtr, keepPtr, flushPtr); 
+            }
+
+            /* keepPtr == InvalidXLogRecPtr means catching up */
+            PhysicalConfirmReceivedLocation(keepPtr != InvalidXLogRecPtr ?
+                                            keepPtr : flushPtr);
+        }
     }
 }
@@ -2019,6 +2104,7 @@ WalSndKill(int code, Datum arg)
 
 /*
  * Read 'count' bytes from WAL into 'buf', starting at location 'startptr'
+ * Returns false if the segment file is not found when notfoundok is true.
  *
  * XXX probably this should be improved to suck data directly from the
  * WAL buffers when possible.
@@ -2028,8 +2114,8 @@ WalSndKill(int code, Datum arg)
  * always be one descriptor left open until the process ends, but never
  * more than one.
  */
 
-static void
-XLogRead(char *buf, XLogRecPtr startptr, Size count)
+static bool
+XLogRead(char *buf, XLogRecPtr startptr, Size count, bool notfoundok)
 {
     char       *p;
     XLogRecPtr    recptr;
@@ -2106,10 +2192,15 @@ retry:
                  * removed or recycled.
                  */
                 if (errno == ENOENT)
+                {
+                    if (notfoundok)
+                        return false;
+
                     ereport(ERROR,
                             (errcode_for_file_access(),
                              errmsg("requested WAL segment %s has already been removed",
                                     XLogFileNameP(curFileTimeLine, sendSegNo))));
+                }
                 else
                     ereport(ERROR,
                             (errcode_for_file_access(),
@@ -2189,6 +2280,8 @@ retry:
             goto retry;
         }
     }
+
+    return true;
 }
 
 /*
@@ -2393,7 +2486,7 @@ XLogSendPhysical(void)
      * calls.
      */
     enlargeStringInfo(&output_message, nbytes);
-    XLogRead(&output_message.data[output_message.len], startptr, nbytes);
+    XLogRead(&output_message.data[output_message.len], startptr, nbytes, false);
     output_message.len += nbytes;
     output_message.data[output_message.len] = '\0';

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Kyotaro HORIGUCHI
Date:
Hello, I'll add the rebased version to the next CF.

At Fri, 20 Jan 2017 11:07:29 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170120.110729.107284864.horiguchi.kyotaro@lab.ntt.co.jp>
> > > > - Delaying recycling a segment until the last partial record on it
> > > >   completes. This seems doable in page-wise (coarse resolution)
> > > >   but would cost additional reading of past xlog files (page
> > > >   header of past pages is required).
> > > 
> > > Hm, yes. That looks like the least invasive way to go. At least that
> > > looks more correct than the others.
> > 
> > The attached patch does that. Usually it reads page headers only
> > on segment boundaries, but once continuation record found (or
> > failed to read the next page header, that is, the first record on
> > the first page in the next segment has not been replicated), it
> > becomes to happen on every page boundary until non-continuation
> > page comes.
> > 
> > I leave a debug info (at LOG level) in the attached file shown on
> > every state change of keep pointer. At least for pgbench, the
> > cost seems ignorable.
> 
> I revised it. It became neater and less invasive.
> 
>  - Removed added keep from struct WalSnd. It is never referrenced
>    from other processes. It is static variable now.
> 
>  - Restore keepPtr from replication slot on starting.

keepPtr is renamed to a more meaningful name restartLSN.

>  - Moved the main part to more appropriate position.

- Removed the debug print code.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Fujii Masao
Date:
On Thu, Jan 19, 2017 at 6:37 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> Hello,
>
> At Wed, 18 Jan 2017 12:34:51 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqQytF2giE7FD-4oJJpPVwiKJrDQPc24hLNGThX01SbSmA@mail.gmail.com>
>> On Tue, Jan 17, 2017 at 7:36 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> > I managed to reproduce this. A little tweak as the first patch
>> > lets the standby to suicide as soon as walreceiver sees a
>> > contrecord at the beginning of a segment.
>>
>> Good idea.
>
> Thanks. Fortunately(?), the problematic situation seems to happen
> at almost all segment boundary.
>
>> > I believe that a continuation record cannot be span over three or
>> > more *segments* (is it right?), so keeping one spare segment
>> > would be enough. The attached second patch does this.
>>
>> I have to admit that I did not think about this problem much yet (I
>> bookmarked this report weeks ago to be honest as something to look
>> at), but that does not look right to me. Couldn't a record be spawned
>> across even more segments? Take a random string longer than 64MB or
>> event longer for example.
>
> Though I haven't look closer to how a modification is splitted
> into WAL records. A tuple cannot be so long. As a simple test, I
> observed rechder->xl_tot_len at the end of XLogRecordAssemble
> inserting an about 400KB not-so-compressable string into a text
> column, but I saw a series of many records with shorter than
> several thousand bytes.
>
>> > Other possible measures might be,
>> >
>> > - Allowing switching wal source while reading a continuation
>> >   record. Currently ReadRecord assumes that a continuation record
>> >   can be read from single source. But this needs refactoring
>> >   involving xlog.c, xlogreader.c and relatives.
>>
>> This is scary thinking about back-branches.
>
> Yes. It would be no longer a bug fix. (Or becomes a quite ugly hack..)
>
>> > - Delaying recycling a segment until the last partial record on it
>> >   completes. This seems doable in page-wise (coarse resolution)
>> >   but would cost additional reading of past xlog files (page
>> >   header of past pages is required).
>>
>> Hm, yes. That looks like the least invasive way to go. At least that
>> looks more correct than the others.
>
> The attached patch does that. Usually it reads page headers only
> on segment boundaries, but once continuation record found (or
> failed to read the next page header, that is, the first record on
> the first page in the next segment has not been replicated), it
> becomes to happen on every page boundary until non-continuation
> page comes.

I'm afraid that many WAL segments would start with a continuation record
under a workload of short transactions (e.g., pgbench), which would make
restart_lsn fall behind very much. No?

The discussion on this thread just makes me think that restart_lsn should
indicate the replay location instead of flush location. This seems safer.

Regards,

-- 
Fujii Masao



Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Kyotaro HORIGUCHI
Date:
Thank you for the comment.

At Thu, 2 Feb 2017 01:26:03 +0900, Fujii Masao <masao.fujii@gmail.com> wrote in
<CAHGQGwEET=QBA_jND=xhrXn+9ZreP4_qMBAqsBZg56beqxbveg@mail.gmail.com>
> > The attached patch does that. Usually it reads page headers only
> > on segment boundaries, but once continuation record found (or
> > failed to read the next page header, that is, the first record on
> > the first page in the next segment has not been replicated), it
> > becomes to happen on every page boundary until non-continuation
> > page comes.
> 
> I'm afraid that many WAL segments would start with a continuation record
> when there are the workload of short transactions (e.g., by pgbench), and
> which would make restart_lsn go behind very much. No?

I agree. So the patch tries to release the restriction at every page
boundary, but restart_lsn would still fall behind a lot if many
contiguous pages were CONTRECORD. I think the chance that such a
situation persists for one or more whole segments is negligibly low.
That said, there *is* a possibility of false continuation, anyway.

> The discussion on this thread just makes me think that restart_lsn should
> indicate the replay location instead of flush location. This seems safer.

The standby restarts from minRecoveryPoint, which is a copy of
XLogCtl->replayEndRecPtr updated by UpdateMinRecoveryPoint(). Meanwhile,
applyPtr in reply messages is a copy of XLogCtl->lastReplayedEndRecPtr,
which is updated after the update of the on-disk minRecoveryPoint. It
seems safe from that viewpoint.

On the other hand, apply can be paused. Records are copied and flushed
on the standby, so the segments already sent can safely be removed on
the master even in that case. In spite of that, older segments on the
master would be kept from being removed during the pause. If applyPtr
were used as restart_lsn, this could be another problem, and it is sure
to happen.

I'm not sure how likely it is that several contiguous segments are full
of continuation pages. But I think needless pg_wal flooding caused by an
apply pause is worse.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Michael Paquier
Date:
On Thu, Feb 2, 2017 at 1:26 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> I'm afraid that many WAL segments would start with a continuation record
> when there are the workload of short transactions (e.g., by pgbench), and
> which would make restart_lsn go behind very much. No?

I don't quite understand this argument. Even if there are many small
transactions, that would cause restart_lsn to just be late by one
segment, all the time.

> The discussion on this thread just makes me think that restart_lsn should
> indicate the replay location instead of flush location. This seems safer.

That would penalize WAL retention on the primary with standbys using
recovery_min_apply_delay and a slot for example...

We can attempt to address this problem in two ways. The patch proposed
(ugly btw, and there are two typos!) does it in the WAL sender by not
making restart_lsn jump to the next segment if a continuation record is
found. Or we could have the standby request the next segment instead,
if the record it wants to replay from is at a boundary and it locally
has the beginning of the record, which it does because it already
confirmed to the primary that it flushed up to the next segment. Not
sure which fix is better though.
-- 
Michael



Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Kyotaro HORIGUCHI
Date:
At Thu, 2 Feb 2017 15:34:33 +0900, Michael Paquier <michael.paquier@gmail.com> wrote in
<CAB7nPqQ05G15JooRMEONgPkW0osot77yaFAUF9_6Q8G+v+2+xg@mail.gmail.com>
> On Thu, Feb 2, 2017 at 1:26 AM, Fujii Masao <masao.fujii@gmail.com> wrote:
> > I'm afraid that many WAL segments would start with a continuation record
> > when there are the workload of short transactions (e.g., by pgbench), and
> > which would make restart_lsn go behind very much. No?
> 
> I don't quite understand this argument. Even if there are many small
> transactions, that would cause restart_lsn to just be late by one
> segment, all the time.
> 
> > The discussion on this thread just makes me think that restart_lsn should
> > indicate the replay location instead of flush location. This seems safer.
> 
> That would penalize WAL retention on the primary with standbys using
> recovery_min_apply_delay and a slot for example...
> 
> We can attempt to address this problem two ways. The patch proposed
> (ugly btw and there are two typos!) is doing it in the WAL sender by
> not making restart_lsn jump to the next segment if a continuation
> record is found.

Sorry for the ugliness :p Anyway, the previous version was not the
latest one. The attached one is the revised version. (Sorry, I haven't
found the typos myself..)

>  Or we could have the standby request for the next
> segment instead if the record it wants to replay from is at a boundary
> and that it locally has the beginning of the record, and it has it
> because it already confirmed to the primary that it flushed to the
> next segment. Not sure which fix is better though.

We could do it as I said, with some refactoring of ReadRecord involving
the reader plugin mechanism..

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Venkata B Nagothi
Date:

On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello. I added pgsql-hackers.

This occurs also on git master and back to 9.4.

At Fri, 13 Jan 2017 08:47:06 -0600, Jonathon Nelson <jdnelson@dyn.com> wrote in <CACJqAM1ydcZcd5DoCp+y5hkWto1ZeGW+Mj8UK7avqctbGJO8Bw@mail.gmail.com>
> On Mon, Nov 28, 2016 at 1:39 PM, Jonathon Nelson <jdnelson@dyn.com> wrote:
> > First, the postgresql configuration differs only minimally from the stock
> > config:
> >
> > Assume wal_keep_segments = 0.
> > Assume the use of physical replication slots.
> > Assume one master, one standby.
> >
> > Lastly, we have observed the behavior "in the wild" at least twice and in
> > the lab a dozen or so times.
> >
> > EXAMPLE #1 (logs are from the replica):
> >
> > user=,db=,app=,client= DEBUG:  creating and filling new WAL file
> > user=,db=,app=,client= DEBUG:  done creating and filling new WAL file
> > user=,db=,app=,client= DEBUG:  sending write 6/8B000000 flush 6/8A000000
> > apply 5/748425A0
> > user=,db=,app=,client= DEBUG:  sending write 6/8B000000 flush 6/8B000000
> > apply 5/74843020
> > <control-c here>
> > user=,db=,app=,client= DEBUG:  postmaster received signal 2
> > user=,db=,app=,client= LOG:  received fast shutdown request
> > user=,db=,app=,client= LOG:  aborting any active transactions
> >
> > And, upon restart:
> >
> > user=,db=,app=,client= LOG:  restartpoint starting: xlog
> > user=,db=,app=,client= DEBUG:  updated min recovery point to 6/67002390 on
> > timeline 1
> > user=,db=,app=,client= DEBUG:  performing replication slot checkpoint
> > user=,db=,app=,client= DEBUG:  updated min recovery point to 6/671768C0 on
> > timeline 1
> > user=,db=,app=,client= CONTEXT:  writing block 589 of relation
> > base/13294/16501
> > user=,db=,app=,client= LOG:  invalid magic number 0000 in log segment
> > 00000001000000060000008B, offset 0
> > user=,db=,app=,client= DEBUG:  switched WAL source from archive to stream
> > after failure
> > user=,db=,app=,client= LOG:  started streaming WAL from primary at
> > 6/8A000000 on timeline 1
> > user=,db=,app=,client= FATAL:  could not receive data from WAL stream:
> > ERROR:  requested WAL segment 00000001000000060000008A has already been
> > removed

I managed to reproduce this. A little tweak as the first patch
lets the standby to suicide as soon as walreceiver sees a
contrecord at the beginning of a segment.

- M(aster): createdb as a master with wal_keep_segments = 0
            (default), min_log_messages = debug2
- M: Create a physical repslot.
- S(tandby): Setup a standby database.
- S: Edit recovery.conf to use the replication slot above then
     start it.
- S: touch /tmp/hoge
- M: Run pgbench ...
- S: After a while, the standby stops.
  > LOG:  #################### STOP THE SERVER

- M: Stop pgbench.
- M: Do 'checkpoint;' twice.
- S: rm /tmp/hoge
- S: Fails to catch up with the following error.

  > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000002B has already been removed


I have been testing / reviewing the latest patch "0001-Fix-a-bug-of-physical-replication-slot.patch" and I think I might need some more clarification on this.

Before applying the patch, I tried re-producing the above error -

- I had master->standby in streaming replication
- Took the backup of master
   - with a low max_wal_size and wal_keep_segments = 0
- Configured standby with recovery.conf
- Created replication slot on master
- Configured the replication slot on standby and started the standby
- I got the below error
  
   >> 2017-03-10 11:58:15.704 AEDT [478] LOG:  invalid record length at 0/F2000140: wanted 24, got 0
   >> 2017-03-10 11:58:15.706 AEDT [481] LOG:  started streaming WAL from primary at 0/F2000000 on timeline 1
   >> 2017-03-10 11:58:15.706 AEDT [481] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 0000000100000000000000F2 has already been removed

and I noticed that the file "0000000100000000000000F2" had been removed from the master. This can be easily reproduced, and it occurs irrespective of whether replication slots are configured.

As long as the file "0000000100000000000000F2" is available on the master, the standby continues to stream WALs without any issues.

some more details -

Contents of the file  "0000000100000000000000F2" on standby before pg_stop_backup()

rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2000028, prev 0/F1000098, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2000060, prev 0/F2000028, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
rmgr: XLOG        len (rec/tot):     80/   106, tx:          0, lsn: 0/F2000098, prev 0/F2000060, desc: CHECKPOINT_ONLINE redo 0/F2000060; tli 1; prev tli 1; fpw true; xid 0:638; oid 16487; multi 1; offset 0; oldest xid 544 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 638; online
rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2000108, prev 0/F2000098, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
pg_waldump: FATAL:  error in WAL record at 0/F2000108: invalid record length at 0/F2000140: wanted 24, got 0

Contents of the file on master after pg_stop_backup()

rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2000028, prev 0/F1000098, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2000060, prev 0/F2000028, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
rmgr: XLOG        len (rec/tot):     80/   106, tx:          0, lsn: 0/F2000098, prev 0/F2000060, desc: CHECKPOINT_ONLINE redo 0/F2000060; tli 1; prev tli 1; fpw true; xid 0:638; oid 16487; multi 1; offset 0; oldest xid 544 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 638; online
rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2000108, prev 0/F2000098, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
rmgr: Heap2       len (rec/tot):      8/  7735, tx:          0, lsn: 0/F2000140, prev 0/F2000108, desc: CLEAN remxid 620, blkref #0: rel 1663/13179/2619 blk 2 FPW
rmgr: Heap2       len (rec/tot):      8/  6863, tx:          0, lsn: 0/F2001F78, prev 0/F2000140, desc: CLEAN remxid 620, blkref #0: rel 1663/13179/2840 blk 0 FPW
rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2003A60, prev 0/F2001F78, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2003A98, prev 0/F2003A60, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
rmgr: XLOG        len (rec/tot):     80/   106, tx:          0, lsn: 0/F2003AD0, prev 0/F2003A98, desc: CHECKPOINT_ONLINE redo 0/F2003A98; tli 1; prev tli 1; fpw true; xid 0:638; oid 16487; multi 1; offset 0; oldest xid 544 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 638; online
rmgr: Standby     len (rec/tot):     24/    50, tx:          0, lsn: 0/F2003B40, prev 0/F2003AD0, desc: RUNNING_XACTS nextXid 638 latestCompletedXid 637 oldestRunningXid 638
rmgr: XLOG        len (rec/tot):      8/    34, tx:          0, lsn: 0/F2003B78, prev 0/F2003B40, desc: BACKUP_END 0/F2000060
rmgr: XLOG        len (rec/tot):      0/    24, tx:          0, lsn: 0/F2003BA0, prev 0/F2003B78, desc: SWITCH

If the scenario I created to reproduce the error is correct, then applying the patch does not make a difference.

I think I need help in building a specific test case which will reproduce the specific bug related to physical replication slots as reported.

I will continue to review the patch and follow up once I have any comments on this.

Regards,
Venkata B N

Database Consultant


Re: [HACKERS] [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Kyotaro HORIGUCHI
Date:
Hello,

At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi <nag1010@gmail.com> wrote in
<CAEyp7J-4MmVwGoZSwvaSULZC80JDD_tL-9KsNiqF17+bNqiSBg@mail.gmail.com>
> On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > I managed to reproduce this. A little tweak as the first patch
> > lets the standby to suicide as soon as walreceiver sees a
> > contrecord at the beginning of a segment.
> >
> > - M(aster): createdb as a master with wal_keep_segments = 0
> >             (default), min_log_messages = debug2
> > - M: Create a physical repslot.
> > - S(tandby): Setup a standby database.
> > - S: Edit recovery.conf to use the replication slot above then
> >      start it.
> > - S: touch /tmp/hoge
> > - M: Run pgbench ...
> > - S: After a while, the standby stops.
> >   > LOG:  #################### STOP THE SERVER
> >
> > - M: Stop pgbench.
> > - M: Do 'checkpoint;' twice.
> > - S: rm /tmp/hoge
> > - S: Fails to catch up with the following error.
> >
> >   > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL
> > segment 00000001000000000000002B has already been removed
> >
> >
> I have been testing / reviewing the latest patch
> "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> need some more clarification on this.
> 
> Before applying the patch, I tried re-producing the above error -
> 
> - I had master->standby in streaming replication
> - Took the backup of master
>    - with a low max_wal_size and wal_keep_segments = 0
> - Configured standby with recovery.conf
> - Created replication slot on master
> - Configured the replication slot on standby and started the standby

I suppose the "configure" means primary_slot_name in recovery.conf.

> - I got the below error
> 
>    >> 2017-03-10 11:58:15.704 AEDT [478] LOG:  invalid record length at
> 0/F2000140: wanted 24, got 0
>    >> 2017-03-10 11:58:15.706 AEDT [481] LOG:  started streaming WAL from
> primary at 0/F2000000 on timeline 1
>    >> 2017-03-10 11:58:15.706 AEDT [481] FATAL:  could not receive data
> from WAL stream: ERROR:  requested WAL segment 0000000100000000000000F2 has
> already been removed

Maybe you created the master slot in non-reserve (default) mode and left a pause of a few minutes after taking the backup and before starting the standby. In that case the master slot doesn't keep WAL segments until the standby connects, so a couple of checkpoints can blow away the first segment required by the standby. This is quite reasonable behavior. The following steps make this more certain to happen.

> - Took the backup of master
>    - with a low max_wal_size = 2 and wal_keep_segments = 0
> - Configured standby with recovery.conf
> - Created replication slot on master
+ - SELECT pg_switch_wal(); on master twice.
+ - checkpoint; on master twice.
> - Configured the replication slot on standby and started the standby

Creating the slot with the following command, which reserves WAL immediately, will avoid this.

=# select pg_create_physical_replication_slot('s1', true);
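For reference, a minimal way to see the difference: with the non-reserve form,
the slot's restart_lsn stays NULL until a standby first connects, which is
exactly the window in which a couple of checkpoints can recycle the needed
segment. A quick check (just a sketch, using the slot name 's1' above):

=# select slot_name, restart_lsn from pg_replication_slots where slot_name = 's1';

A non-NULL restart_lsn right after creation means WAL is being retained from
that point on.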


> and i could notice that the file "0000000100000000000000F2" was removed
> from the master. This can be easily re-produced and this occurs
> irrespective of configuring replication slots.
> 
> As long as the file "0000000100000000000000F2" is available on the master,
> standby continues to stream WALs without any issues.
...
> If the scenario i created to reproduce the error is correct, then, applying
> the patch is not making a difference.

Yes, the patch is not meant to save this case. The patch saves the case where
the segment just before the first segment required by the standby has been
removed while it contains the first part of a record that continues into that
first required segment. Your case, on the other hand, is simply that the
segment at the standby's start point has been removed.

> I think, i need help in building a specific test case which will re-produce
> the specific BUG related to physical replication slots as reported ?
> 
> Will continue to review the patch, once i have any comments on this.

Thanks a lot!

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Kyotaro HORIGUCHI
Date:
This conflicts with 6912acc (replication lag tracker) so just
rebased on a6f22e8.

At Fri, 17 Mar 2017 16:48:27 +0900 (Tokyo Standard Time), Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote in
<20170317.164827.46663014.horiguchi.kyotaro@lab.ntt.co.jp>
> Hello,
> 
> At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi <nag1010@gmail.com> wrote in
<CAEyp7J-4MmVwGoZSwvaSULZC80JDD_tL-9KsNiqF17+bNqiSBg@mail.gmail.com>
> > On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> > horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > I managed to reproduce this. A little tweak as the first patch
> > > lets the standby to suicide as soon as walreceiver sees a
> > > contrecord at the beginning of a segment.
> > >
> > > - M(aster): createdb as a master with wal_keep_segments = 0
> > >             (default), min_log_messages = debug2
> > > - M: Create a physical repslot.
> > > - S(tandby): Setup a standby database.
> > > - S: Edit recovery.conf to use the replication slot above then
> > >      start it.
> > > - S: touch /tmp/hoge
> > > - M: Run pgbench ...
> > > - S: After a while, the standby stops.
> > >   > LOG:  #################### STOP THE SERVER
> > >
> > > - M: Stop pgbench.
> > > - M: Do 'checkpoint;' twice.
> > > - S: rm /tmp/hoge
> > > - S: Fails to catch up with the following error.
> > >
> > >   > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL
> > > segment 00000001000000000000002B has already been removed
> > >
> > >
> > I have been testing / reviewing the latest patch
> > "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> > need some more clarification on this.
> > 
> > Before applying the patch, I tried re-producing the above error -
> > 
> > - I had master->standby in streaming replication
> > - Took the backup of master
> >    - with a low max_wal_size and wal_keep_segments = 0
> > - Configured standby with recovery.conf
> > - Created replication slot on master
> > - Configured the replication slot on standby and started the standby
> 
> I suppose the "configure" means primary_slot_name in recovery.conf.
> 
> > - I got the below error
> > 
> >    >> 2017-03-10 11:58:15.704 AEDT [478] LOG:  invalid record length at
> > 0/F2000140: wanted 24, got 0
> >    >> 2017-03-10 11:58:15.706 AEDT [481] LOG:  started streaming WAL from
> > primary at 0/F2000000 on timeline 1
> >    >> 2017-03-10 11:58:15.706 AEDT [481] FATAL:  could not receive data
> > from WAL stream: ERROR:  requested WAL segment 0000000100000000000000F2 has
> > already been removed
> 
> Maybe you created the master slot with non-reserve (default) mode
> and put a some-minites pause after making the backup and before
> starting the standby. For the case the master slot doesn't keep
> WAL segments unless the standby connects so a couple of
> checkpoints can blow away the first segment required by the
> standby. This is quite reasonable behavior. The following steps
> makes this more sure.
> 
> > - Took the backup of master
> >    - with a low max_wal_size = 2 and wal_keep_segments = 0
> > - Configured standby with recovery.conf
> > - Created replication slot on master
> + - SELECT pg_switch_wal(); on master twice.
> + - checkpoint; on master twice.
> > - Configured the replication slot on standby and started the standby
> 
> Creating the slot with the following command will save it.
> 
> =# select pg_create_physical_replication_slot('s1', true);
> 
> 
> > and i could notice that the file "0000000100000000000000F2" was removed
> > from the master. This can be easily re-produced and this occurs
> > irrespective of configuring replication slots.
> > 
> > As long as the file "0000000100000000000000F2" is available on the master,
> > standby continues to stream WALs without any issues.
> ...
> > If the scenario i created to reproduce the error is correct, then, applying
> > the patch is not making a difference.
> 
> Yes, the patch is not for saving this case. The patch saves the
> case where the previous segment to the first required segment by
> standby was removed and it contains the first part of a record
> continues to the first required segment. On the other hand this
> case is that the segment at the start point of standby is just
> removed.
> 
> > I think, i need help in building a specific test case which will re-produce
> > the specific BUG related to physical replication slots as reported ?
> > 
> > Will continue to review the patch, once i have any comments on this.
> 
> Thaks a lot!

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Venkata B Nagothi
Date:



Regards,

Venkata B N
Database Consultant
 

On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
This conflicts with 6912acc (replication lag tracker) so just
rebased on a6f22e8.

I tried applying this patch to the latest master, but it does not apply:

[dba@buildhost postgresql]$ git apply /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28: trailing whitespace.
/*
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29: trailing whitespace.
 * This variable corresponds to restart_lsn in pg_replication_slots for a
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:30: trailing whitespace.
 * physical slot. This has a valid value only when it differs from the current
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:31: trailing whitespace.
 * flush pointer.
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:32: trailing whitespace.
 */
error: patch failed: src/backend/replication/walsender.c:210
error: src/backend/replication/walsender.c: patch does not apply

Regards,

Venkata Balaji N
Database Consultant

Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Michael Paquier
Date:
On Thu, Mar 30, 2017 at 8:49 AM, Venkata B Nagothi <nag1010@gmail.com> wrote:
> On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I tried applying this patch to latest master, it is not getting applied
>
> [dba@buildhost postgresql]$ git apply
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
>
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
> trailing whitespace.
> /*
>
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
> trailing whitespace.
>  * This variable corresponds to restart_lsn in pg_replication_slots for a
>
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:30:
> trailing whitespace.
>  * physical slot. This has a valid value only when it differs from the
> current
>
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:31:
> trailing whitespace.
>  * flush pointer.
>
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:32:
> trailing whitespace.
>  */
> error: patch failed: src/backend/replication/walsender.c:210
> error: src/backend/replication/walsender.c: patch does not apply

git apply and git am can be very picky sometimes, so you may want to fall
back to patch -p1 if things don't work. In this case it does.
-- 
Michael



Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Venkata B Nagothi
Date:

On Thu, Mar 30, 2017 at 10:55 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Thu, Mar 30, 2017 at 8:49 AM, Venkata B Nagothi <nag1010@gmail.com> wrote:
> On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> I tried applying this patch to latest master, it is not getting applied
>
> [dba@buildhost postgresql]$ git apply
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
> trailing whitespace.
> /*
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
> trailing whitespace.
>  * This variable corresponds to restart_lsn in pg_replication_slots for a
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:30:
> trailing whitespace.
>  * physical slot. This has a valid value only when it differs from the
> current
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:31:
> trailing whitespace.
>  * flush pointer.
> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:32:
> trailing whitespace.
>  */
> error: patch failed: src/backend/replication/walsender.c:210
> error: src/backend/replication/walsender.c: patch does not apply

git apply and git am can be very picky sometimes, so you may want to
fallback to patch -p1 if things don't work. In this case it does.

patch -p1 seems to be working. Thanks!

Regards,

Venkata B N
Database Consultant

Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Tatsuo Ishii
Date:
> On Thu, Mar 30, 2017 at 8:49 AM, Venkata B Nagothi <nag1010@gmail.com> wrote:
>> On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI
>> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>> I tried applying this patch to latest master, it is not getting applied
>>
>> [dba@buildhost postgresql]$ git apply
>> /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
>>
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
>> trailing whitespace.
>> /*
>>
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
>> trailing whitespace.
>>  * This variable corresponds to restart_lsn in pg_replication_slots for a
>>
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:30:
>> trailing whitespace.
>>  * physical slot. This has a valid value only when it differs from the
>> current
>>
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:31:
>> trailing whitespace.
>>  * flush pointer.
>>
/data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:32:
>> trailing whitespace.
>>  */
>> error: patch failed: src/backend/replication/walsender.c:210
>> error: src/backend/replication/walsender.c: patch does not apply
> 
> git apply and git am can be very picky sometimes, so you may want to
> fallback to patch -p1 if things don't work. In this case it does.

Committers will not apply patches which have trailing whitespace issues, so
the patch submitter needs to fix them anyway.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Michael Paquier
Date:
On Thu, Mar 30, 2017 at 9:12 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
> Committers will not apply patches which has trailing whitespace
> issues. So the patch submitter needs to fix them anyway.

I cannot comment on that point (committers are free to pick up things the way
they want), but whether git commands can apply a patch should not be an
obstacle to a review, as long as the patch can be easily applied and roughly
respects GNU's diff format.
-- 
Michael



Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Tatsuo Ishii
Date:
> On Thu, Mar 30, 2017 at 9:12 AM, Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
>> Committers will not apply patches which has trailing whitespace
>> issues. So the patch submitter needs to fix them anyway.
> 
> I cannot comment on that point (committers are free to pick up things
> the way they want), but just using git commands to apply a patch
> should not be an obstacle for a review if a patch can be easily
> applied as long as they roughly respect GNU's diff format.

My point is the coding standard. Having trailing whitespace is against our
coding standard, and committers should not accept such code, I believe.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Kyotaro HORIGUCHI
Date:
At Thu, 30 Mar 2017 11:12:56 +1100, Venkata B Nagothi <nag1010@gmail.com> wrote in
<CAEyp7J9ZuuGc_Ndgw-rQF5oWUPH+p4=p7XE-B5cC3QRr6SUT7w@mail.gmail.com>
> On Thu, Mar 30, 2017 at 10:55 AM, Michael Paquier <michael.paquier@gmail.com
> > wrote:
> 
> > On Thu, Mar 30, 2017 at 8:49 AM, Venkata B Nagothi <nag1010@gmail.com>
> > wrote:
> > > On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI
> > > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > I tried applying this patch to latest master, it is not getting applied
> > >
> > > [dba@buildhost postgresql]$ git apply
> > > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
> > > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
> > > trailing whitespace.
> > > /*
> > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
> > > trailing whitespace.
> > >  * This variable corresponds to restart_lsn in pg_replication_slots for a
...
> > git apply and git am can be very picky sometimes, so you may want to
> > fallback to patch -p1 if things don't work. In this case it does.
> >
> 
> patch -p1 seems to be working. Thanks !

That's quite strange. The patch I sent doesn't contain trailing spaces at all,
and the cited lines don't seem to contain them. It applied cleanly with
"git am" for me.

The file restored from the attachment of the received mail doesn't contain
them either.

The original file contains the following:

0002440  66  6f  72  20  61  0a  2b  20  2a  20  70  68  79  73  69  63
          f   o   r       a  \n   +       *       p   h   y   s   i   c

The corresponding part of the file restored from the mail on Windows is the following:

0002460  63  61  74  69  6f  6e  5f  73  6c  6f  74  73  20  66  6f  72
          c   a   t   i   o   n   _   s   l   o   t   s       f   o   r
0002500  20  61  0d  0a  2b  20  2a  20  70  68  79  73  69  63  61  6c
              a  \r  \n   +       *       p   h   y   s   i   c   a   l

Neither contains a space at the end of a line. How did you retrieve the patch
from the mail?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Venkata B Nagothi
Date:

On Thu, Mar 30, 2017 at 3:51 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
At Thu, 30 Mar 2017 11:12:56 +1100, Venkata B Nagothi <nag1010@gmail.com> wrote in <CAEyp7J9ZuuGc_Ndgw-rQF5oWUPH+p4=p7XE-B5cC3QRr6SUT7w@mail.gmail.com>
> On Thu, Mar 30, 2017 at 10:55 AM, Michael Paquier <michael.paquier@gmail.com
> > wrote:
>
> > On Thu, Mar 30, 2017 at 8:49 AM, Venkata B Nagothi <nag1010@gmail.com>
> > wrote:
> > > On Tue, Mar 28, 2017 at 5:51 PM, Kyotaro HORIGUCHI
> > > <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > I tried applying this patch to latest master, it is not getting applied
> > >
> > > [dba@buildhost postgresql]$ git apply
> > > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
> > > /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/
> > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:28:
> > > trailing whitespace.
> > > /*
> > 0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch:29:
> > > trailing whitespace.
> > >  * This variable corresponds to restart_lsn in pg_replication_slots for a
...
> > git apply and git am can be very picky sometimes, so you may want to
> > fallback to patch -p1 if things don't work. In this case it does.
> >
>
> patch -p1 seems to be working. Thanks !

That's quite strange. The patch I sent doesn't contain trailing spaces at all,
and the cited lines don't seem to contain them. It applied cleanly with
"git am" for me.

The file restored from the attachment of the received mail doesn't contain
them either.

The original files contains the following,

0002440  66  6f  72  20  61  0a  2b  20  2a  20  70  68  79  73  69  63
          f   o   r       a  \n   +       *       p   h   y   s   i   c

The corresponding part of the file restored from mail on Windows
is the following,
0002460  63  61  74  69  6f  6e  5f  73  6c  6f  74  73  20  66  6f  72
          c   a   t   i   o   n   _   s   l   o   t   s       f   o   r
0002500  20  61  0d  0a  2b  20  2a  20  70  68  79  73  69  63  61  6c
              a  \r  \n   +       *       p   h   y   s   i   c   a   l

Neither contains a space at the end of a line. How did you retrieve the patch
from the mail?

Yes, I downloaded it from the email on Windows, copied it across to Linux, and ran "git apply".

Regards,

Venkata B N
Database Consultant

Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Kyotaro HORIGUCHI
Date:
Hello,

At Thu, 30 Mar 2017 15:59:14 +1100, Venkata B Nagothi <nag1010@gmail.com> wrote in
<CAEyp7J-TxoLZGh2qioA9nJJepiXUvvsAWrk=BxwU0J05X_j9_A@mail.gmail.com>
> Yes, downloaded from the email on Windows and copied across to Linux and
> did "git apply".

The same works for me, but --keep-cr gave me the same result as you.

> $ git am --keep-cr ~/work/patches/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
> Applying: Fix a bug of physical replication slot.
> .git/rebase-apply/patch:13: trailing whitespace.
> /*

https://git-scm.com/docs/git-am

| --[no-]keep-cr
| 
| With --keep-cr, call git mailsplit (see git-mailsplit[1]) with
| the same option, to prevent it from stripping CR at the end of
| lines. am.keepcr configuration variable can be used to specify
| the default behaviour. --no-keep-cr is useful to override
| am.keepcr.

I don't know why it preserves CRs only for those lines, but anyway, don't you
have am.keepcr in your configuration?

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center




Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Venkata B Nagothi
Date:

On Thu, Mar 30, 2017 at 4:46 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello,

At Thu, 30 Mar 2017 15:59:14 +1100, Venkata B Nagothi <nag1010@gmail.com> wrote in <CAEyp7J-TxoLZGh2qioA9nJJepiXUvvsAWrk=BxwU0J05X_j9_A@mail.gmail.com>
> Yes, downloaded from the email on Windows and copied across to Linux and
> did "git apply".

The same works for me. But --keep-cr gave me the same result with
you.

> $ git am --keep-cr ~/work/patches/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
> Applying: Fix a bug of physical replication slot.
> .git/rebase-apply/patch:13: trailing whitespace.
> /*

for me too -

[dba@buildhost postgresql]$ git am --keep-cr /data/postgresql-patches/9.5-ReplicationSlots-Bug-Patch/0001-Fix-a-bug-of-physical-replication-slot_a6f22e8.patch
Applying: Fix a bug of physical replication slot.
/data/PostgreSQL-GIT-Repo/postgresql/.git/rebase-apply/patch:13: trailing whitespace.
/*
/data/PostgreSQL-GIT-Repo/postgresql/.git/rebase-apply/patch:14: trailing whitespace.
 * This variable corresponds to restart_lsn in pg_replication_slots for a
/data/PostgreSQL-GIT-Repo/postgresql/.git/rebase-apply/patch:15: trailing whitespace.
 * physical slot. This has a valid value only when it differs from the current
/data/PostgreSQL-GIT-Repo/postgresql/.git/rebase-apply/patch:16: trailing whitespace.
 * flush pointer.
/data/PostgreSQL-GIT-Repo/postgresql/.git/rebase-apply/patch:17: trailing whitespace.
 */
error: patch failed: src/backend/replication/walsender.c:210
error: src/backend/replication/walsender.c: patch does not apply
Patch failed at 0001 Fix a bug of physical replication slot.
The copy of the patch that failed is found in:
   /data/PostgreSQL-GIT-Repo/postgresql/.git/rebase-apply/patch
When you have resolved this problem, run "git am --resolved".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
 

https://git-scm.com/docs/git-am

| --[no-]keep-cr
|
| With --keep-cr, call git mailsplit (see git-mailsplit[1]) with
| the same option, to prevent it from stripping CR at the end of
| lines. am.keepcr configuration variable can be used to specify
| the default behaviour. --no-keep-cr is useful to override
| am.keepcr.

I don't know why it preserves CRs only for the lines, but anyway,
don't you have am.keepcr in you configuration?

Maybe. I do not think I have am.keepcr in my configuration, but I am not 100% sure of it.

I only did "git apply.." which produced white space errors.

Regards,

Venkata B N
Database Consultant

Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Venkata B Nagothi
Date:

On Fri, Mar 17, 2017 at 6:48 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello,

At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi <nag1010@gmail.com> wrote in <CAEyp7J-4MmVwGoZSwvaSULZC80JDD_tL-9KsNiqF17+bNqiSBg@mail.gmail.com>
> On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > I managed to reproduce this. A little tweak as the first patch
> > lets the standby to suicide as soon as walreceiver sees a
> > contrecord at the beginning of a segment.
> >
> > - M(aster): createdb as a master with wal_keep_segments = 0
> >             (default), min_log_messages = debug2
> > - M: Create a physical repslot.
> > - S(tandby): Setup a standby database.
> > - S: Edit recovery.conf to use the replication slot above then
> >      start it.
> > - S: touch /tmp/hoge
> > - M: Run pgbench ...
> > - S: After a while, the standby stops.
> >   > LOG:  #################### STOP THE SERVER
> >
> > - M: Stop pgbench.
> > - M: Do 'checkpoint;' twice.
> > - S: rm /tmp/hoge
> > - S: Fails to catch up with the following error.
> >
> >   > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL
> > segment 00000001000000000000002B has already been removed
> >
> >
> I have been testing / reviewing the latest patch
> "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> need some more clarification on this.
>
> Before applying the patch, I tried re-producing the above error -
>
> - I had master->standby in streaming replication
> - Took the backup of master
>    - with a low max_wal_size and wal_keep_segments = 0
> - Configured standby with recovery.conf
> - Created replication slot on master
> - Configured the replication slot on standby and started the standby

I suppose the "configure" means primary_slot_name in recovery.conf.

> - I got the below error
>
>    >> 2017-03-10 11:58:15.704 AEDT [478] LOG:  invalid record length at
> 0/F2000140: wanted 24, got 0
>    >> 2017-03-10 11:58:15.706 AEDT [481] LOG:  started streaming WAL from
> primary at 0/F2000000 on timeline 1
>    >> 2017-03-10 11:58:15.706 AEDT [481] FATAL:  could not receive data
> from WAL stream: ERROR:  requested WAL segment 0000000100000000000000F2 has
> already been removed

Maybe you created the master slot in non-reserve (default) mode and left a
pause of a few minutes after taking the backup and before starting the
standby. In that case the master slot doesn't keep WAL segments until the
standby connects, so a couple of checkpoints can blow away the first segment
required by the standby. This is quite reasonable behavior. The following
steps make this more certain to happen.

> - Took the backup of master
>    - with a low max_wal_size = 2 and wal_keep_segments = 0
> - Configured standby with recovery.conf
> - Created replication slot on master
+ - SELECT pg_switch_wal(); on master twice.
+ - checkpoint; on master twice.
> - Configured the replication slot on standby and started the standby

Creating the slot with the following command will save it.

=# select pg_create_physical_replication_slot('s1', true);

I did a test again by applying the patch, and I am not sure if the patch is doing the right thing.

Here is test case -

- I ran pgbench
- I took the backup of the master first

- Below are the WALs on master after the stop backup -

postgres=# select pg_stop_backup();

NOTICE:  WAL archiving is not enabled; you must ensure that all required WAL segments are copied through other means to complete the backup
 pg_stop_backup
----------------
 0/8C000130
(1 row)

postgres=# \q
[dba@buildhost data]$ ls -ltrh pgdata-10dev-prsb-1/pg_wal/
total 65M
drwx------. 2 dba dba 4.0K Mar 31 09:36 archive_status
-rw-------. 1 dba dba  16M Mar 31 11:09 00000001000000000000008E
-rw-------. 1 dba dba  16M Mar 31 11:17 00000001000000000000008F
-rw-------. 1 dba dba  16M Mar 31 11:18 00000001000000000000008C
-rw-------. 1 dba dba  16M Mar 31 11:18 00000001000000000000008D

- After the backup, I created the physical replication slot
  
postgres=# select pg_create_physical_replication_slot('repslot',true);

 pg_create_physical_replication_slot
-------------------------------------
 (repslot,0/8D000028)
(1 row)

postgres=# select pg_walfile_name('0/8D000028');

     pg_walfile_name
---------------------------------------
 00000001000000000000008D
(1 row)

Here, when the standby starts, it asks for the file 00000001000000000000008C, which is the first file the standby needs. Since I applied your patch, I am assuming that the file 00000001000000000000008C should also be retained rather than removed - correct?
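As an aside, a quick way to see which segment the slot actually protects at
this point is a query like the following (just a sketch, using the 'repslot'
slot created above):

=# select slot_name, restart_lsn, pg_walfile_name(restart_lsn)
     from pg_replication_slots where slot_name = 'repslot';

If this reports 00000001000000000000008D rather than ...8C, the slot never
promised to keep 8C in the first place.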

- I started the standby and the below error occurs

>> 2017-03-31 11:26:01.288 AEDT [17475] LOG:  invalid record length at 0/8C000108: wanted 24, got 0
>> 2017-03-31 11:26:01.291 AEDT [17486] LOG:  started streaming WAL from primary at 0/8C000000 on timeline 1
>> 2017-03-31 11:26:01.291 AEDT [17486] FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000008C has already been removed

> and i could notice that the file "0000000100000000000000F2" was removed
> from the master. This can be easily re-produced and this occurs
> irrespective of configuring replication slots.
>
> As long as the file "0000000100000000000000F2" is available on the master,
> standby continues to stream WALs without any issues.
...
> If the scenario i created to reproduce the error is correct, then, applying
> the patch is not making a difference.

Yes, the patch is not for saving this case. The patch saves the
case where the previous segment to the first required segment by
standby was removed and it contains the first part of a record
continues to the first required segment. On the other hand this
case is that the segment at the start point of standby is just
removed.

Which means the file 00000001000000000000008C must be retained, as it is the first file the standby is looking for - correct?

Regards,
Venkata Balaji N

Database Consultant

Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Kyotaro HORIGUCHI
Date:
Thank you for having a look at this.

# I removed -bugs in CC:.

At Fri, 31 Mar 2017 13:40:00 +1100, Venkata B Nagothi <nag1010@gmail.com> wrote in
<CAEyp7J-BmsMqCVO7zX5vpEoEuRiM+KVPZPJzodUqOdOTZsLpqQ@mail.gmail.com>
> On Fri, Mar 17, 2017 at 6:48 PM, Kyotaro HORIGUCHI <
> horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> 
> > Hello,
> >
> > At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi <nag1010@gmail.com>
> > wrote in <CAEyp7J-4MmVwGoZSwvaSULZC80JDD_tL-9KsNiqF17+bNqiSBg@mail.
> > gmail.com>
> > > On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> > > horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > > I managed to reproduce this. A little tweak as the first patch
> > > > lets the standby to suicide as soon as walreceiver sees a
> > > > contrecord at the beginning of a segment.
> > > >
> > > > - M(aster): createdb as a master with wal_keep_segments = 0
> > > >             (default), min_log_messages = debug2
> > > > - M: Create a physical repslot.
> > > > - S(tandby): Setup a standby database.
> > > > - S: Edit recovery.conf to use the replication slot above then
> > > >      start it.
> > > > - S: touch /tmp/hoge
> > > > - M: Run pgbench ...
> > > > - S: After a while, the standby stops.
> > > >   > LOG:  #################### STOP THE SERVER
> > > >
> > > > - M: Stop pgbench.
> > > > - M: Do 'checkpoint;' twice.
> > > > - S: rm /tmp/hoge
> > > > - S: Fails to catch up with the following error.
> > > >
> > > >   > FATAL:  could not receive data from WAL stream: ERROR:  requested
> > WAL
> > > > segment 00000001000000000000002B has already been removed
> > > >
> > > >
> > > I have been testing / reviewing the latest patch
> > > "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> > > need some more clarification on this.
> > >
> > > Before applying the patch, I tried re-producing the above error -
> > >
> > > - I had master->standby in streaming replication
> > > - Took the backup of master
> > >    - with a low max_wal_size and wal_keep_segments = 0
> > > - Configured standby with recovery.conf
> > > - Created replication slot on master
> > > - Configured the replication slot on standby and started the standby
> >
> > I suppose the "configure" means primary_slot_name in recovery.conf.
> >
> > > - I got the below error
> > >
> > >    >> 2017-03-10 11:58:15.704 AEDT [478] LOG:  invalid record length at
> > > 0/F2000140: wanted 24, got 0
> > >    >> 2017-03-10 11:58:15.706 AEDT [481] LOG:  started streaming WAL from
> > > primary at 0/F2000000 on timeline 1
> > >    >> 2017-03-10 11:58:15.706 AEDT [481] FATAL:  could not receive data
> > > from WAL stream: ERROR:  requested WAL segment 0000000100000000000000F2
> > has
> > > already been removed
> >
> > Maybe you created the master slot with non-reserve (default) mode
> > and put a some-minites pause after making the backup and before
> > starting the standby. For the case the master slot doesn't keep
> > WAL segments unless the standby connects so a couple of
> > checkpoints can blow away the first segment required by the
> > standby. This is quite reasonable behavior. The following steps
> > makes this more sure.
> >
> > > - Took the backup of master
> > >    - with a low max_wal_size = 2 and wal_keep_segments = 0
> > > - Configured standby with recovery.conf
> > > - Created replication slot on master
> > + - SELECT pg_switch_wal(); on master twice.
> > + - checkpoint; on master twice.
> > > - Configured the replication slot on standby and started the standby
> >
> > Creating the slot with the following command will save it.
> >
> > =# select pg_create_physical_replication_slot('s1', true);
> >
> 
> I did a test again, by applying the patch and I am not sure if the patch is
> doing the right thing ?

This situation is a bit difficult to make happen.

> Here is test case -
> 
> - I ran pgbench
> - I took the backup of the master first
> 
> - Below are the WALs on master after the stop backup -
> 
> postgres=# select pg_stop_backup();
> 
> NOTICE:  WAL archiving is not enabled; you must ensure that all required
> WAL segments are copied through other means to complete the backup
>  pg_stop_backup
> ----------------
>  0/8C000130
> (1 row)
> 
> postgres=# \q
> [dba@buildhost data]$ ls -ltrh pgdata-10dev-prsb-1/pg_wal/
> total 65M
> drwx------. 2 dba dba 4.0K Mar 31 09:36 archive_status
> -rw-------. 1 dba dba  16M Mar 31 11:09 00000001000000000000008E
> -rw-------. 1 dba dba  16M Mar 31 11:17 00000001000000000000008F
> -rw-------. 1 dba dba  16M Mar 31 11:18 00000001000000000000008C
> -rw-------. 1 dba dba  16M Mar 31 11:18 00000001000000000000008D
> 
> - After the backup, i created the physical replication slot
> 
> 
> postgres=# select pg_create_physical_replication_slot('repslot',true);
> 
>  pg_create_physical_replication_slot
> -------------------------------------
>  (repslot,0/8D000028)
> (1 row)

At this point, segment 8C is not protected from removal. It's too
late if the first record in 8D is continued from 8C.

> postgres=# select pg_walfile_name('0/8D000028');
> 
>      pg_walfile_name
> ---------------------------------------
>  00000001000000000000008D
> (1 row)
> 
> Here, When you start the standby, it would ask for the file
> 00000001000000000000008C, which is the first file needed for the standby
> and since i applied your patch, i am assuming that, the file
> 00000001000000000000008C should also be retained without being removed -
> correct ?

Yes, if the repslot were still staying in 8C, but that is not the case here.

> - I started the standby and the below error occurs
> 
> >> 2017-03-31 11:26:01.288 AEDT [17475] LOG:  invalid record length at
> 0/8C000108: wanted 24, got 0
> >> 2017-03-31 11:26:01.291 AEDT [17486] LOG:  started streaming WAL from
> primary at 0/8C000000 on timeline 1
> >> 2017-03-31 11:26:01.291 AEDT [17486] FATAL:  could not receive data from
> WAL stream: ERROR:  requested WAL segment 00000001000000000000008C has
> already been removed

LoL. You hit this in a single shot! However, this case cannot be saved by the
patch, since the repslot is already in 8D. In other words, a newly created
slot doesn't protect segments older than its position at the time of
creation. The existence of the required WAL segments is not guaranteed
without a repslot anyway.
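For what it's worth, a sketch of an ordering that avoids this particular trap
(using the slot name 'repslot' from your test) is to reserve WAL before taking
the base backup:

=# select pg_create_physical_replication_slot('repslot', true);
-- take the base backup only after this point, so that the segment holding
-- the backup start location is already covered by the slot's restart_lsn

That is a different problem from the one the patch addresses, which is a
record continuing across a segment boundary.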

> > and i could notice that the file "0000000100000000000000F2" was removed
> > > from the master. This can be easily re-produced and this occurs
> > > irrespective of configuring replication slots.
> > >
> > > As long as the file "0000000100000000000000F2" is available on the
> > master,
> > > standby continues to stream WALs without any issues.
> > ...
> > > If the scenario i created to reproduce the error is correct, then,
> > applying
> > > the patch is not making a difference.
> >
> > Yes, the patch is not for saving this case. The patch saves the
> > case where the previous segment to the first required segment by
> > standby was removed and it contains the first part of a record
> > continues to the first required segment. On the other hand this
> > case is that the segment at the start point of standby is just
> > removed.
> >
> 
> Which means, the file 00000001000000000000008C must be retained as it is
> the first file standby is looking for - correct ?

Yes, but only if the repslot has been successfully connected to at least
once. So I attached an additional patch to effectively cause the problematic
situation: it makes the standby kill itself as soon as the condition for this
bug arises.

I wrote in <20170117.193645.160386781.horiguchi.kyotaro@lab.ntt.co.jp>: 
> I managed to reproduce this. A little tweak as the first patch
> lets the standby to suicide as soon as walreceiver sees a
> contrecord at the beginning of a segment.
> 
> - M(aster): createdb as a master with wal_keep_segments = 0
>             (default), min_log_messages = debug2
> - M: Create a physical repslot.
> - S(tandby): Setup a standby database.
> - S: Edit recovery.conf to use the replication slot above then
>      start it.
> - S: touch /tmp/hoge
> - M: Run pgbench ...
> - S: After a while, the standby stops.
>   > LOG:  #################### STOP THE SERVER

At this point, pg_replication_slots.restart_lsn should be in the segment
previous to the one containing the LSN shown in the last
"LOG: ####.. CHECK AT" line. For example, in the case below:

> LOG:  ############# CHECK AT 5000000 : 1
> LOG:  #################### STOP THE SERVER

The master responds to the query "select restart_lsn from
pg_replication_slots" with '0/4F......' with this patch but
'0/50......' without this patch.

> - M: Stop pgbench.
> - M: Do 'checkpoint;' twice.
> - S: rm /tmp/hoge
> - S: Fails to catch up with the following error.
> 
>   > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000002B has already been removed
 

One other point of this patch is that the repslot *should not* retain
needless segments. But this doesn't seem externally confirmable...
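One rough, non-authoritative way to eyeball this from SQL (assuming a server
new enough to have pg_ls_waldir(), which is in the version 10 tree) is to
compare the oldest segment the slot still needs with the oldest segment
actually kept in pg_wal:

=# select pg_walfile_name(restart_lsn) as oldest_needed
     from pg_replication_slots where slot_name = 'repslot';
=# select min(name) as oldest_kept
     from pg_ls_waldir() where name ~ '^[0-9A-F]{24}$';

If oldest_kept stays far behind oldest_needed across checkpoints, the slot is
retaining more than it has to.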

regards,


diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 771ac30..f1de180 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -986,6 +986,29 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
             recvFile = XLogFileInit(recvSegNo, &use_existent, true);
             recvFileTLI = ThisTimeLineID;
             recvOff = 0;
+
+            if ((recptr & 0xffffffL) == 0)
+            {
+                XLogPageHeader ph = (XLogPageHeader) buf;
+                Assert(nbytes >= sizeof(SizeOfXLogShortPHD));
+
+                elog(LOG, "############# CHECK AT %lX : %d",
+                     recptr, (ph->xlp_info & XLP_FIRST_IS_CONTRECORD) != 0);
+                if (ph->xlp_info & XLP_FIRST_IS_CONTRECORD)
+                {
+                    struct stat sbuf;
+                    if (stat("/tmp/hoge", &sbuf) == 0)
+                    {
+                        elog(LOG, "#################### STOP THE SERVER");
+                        system("pg_ctl stop -m f -W");
+                        while (1)
+                        {
+                            ProcessWalRcvInterrupts();
+                            sleep(1);
+                        }
+                    }
+                }
+            }
         }
 
         /* Calculate the start offset of the received logs */

Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Venkata B Nagothi
Date:

On Fri, Mar 31, 2017 at 4:05 PM, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Thank you having a look on this.

# I removed -bugs in CC:.

At Fri, 31 Mar 2017 13:40:00 +1100, Venkata B Nagothi <nag1010@gmail.com> wrote in <CAEyp7J-BmsMqCVO7zX5vpEoEuRiM+KVPZPJzodUqOdOTZsLpqQ@mail.gmail.com>
> On Fri, Mar 17, 2017 at 6:48 PM, Kyotaro HORIGUCHI <
> horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> > Hello,
> >
> > At Mon, 13 Mar 2017 11:06:00 +1100, Venkata B Nagothi <nag1010@gmail.com>
> > wrote in <CAEyp7J-4MmVwGoZSwvaSULZC80JDD_tL-9KsNiqF17+bNqiSBg@mail.
> > gmail.com>
> > > On Tue, Jan 17, 2017 at 9:36 PM, Kyotaro HORIGUCHI <
> > > horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > > > I managed to reproduce this. A little tweak as the first patch
> > > > lets the standby to suicide as soon as walreceiver sees a
> > > > contrecord at the beginning of a segment.
> > > >
> > > > - M(aster): createdb as a master with wal_keep_segments = 0
> > > >             (default), min_log_messages = debug2
> > > > - M: Create a physical repslot.
> > > > - S(tandby): Setup a standby database.
> > > > - S: Edit recovery.conf to use the replication slot above then
> > > >      start it.
> > > > - S: touch /tmp/hoge
> > > > - M: Run pgbench ...
> > > > - S: After a while, the standby stops.
> > > >   > LOG:  #################### STOP THE SERVER
> > > >
> > > > - M: Stop pgbench.
> > > > - M: Do 'checkpoint;' twice.
> > > > - S: rm /tmp/hoge
> > > > - S: Fails to catch up with the following error.
> > > >
> > > >   > FATAL:  could not receive data from WAL stream: ERROR:  requested
> > WAL
> > > > segment 00000001000000000000002B has already been removed
> > > >
> > > >
> > > I have been testing / reviewing the latest patch
> > > "0001-Fix-a-bug-of-physical-replication-slot.patch" and i think, i might
> > > need some more clarification on this.
> > >
> > > Before applying the patch, I tried re-producing the above error -
> > >
> > > - I had master->standby in streaming replication
> > > - Took the backup of master
> > >    - with a low max_wal_size and wal_keep_segments = 0
> > > - Configured standby with recovery.conf
> > > - Created replication slot on master
> > > - Configured the replication slot on standby and started the standby
> >
> > I suppose the "configure" means primary_slot_name in recovery.conf.
> >
> > > - I got the below error
> > >
> > >    >> 2017-03-10 11:58:15.704 AEDT [478] LOG:  invalid record length at
> > > 0/F2000140: wanted 24, got 0
> > >    >> 2017-03-10 11:58:15.706 AEDT [481] LOG:  started streaming WAL from
> > > primary at 0/F2000000 on timeline 1
> > >    >> 2017-03-10 11:58:15.706 AEDT [481] FATAL:  could not receive data
> > > from WAL stream: ERROR:  requested WAL segment 0000000100000000000000F2
> > has
> > > already been removed
> >
> > Maybe you created the master slot with non-reserve (default) mode
> > and put a some-minites pause after making the backup and before
> > starting the standby. For the case the master slot doesn't keep
> > WAL segments unless the standby connects so a couple of
> > checkpoints can blow away the first segment required by the
> > standby. This is quite reasonable behavior. The following steps
> > makes this more sure.
> >
> > > - Took the backup of master
> > >    - with a low max_wal_size = 2 and wal_keep_segments = 0
> > > - Configured standby with recovery.conf
> > > - Created replication slot on master
> > + - SELECT pg_switch_wal(); on master twice.
> > + - checkpoint; on master twice.
> > > - Configured the replication slot on standby and started the standby
> >
> > Creating the slot with the following command will save it.
> >
> > =# select pg_create_physical_replication_slot('s1', true);
> >
>
> I did a test again, by applying the patch and I am not sure if the patch is
> doing the right thing ?

This is a bit difficult to make it happen.

Yes, it seems to be quite difficult to reproduce.
 

> Here is test case -
>
> - I ran pgbench
> - I took the backup of the master first
>
> - Below are the WALs on master after the stop backup -
>
> postgres=# select pg_stop_backup();
>
> NOTICE:  WAL archiving is not enabled; you must ensure that all required
> WAL segments are copied through other means to complete the backup
>  pg_stop_backup
> ----------------
>  0/8C000130
> (1 row)
>
> postgres=# \q
> [dba@buildhost data]$ ls -ltrh pgdata-10dev-prsb-1/pg_wal/
> total 65M
> drwx------. 2 dba dba 4.0K Mar 31 09:36 archive_status
> -rw-------. 1 dba dba  16M Mar 31 11:09 00000001000000000000008E
> -rw-------. 1 dba dba  16M Mar 31 11:17 00000001000000000000008F
> -rw-------. 1 dba dba  16M Mar 31 11:18 00000001000000000000008C
> -rw-------. 1 dba dba  16M Mar 31 11:18 00000001000000000000008D
>
> - After the backup, i created the physical replication slot
>
>
> postgres=# select pg_create_physical_replication_slot('repslot',true);
>
>  pg_create_physical_replication_slot
> -------------------------------------
>  (repslot,0/8D000028)
> (1 row)

At this point, segment 8C is not protected from removal. It's too
late if the first record in 8D is continued from 8C.

> postgres=# select pg_walfile_name('0/8D000028');
>
>      pg_walfile_name
> ---------------------------------------
>  00000001000000000000008D
> (1 row)
>
> Here, When you start the standby, it would ask for the file
> 00000001000000000000008C, which is the first file needed for the standby
> and since i applied your patch, i am assuming that, the file
> 00000001000000000000008C should also be retained without being removed -
> correct ?

Yes if repslot were staying in 8C. So this is not the case.

> - I started the standby and the below error occurs
>
> >> 2017-03-31 11:26:01.288 AEDT [17475] LOG:  invalid record length at
> 0/8C000108: wanted 24, got 0
> >> 2017-03-31 11:26:01.291 AEDT [17486] LOG:  started streaming WAL from
> primary at 0/8C000000 on timeline 1
> >> 2017-03-31 11:26:01.291 AEDT [17486] FATAL:  could not receive data from
> WAL stream: ERROR:  requested WAL segment 00000001000000000000008C has
> already been removed

LoL. You hit this in a single shot! However, this case cannot be saved by the
patch, since the repslot is already in 8D. In other words, a newly created
slot doesn't protect segments older than its position at the time of
creation. The existence of the required WAL segments is not guaranteed
without a repslot anyway.

Yeah, this can be reproduced easily.
 

> > and i could notice that the file "0000000100000000000000F2" was removed
> > > from the master. This can be easily re-produced and this occurs
> > > irrespective of configuring replication slots.
> > >
> > > As long as the file "0000000100000000000000F2" is available on the
> > master,
> > > standby continues to stream WALs without any issues.
> > ...
> > > If the scenario i created to reproduce the error is correct, then,
> > applying
> > > the patch is not making a difference.
> >
> > Yes, the patch is not for saving this case. The patch saves the
> > case where the previous segment to the first required segment by
> > standby was removed and it contains the first part of a record
> > continues to the first required segment. On the other hand this
> > case is that the segment at the start point of standby is just
> > removed.
> >
>
> Which means, the file 00000001000000000000008C must be retained as it is
> the first file standby is looking for - correct ?

Yes, but if and only if the repslot is successfully connected at
least once. So I attached an additional patch to effectively
cause the problematic situation.  The attached patch let standby
suicide when the condition for this bug is made.

I wrote in <20170117.193645.160386781.horiguchi.kyotaro@lab.ntt.co.jp>:
> I managed to reproduce this. A little tweak as the first patch
> lets the standby to suicide as soon as walreceiver sees a
> contrecord at the beginning of a segment.
>
> - M(aster): createdb as a master with wal_keep_segments = 0
>             (default), min_log_messages = debug2
> - M: Create a physical repslot.
> - S(tandby): Setup a standby database.
> - S: Edit recovery.conf to use the replication slot above then
>      start it.
> - S: touch /tmp/hoge
> - M: Run pgbench ...
> - S: After a while, the standby stops.
>   > LOG:  #################### STOP THE SERVER

At this point, pg_replication_slots.restart_lsn should be in the
previous segment of the segment for the LSN shown in the last
"LOG: ####.. CHECK AT" log. For example in the case below.

> LOG:  ############# CHECK AT 5000000 : 1
> LOG:  #################### STOP THE SERVER

The master responds to the query "select restart_lsn from
pg_replication_slots" with '0/4F......' with this patch but
'0/50......' without this patch.

> - M: Stop pgbench.
> - M: Do 'checkpoint;' twice.
> - S: rm /tmp/hoge
> - S: Fails to catch up with the following error.
>
>   > FATAL:  could not receive data from WAL stream: ERROR:  requested WAL segment 00000001000000000000002B has already been removed

One other point of this patch is that the repslot *should not* retain
needless segments. But this doesn't seem externally confirmable...

I will need to use this patch and see if I can reproduce the case.

As we are already past the commitfest, I am not sure what I should change the patch status to.

Regards,

Venkata B N
Database Consultant

Re: [BUGS] Bug in Physical Replication Slots (at least 9.5)?

From
Michael Paquier
Date:
On Mon, Apr 3, 2017 at 7:19 AM, Venkata B Nagothi <nag1010@gmail.com> wrote:
> As we are already past the commitfest, I am not sure, what should i change
> the patch status to ?

The commitfest finishes on the 7th of April. Even with the deadline passed,
there is nothing preventing work on bug fixes, so this item ought to be moved
to the next CF with the same category.
-- 
Michael



Re: Bug in Physical Replication Slots (at least 9.5)?

From
Greg Stark
Date:
On 19 January 2017 at 09:37, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
>
> Though I haven't look closer to how a modification is splitted
> into WAL records. A tuple cannot be so long. As a simple test, I
> observed rechder->xl_tot_len at the end of XLogRecordAssemble
> inserting an about 400KB not-so-compressable string into a text
> column, but I saw a series of many records with shorter than
> several thousand bytes.

I think the case to check is a commit record with many thousands of
subtransactions. I'm not sure you can fill a whole segment though.


-- 
greg


Re: Bug in Physical Replication Slots (at least 9.5)?

From
Kyotaro HORIGUCHI
Date:
Hello,

At Fri, 19 Jan 2018 11:28:58 +0000, Greg Stark <stark@mit.edu> wrote in
<CAM-w4HMU==SkHHzS6KDSrNiKU9vk2R4TG73M4FJzA-8Yui34+g@mail.gmail.com>
> On 19 January 2017 at 09:37, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> >
> > Though I haven't look closer to how a modification is splitted
> > into WAL records. A tuple cannot be so long. As a simple test, I
> > observed rechder->xl_tot_len at the end of XLogRecordAssemble
> > inserting an about 400KB not-so-compressable string into a text
> > column, but I saw a series of many records with shorter than
> > several thousand bytes.
> 
> I think the case to check is a commit record with many thousands of
> subtransactions. I'm not sure you can fill a whole segment though.

Thanks. Potentially it can: one subtransaction adds 4 bytes, so roughly 4.2M
subtransactions would fill a segment, but a transaction with 100000
subtransactions didn't end, just returning a pile of command tags.
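For anyone who wants to try, a crude sketch that produces a commit record
carrying a large number of subtransaction XIDs, using psql's \gexec (9.6 or
later; the table name below is just a throwaway):

=# create table subx_test (i int);
=# begin;
=# select format('savepoint s%s', i), format('insert into subx_test values (%s)', i)
     from generate_series(1, 100000) i \gexec
=# commit;

Each subtransaction that writes gets an XID, and all of those XIDs end up in
the commit record, so at 4 bytes apiece this reaches a few hundred kilobytes,
still well short of a 16MB segment.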

... Anyway, the current point of the discussion has, I think, moved to the
validity of taking a series of continuation records from different WAL
sources, or the acceptability of adding record-awareness to the walreceiver
side.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


