Thread: Possible missing segments in archiving on standby

Possible missing segments in archiving on standby

From
Kyotaro Horiguchi
Date:
Hello.

While looking a patch, I found that a standby with archive_mode=always
fails to archive segments under certain conditions.

A. Walreceiver is gracefully terminated just after a segment is
   finished.

B. Walreceiver is gracefully terminated while receiving filling chunks
  for a segment switch.

The two above are reprodusible (without distinction between the two)
using a helper patch. See below.

There's one more issue here.

C. Standby doesn't archive a segment until walreceiver receives any
  data for the next segment.

I'm not sure wehther we assume C as an issue.

The first attached patch fixes A and B. A side-effect of that is that
standby archives the previous segment of the streaming start
location. Concretely 00..0100..2 gets to be archived in the above case
(recovery starts at 0/3000000).  That behavior doesn't seem to be a
proble since the segment is a part of the standby's data anyway.

The second attached patch fixes all of A to C, but seems somewhat
redundant.

Any opnions and/or suggestions are welcome.


The attached files are:

1. v1-0001-Make-sure-standby-archives-all-segments.patch:
  Fix for A and B.

2. v1-0001-Make-sure-standby-archives-all-segments-immediate.patch:
  Fix for A, B and C.

3. repro.sh
  The reproducer shell script used below.

4. repro_helper.patch
  Helper patch for repro.sh for master and patch 1 above.

5. repro_helper2.patch
  Helper patch for repro.sh for patch 2 above.

=====
** REPRODUCER

The failure is reproducible with some code tweak.

1. Create a primary server with archive_mode=always then start it.
2. Create and start a standby.
3. touch /tmp/hoge

4. psql -c "create table t(); drop table t; select pg_switch_wal(); select pg_sleep(1); create table t(); drop table t;
selectpg_switch_wal();"
 

5. look into the archive directory of the standby.
   If no missing segments found in archive, repeat from 3.

The third attached shell script is a reproducer for the problem,
needing the aid of the fourth patch attached.

$ mkdir testdir
$ cd testdir
$ bash ..../repro.sh
....
After test 2:
Primary location: 0/8000310
Standby location: 0/8000310
# primary archive
000000010000000000000003
000000010000000000000004
000000010000000000000005
000000010000000000000006
000000010000000000000007
000000010000000000000008
# standby archive
000000010000000000000003
000000010000000000000005
000000010000000000000006
000000010000000000000008

The segment 4 is skipped by the issue A and 7 is skipped by the issue
B.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center

From afa907bca7db8ea6335d47bd02761f567591d553 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Tue, 30 Jun 2020 14:21:30 +0900
Subject: [PATCH v1] Make sure standby archives all segments

Standby fails to archive a segment if standby is stopped just after a
segment is finished or stopped just after a segment swtich. Make sure
that walreceiver archives all segments by rechecking at start.
---
 src/backend/replication/walreceiver.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d1ad75da87..680154365d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -938,6 +938,23 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
                 else
                     XLogArchiveNotify(xlogfname);
             }
+            else if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+            {
+                /*
+                 * If we are starting streaming at the beginning of a segment,
+                 * there may be casees where the previous segment have not been
+                 * archived yet.  Make sure it is archived.
+                 */
+                char        xlogfname[MAXFNAMELEN];
+                XLogSegNo    prevseg;
+
+                XLByteToPrevSeg(recptr, prevseg, wal_segment_size);
+                XLogFileName(xlogfname, ThisTimeLineID, prevseg,
+                             wal_segment_size);
+
+                /* Mark as ".ready" of not yet */
+                XLogArchiveCheckDone(xlogfname);
+            }
             recvFile = -1;
 
             /* Create/use new log file */
-- 
2.18.4

From 7af716134dceb3bafce421dfeaffebf1e1e3e17d Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Mon, 29 Jun 2020 16:12:01 +0900
Subject: [PATCH v1] Make sure standby archives all segments immediately

Standby may get a bit being late in archive, since walsender doesn't
archive a segment until it receives any data for the next segment, Fix
that by archiving just after a segment is finished.

Also, standby fails to archive a segment if standby is stopped just
after a segment is finished or stopped just after a segment
swtich. Make sure that walreceiver archives all segments by rechecking
at start.
---
 src/backend/replication/walreceiver.c | 82 ++++++++++++++++++---------
 1 file changed, 54 insertions(+), 28 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d1ad75da87..831718c859 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -902,49 +902,34 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
     {
         int            segbytes;
 
-        if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+        /* Open the segment if not yet */
+        if (recvFile < 0)
         {
             bool        use_existent;
 
+            recvFileTLI = ThisTimeLineID;
+
             /*
-             * fsync() and close current file before we switch to next one. We
-             * would otherwise have to reopen this file to fsync it later
+             * If we are starting streaming at the beginning of a segment,
+             * there may be the case where the previous segment have not been
+             * archived yet.  Make sure it is archived.
              */
-            if (recvFile >= 0)
+            if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS && recvSegNo == 0)
             {
                 char        xlogfname[MAXFNAMELEN];
+                XLogSegNo    prevseg;
 
-                XLogWalRcvFlush(false);
+                XLByteToPrevSeg(recptr, prevseg, wal_segment_size);
+                XLogFileName(xlogfname, recvFileTLI, prevseg, wal_segment_size);
 
-                XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-
-                /*
-                 * XLOG segment files will be re-read by recovery in startup
-                 * process soon, so we don't advise the OS to release cache
-                 * pages associated with the file like XLogFileClose() does.
-                 */
-                if (close(recvFile) != 0)
-                    ereport(PANIC,
-                            (errcode_for_file_access(),
-                             errmsg("could not close log segment %s: %m",
-                                    xlogfname)));
-
-                /*
-                 * Create .done file forcibly to prevent the streamed segment
-                 * from being archived later.
-                 */
-                if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-                    XLogArchiveForceDone(xlogfname);
-                else
-                    XLogArchiveNotify(xlogfname);
+                /* Mark as ".ready" of not yet */
+                XLogArchiveCheckDone(xlogfname);
             }
-            recvFile = -1;
 
             /* Create/use new log file */
             XLByteToSeg(recptr, recvSegNo, wal_segment_size);
             use_existent = true;
             recvFile = XLogFileInit(recvSegNo, &use_existent, true);
-            recvFileTLI = ThisTimeLineID;
         }
 
         /* Calculate the start offset of the received logs */
@@ -985,6 +970,47 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
         buf += byteswritten;
 
         LogstreamResult.Write = recptr;
+
+        /*
+         * Close the current WAL segment if it is completed then let the file
+         * be archived if needed.
+         */
+        if (!XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+        {
+            char        xlogfname[MAXFNAMELEN];
+
+            Assert (recvFile >= 0);
+
+            /*
+             * fsync() and close current file before we switch to next one. We
+             * would otherwise have to reopen this file to fsync it later
+             */
+            XLogWalRcvFlush(false);
+
+            XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+
+            /*
+             * XLOG segment files will be re-read by recovery in startup
+             * process soon, so we don't advise the OS to release cache
+             * pages associated with the file like XLogFileClose() does.
+             */
+            if (close(recvFile) != 0)
+                ereport(PANIC,
+                        (errcode_for_file_access(),
+                         errmsg("could not close log segment %s: %m",
+                                xlogfname)));
+
+            /*
+             * Create .done file forcibly to prevent the streamed segment
+             * from being archived later.
+             */
+            if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+                XLogArchiveForceDone(xlogfname);
+            else
+                XLogArchiveNotify(xlogfname);
+
+            recvFile = -1;
+        }
     }
 
     /* Update shared-memory status */
-- 
2.18.4

#! /bin/bash

ROOT=`pwd`
LOGFILE="repro.log"
PGPORT1=15432
PGPORT2=15433

PGDATA1=$ROOT/reprodata1
ARCHDIR1=$ROOT/reproarc1
PGDATA2=$ROOT/reprodata2
ARCHDIR2=$ROOT/reproarc2


function cleanup {
    echo -n "Killing servers..."
    pg_ctl -D $PGDATA1 -m i stop
    pg_ctl -D $PGDATA2 -m i stop
    echo "done."
    exit 1
}

rm -r $PGDATA1 $PGDATA2 $ARCHDIR1 $ARCHDIR2
mkdir $ARCHDIR1 $ARCHDIR2

# Create primary
echo "# Creating primary"
initdb -D $PGDATA1 &>$LOGFILE
cat >> $PGDATA1/postgresql.conf <<EOF
wal_keep_segments=10
archive_mode=always
archive_command='cp %p $ARCHDIR1/%f'
EOF

# Start primary
echo "# Starting primary"
pg_ctl -D $PGDATA1 -o"-p $PGPORT1" start &>>$LOGFILE

# Create standby
echo "# Creating standby"
pg_basebackup -D $PGDATA2 -h /tmp -p $PGPORT1 &>>$LOGFILE
cat >> $PGDATA2/postgresql.conf <<EOF
archive_command='cp %p $ARCHDIR2/%f'
primary_conninfo='host=/tmp port=$PGPORT1'
EOF
touch $PGDATA2/standby.signal

trap cleanup ERR 2 3 15

# Start primary
echo "# Starting standby"
pg_ctl -D $PGDATA2 -o"-p $PGPORT2" start &>>$LOGFILE
sleep 3

echo "Start:"
echo -n "Primary location: "
psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()"
echo -n "Standby location: "
psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()"

# Delocate from boundary..
psql -p $PGPORT1 -c "create table t(); drop table t" &>>$LOGFILE
sleep 1

# TEST 1: walreceiver stops just after a segment is completed
echo "# test 1" >> $LOGFILE
touch /tmp/hoge1
psql -p $PGPORT1 -c "create table t(a int); insert into t (select a from generate_series(0, 260000) a); drop table t;"
&>>$LOGFILE
echo "# test 1 end" >> $LOGFILE

psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE
sleep 2

echo "After test 1:"
echo -n "Primary location: "
psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()"
echo -n "Standby location: "
psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()"

psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE
psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE
sleep 2

# TEST 2: walreceiver stops while receiving filling chunks after a wal switch.
echo "# test 2" >> $LOGFILE
touch /tmp/hoge2
psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE
echo "# test 2 end" >> $LOGFILE

sleep 2

echo "After test 2:"
echo -n "Primary location: "
psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()"
echo -n "Standby location: "
psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()"

# stop servers
pg_ctl -D $PGDATA1 stop &>>$LOGFILE
pg_ctl -D $PGDATA2 stop &>>$LOGFILE

#show last three archived segments
echo "# primary archive"
ls $ARCHDIR1 | egrep '[3-9]$'
echo "# standby archive"
ls $ARCHDIR2 | egrep '[3-9]$'
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d1ad75da87..dcccef151d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -985,6 +985,29 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
         buf += byteswritten;
 
         LogstreamResult.Write = recptr;
+
+        {
+            /* fake oneshot SIGTERM just at segment end */
+            struct stat b;
+            char *sigfile1 = "/tmp/hoge1";
+            char *sigfile2 = "/tmp/hoge2";
+
+            if (LogstreamResult.Write % wal_segment_size == 0 &&
+                stat(sigfile1, &b) == 0)
+            {
+                unlink(sigfile1);
+                got_SIGTERM = true;
+                ereport(LOG,(errmsg("STOP BY trig1@%lX", LogstreamResult.Write)));
+            }
+
+            if (LogstreamResult.Write % wal_segment_size == 0x500000 &&
+                stat(sigfile2, &b) == 0)
+            {
+                unlink(sigfile2);
+                got_SIGTERM = true;
+                ereport(LOG,(errmsg("STOP BY trig2@%lX", LogstreamResult.Write)));
+            }
+        }
     }
 
     /* Update shared-memory status */
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 831718c859..b9a7c73ed7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -1011,6 +1011,29 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 
             recvFile = -1;
         }
+
+        {
+            /* fake oneshot SIGTERM just at segment end */
+            struct stat b;
+            char *sigfile1 = "/tmp/hoge1";
+            char *sigfile2 = "/tmp/hoge2";
+
+            if (LogstreamResult.Write % wal_segment_size == 0 &&
+                stat(sigfile1, &b) == 0)
+            {
+                unlink(sigfile1);
+                got_SIGTERM = true;
+                ereport(LOG,(errmsg("STOP BY trig1@%lX", LogstreamResult.Write)));
+            }
+
+            if (LogstreamResult.Write % wal_segment_size == 0x500000 &&
+                stat(sigfile2, &b) == 0)
+            {
+                unlink(sigfile2);
+                got_SIGTERM = true;
+                ereport(LOG,(errmsg("STOP BY trig2@%lX", LogstreamResult.Write)));
+            }
+        }
     }
 
     /* Update shared-memory status */

Re: Possible missing segments in archiving on standby

From
Fujii Masao
Date:

On 2020/06/30 16:55, Kyotaro Horiguchi wrote:
> Hello.
> 
> While looking a patch, I found that a standby with archive_mode=always
> fails to archive segments under certain conditions.

I encountered this issue, too.


> 1. v1-0001-Make-sure-standby-archives-all-segments.patch:
>    Fix for A and B.
> 
> 2. v1-0001-Make-sure-standby-archives-all-segments-immediate.patch:
>    Fix for A, B and C.

You proposed two patches, but this patch should be reviewed preferentially
because this addresses all the issues (i.e., A, B and C) that you reported?


+             * If we are starting streaming at the beginning of a segment,
+             * there may be the case where the previous segment have not been
+             * archived yet.  Make sure it is archived.

Could you clarify why the archive notification file of the previous
WAL segment needs to be checked?


As far as I read the code, the cause of the issue seems to be that
XLogWalRcvWrite() exits without creating an archive notification file
even if the current WAL segment is fully written up in the last cycle of
XLogWalRcvWrite()'s loop. That is, creation of the notification file
and WAL archiving of that completed segment will be delayed
until any data in the next segment is received and written (by next call
to XLogWalRcvWrite()). Furthermore, in that case, if walreceiver exits
before receiving such next segment, the completed current segment
fails to be archived as Horiguchi-san reported.

Therefore, IMO that the simple approach to fix the issue is to create
an archive notification file if possible at the end of XLogWalRcvWrite().
I implemented this idea. Patch attached.

Regards,


-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment

Re: Possible missing segments in archiving on standby

From
Kyotaro Horiguchi
Date:
At Tue, 31 Aug 2021 01:54:36 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2020/06/30 16:55, Kyotaro Horiguchi wrote:
> > Hello.
> > While looking a patch, I found that a standby with archive_mode=always
> > fails to archive segments under certain conditions.
> 
> I encountered this issue, too.
> 
> 
> > 1. v1-0001-Make-sure-standby-archives-all-segments.patch:
> >    Fix for A and B.
> > 2. v1-0001-Make-sure-standby-archives-all-segments-immediate.patch:
> >    Fix for A, B and C.
> 
> You proposed two patches, but this patch should be reviewed
> preferentially
> because this addresses all the issues (i.e., A, B and C) that you
> reported?

Maybe. The point here was whether we regard C as an issue, but now I
think it is an issue.

> + * If we are starting streaming at the beginning of a segment,
> + * there may be the case where the previous segment have not been
> +             * archived yet.  Make sure it is archived.
> 
> Could you clarify why the archive notification file of the previous
> WAL segment needs to be checked?
> 
> As far as I read the code, the cause of the issue seems to be that
> XLogWalRcvWrite() exits without creating an archive notification file
> even if the current WAL segment is fully written up in the last cycle
> of
> XLogWalRcvWrite()'s loop. That is, creation of the notification file
> and WAL archiving of that completed segment will be delayed
> until any data in the next segment is received and written (by next
> call
> to XLogWalRcvWrite()). Furthermore, in that case, if walreceiver exits
> before receiving such next segment, the completed current segment
> fails to be archived as Horiguchi-san reported.

Right.  Finally such segments are archived when a future checkpoint
removes them. In that sense the patch works to just let archiving
happens faster, but on the other hand I came to think we are supposed
to archive a segment as soon as it is completed. (That is, I think C
is a problem.)

> Therefore, IMO that the simple approach to fix the issue is to create
> an archive notification file if possible at the end of
> XLogWalRcvWrite().
> I implemented this idea. Patch attached.

I'm not sure which is simpler, but it works except for B, the case of
a long-jump by a segment switch.  When a segment switch happens,
walsender sends filling zero-pages but even if walreceiver is
terminated before the segment is completed, walsender restarts from
the next segment at the next startup. Concretely like the following.

- pg_switch_wal() invoked at 6003228 (for example)
- walreceiver terminates at 6500000 (or a bit later).
- walrecever rstarts from 7000000

In this case the segment 6 is not notified even with the patch, and my
old patches works the same way. (In other words, the call to
XLogWalRcvClose() at the end of XLogWalRcvWrite doens't work for the
case as you might expect.) If we think it ok that we don't notify the
segment earlier than a future checkpoint removes it, yours or only the
last half of my one is sufficient, but do we really think so?
Furthermore, your patch or only the last half of my second patch
doesn't save the case of a crash unlike the case of a graceful
termination.


The attached files are:

v2wip-0001-Make-sure... : a rebased patch of the old second patch
repro_helper.diff       : reproducer helper patch, used by the script below.
repro.sh                : reproducer script.

(The second diff conflicts with the first patch. Since the second just
inserts a single code block, it is easily applicable manually:p)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 60de3be92c..a44d84f0a7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -883,46 +883,32 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
     {
         int            segbytes;
 
-        if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+        /* Open the segment if not yet */
+        if (recvFile < 0)
         {
+            recvFileTLI = ThisTimeLineID;
+
             /*
-             * fsync() and close current file before we switch to next one. We
-             * would otherwise have to reopen this file to fsync it later
+             * If we are starting streaming at the beginning of a segment,
+             * there may be the case where the previous segment have not been
+             * archived yet.  Make sure it is archived.
              */
-            if (recvFile >= 0)
+            if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS && recvSegNo == 0)
             {
                 char        xlogfname[MAXFNAMELEN];
+                XLogSegNo    prevseg;
 
                 XLogWalRcvFlush(false);
+                XLByteToPrevSeg(recptr, prevseg, wal_segment_size);
+                XLogFileName(xlogfname, recvFileTLI, prevseg, wal_segment_size);
 
-                XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-
-                /*
-                 * XLOG segment files will be re-read by recovery in startup
-                 * process soon, so we don't advise the OS to release cache
-                 * pages associated with the file like XLogFileClose() does.
-                 */
-                if (close(recvFile) != 0)
-                    ereport(PANIC,
-                            (errcode_for_file_access(),
-                             errmsg("could not close log segment %s: %m",
-                                    xlogfname)));
-
-                /*
-                 * Create .done file forcibly to prevent the streamed segment
-                 * from being archived later.
-                 */
-                if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-                    XLogArchiveForceDone(xlogfname);
-                else
-                    XLogArchiveNotify(xlogfname, true);
+                /* Mark as ".ready" if not yet */
+                XLogArchiveCheckDone(xlogfname);
             }
-            recvFile = -1;
 
             /* Create/use new log file */
             XLByteToSeg(recptr, recvSegNo, wal_segment_size);
             recvFile = XLogFileInit(recvSegNo);
-            recvFileTLI = ThisTimeLineID;
         }
 
         /* Calculate the start offset of the received logs */
@@ -963,6 +949,47 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
         buf += byteswritten;
 
         LogstreamResult.Write = recptr;
+
+        /*
+         * Close the current WAL segment if it is completed then let the file
+         * be archived if needed.
+         */
+        if (!XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+        {
+            char        xlogfname[MAXFNAMELEN];
+
+            Assert (recvFile >= 0);
+
+            /*
+             * fsync() and close current file before we switch to next one. We
+             * would otherwise have to reopen this file to fsync it later
+             */
+            XLogWalRcvFlush(false);
+
+            XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+
+            /*
+             * XLOG segment files will be re-read by recovery in startup
+             * process soon, so we don't advise the OS to release cache
+             * pages associated with the file like XLogFileClose() does.
+             */
+            if (close(recvFile) != 0)
+                ereport(PANIC,
+                        (errcode_for_file_access(),
+                         errmsg("could not close log segment %s: %m",
+                                xlogfname)));
+
+            /*
+             * Create .done file forcibly to prevent the streamed segment
+             * from being archived later.
+             */
+            if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+                XLogArchiveForceDone(xlogfname);
+            else
+                XLogArchiveNotify(xlogfname, true);
+
+            recvFile = -1;
+        }
     }
 
     /* Update shared-memory status */
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 60de3be92c..396510a81d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -963,6 +963,29 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
         buf += byteswritten;
 
         LogstreamResult.Write = recptr;
+
+        {
+            /* fake oneshot SIGTERM just at segment end */
+            struct stat b;
+            char *sigfile1 = "/tmp/hoge1";
+            char *sigfile2 = "/tmp/hoge2";
+
+            if (LogstreamResult.Write % wal_segment_size == 0 &&
+                stat(sigfile1, &b) == 0)
+            {
+                unlink(sigfile1);
+                ShutdownRequestPending = true;
+                ereport(LOG,(errmsg("STOP BY trig1@%lX", LogstreamResult.Write)));
+            }
+
+            if (LogstreamResult.Write % wal_segment_size == 0x500000 &&
+                stat(sigfile2, &b) == 0)
+            {
+                unlink(sigfile2);
+                ShutdownRequestPending = true;
+                ereport(LOG,(errmsg("STOP BY trig2@%lX", LogstreamResult.Write)));
+            }
+        }
     }
 
     /* Update shared-memory status */
#! /bin/bash

ROOT=`pwd`
LOGFILE="repro.log"
PGPORT1=15432
PGPORT2=15433

PGDATA1=$ROOT/reprodata1
ARCHDIR1=$ROOT/reproarc1
PGDATA2=$ROOT/reprodata2
ARCHDIR2=$ROOT/reproarc2


function cleanup {
    echo -n "Killing servers..."
    pg_ctl -D $PGDATA1 -m i stop
    pg_ctl -D $PGDATA2 -m i stop
    echo "done."
    exit 1
}

rm -r $PGDATA1 $PGDATA2 $ARCHDIR1 $ARCHDIR2
mkdir $ARCHDIR1 $ARCHDIR2

# Create primary
echo "# Creating primary"
initdb -D $PGDATA1 &>$LOGFILE
cat >> $PGDATA1/postgresql.conf <<EOF
wal_keep_size=160
archive_mode=always
archive_command='cp %p $ARCHDIR1/%f'
EOF

# Start primary
echo "# Starting primary"
pg_ctl -D $PGDATA1 -o"-p $PGPORT1" start &>>$LOGFILE

# Create standby
echo "# Creating standby"
pg_basebackup -D $PGDATA2 -h /tmp -p $PGPORT1 &>>$LOGFILE
cat >> $PGDATA2/postgresql.conf <<EOF
archive_command='cp %p $ARCHDIR2/%f'
primary_conninfo='host=/tmp port=$PGPORT1'
EOF
touch $PGDATA2/standby.signal

trap cleanup ERR 2 3 15

# Start primary
echo "# Starting standby"
pg_ctl -D $PGDATA2 -o"-p $PGPORT2" start &>>$LOGFILE
sleep 3

echo "Start:"
echo -n "Primary location: "
psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()"
echo -n "Standby location: "
psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()"

# Delocate from boundary..
psql -p $PGPORT1 -c "create table t(); drop table t" &>>$LOGFILE
sleep 1

# TEST 1: walreceiver stops just after a segment is completed
echo "Before test 1:"
echo -n "Primary location: " 
psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()"
echo -n "Standby location: "
psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()"
echo "# test 1" >> $LOGFILE
touch /tmp/hoge1
# the number of records should be adjusted so that the LSN at "before
# end test 1" below is locaed in the last page of the segment 3.
psql -p $PGPORT1 -c "create table t(a int); insert into t (select a from generate_series(0, 259500) a); drop table t;"
&>>$LOGFILE
echo "# test 1 end" >> $LOGFILE
psql -p $PGPORT1 -c "create table t(); drop table t" &>>$LOGFILE

echo "before end test 1:"
echo -n "Primary location: " 
psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()"
echo -n "Standby location: "
psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()"

psql -p $PGPORT1 -c "select pg_switch_wal()" &>>$LOGFILE
sleep 2

echo "After test 1:"
echo -n "Primary location: "
psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()"
echo -n "Standby location: "
psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()"

psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE
psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE
sleep 2

# TEST 2: walreceiver stops while receiving filling chunks after a wal switch.
echo "# test 2" >> $LOGFILE
touch /tmp/hoge2
psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE
echo "# test 2 end" >> $LOGFILE

sleep 2

echo "After test 2:"
echo -n "Primary location: "
psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()"
echo -n "Standby location: "
psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()"

# stop servers
pg_ctl -D $PGDATA1 stop &>>$LOGFILE
pg_ctl -D $PGDATA2 stop &>>$LOGFILE

#show last several archived segments
echo "# primary archive"
ls $ARCHDIR1 | egrep '[3-9]$'
echo "# standby archive"
ls $ARCHDIR2 | egrep '[3-9]$'

Re: Possible missing segments in archiving on standby

From
Fujii Masao
Date:

On 2021/08/31 16:35, Kyotaro Horiguchi wrote:
> I'm not sure which is simpler, but it works except for B, the case of
> a long-jump by a segment switch.  When a segment switch happens,
> walsender sends filling zero-pages but even if walreceiver is
> terminated before the segment is completed, walsender restarts from
> the next segment at the next startup. Concretely like the following.
> 
> - pg_switch_wal() invoked at 6003228 (for example)
> - walreceiver terminates at 6500000 (or a bit later).
> - walrecever rstarts from 7000000
> 
> In this case the segment 6 is not notified even with the patch, and my
> old patches works the same way. (In other words, the call to
> XLogWalRcvClose() at the end of XLogWalRcvWrite doens't work for the
> case as you might expect.) If we think it ok that we don't notify the
> segment earlier than a future checkpoint removes it, yours or only the
> last half of my one is sufficient, but do we really think so?
> Furthermore, your patch or only the last half of my second patch
> doesn't save the case of a crash unlike the case of a graceful
> termination.

Thanks for the clarification!
Please let me check my understanding about the issue.

The issue happens when walreceiver exits after it receives XLOG_SWITCH record
but before receives the remaining bytes of the segment including that
XLOG_SWITCH record. In this case, the startup process tries to replay that
"half-received" segment, finds XLOG_SWITCH record in it, moves to the next
segment and then starts new walreceiver from that next segment. Therefore,
even with my patch, the segment including that XLOG_SWITCH record is not
archived soon. Is my understanding right? I agree that we should address also
this issue.

ISTM, to address the issue,  it's simpler and less fragile to make the startup
process call XLogArchiveCheckDone() or something whenever it moves
the next segment, rather than make walreceiver do that. Thought?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Possible missing segments in archiving on standby

From
Kyotaro Horiguchi
Date:
At Tue, 31 Aug 2021 23:23:27 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2021/08/31 16:35, Kyotaro Horiguchi wrote:
> > I'm not sure which is simpler, but it works except for B, the case of
> > a long-jump by a segment switch.  When a segment switch happens,
> > walsender sends filling zero-pages but even if walreceiver is
> > terminated before the segment is completed, walsender restarts from
> > the next segment at the next startup. Concretely like the following.
> > - pg_switch_wal() invoked at 6003228 (for example)
> > - walreceiver terminates at 6500000 (or a bit later).
> > - walrecever rstarts from 7000000
> > In this case the segment 6 is not notified even with the patch, and my
> > old patches works the same way. (In other words, the call to
> > XLogWalRcvClose() at the end of XLogWalRcvWrite doens't work for the
> > case as you might expect.) If we think it ok that we don't notify the
> > segment earlier than a future checkpoint removes it, yours or only the
> > last half of my one is sufficient, but do we really think so?
> > Furthermore, your patch or only the last half of my second patch
> > doesn't save the case of a crash unlike the case of a graceful
> > termination.
> 
> Thanks for the clarification!
> Please let me check my understanding about the issue.
> 
> The issue happens when walreceiver exits after it receives XLOG_SWITCH
> record
> but before receives the remaining bytes of the segment including that
> XLOG_SWITCH record. In this case, the startup process tries to replay
> that
> "half-received" segment, finds XLOG_SWITCH record in it, moves to the
> next
> segment and then starts new walreceiver from that next
> segment. Therefore,
> even with my patch, the segment including that XLOG_SWITCH record is
> not
> archived soon. Is my understanding right? I agree that we should
> address also
> this issue.

Right.

> ISTM, to address the issue, it's simpler and less fragile to make the
> startup
> process call XLogArchiveCheckDone() or something whenever it moves
> the next segment, rather than make walreceiver do that. Thought?

Putting aside the issue C, it would work as far as recovery is not
paused or delayed.  Although simply doing that means we run additional
and a bit) wasteful XLogArchiveCheckDone() in most cases, It's hard to
imagine moving the responsibility to notify a finished segment from
walsender (writer side) to startup (reader side).

In the first place A and B happens only at termination or crash of
walsender so there's no fragility in checking only the previous
segment at start of walsender.  After a bit thought I noticed that we
don't need to do that in the wal-writing loop. And I noticed that we
need to consider timeline transitions while calculating the previous
segment.  Even though missing-notification at a timeline-switch
doesn't happen unless walsender is killed hard for example by a
sigkill or a power cut, though.

So the attached is a new version of the patch to fix only A and B.

- Moved the check code out of the replication loop.

- Track timeline transition while calculating the previous segment.
  If we don't do that, we would need another means to avoid notifying
  non-existent segment instead of the correct one.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 60de3be92c..81dde27372 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -173,6 +173,7 @@ WalReceiverMain(void)
     XLogRecPtr    startpoint;
     TimeLineID    startpointTLI;
     TimeLineID    primaryTLI;
+    XLogSegNo    startsegno;
     bool        first_stream;
     WalRcvData *walrcv = WalRcv;
     TimestampTz last_recv_timestamp;
@@ -313,6 +314,32 @@ WalReceiverMain(void)
     if (sender_host)
         pfree(sender_host);
 
+    /*
+     * There's a case walreceiver terminated before notifying the last
+     * finished segment. Make sure the last finished segment is archived
+     * immediately.
+     */
+    XLByteToSeg(startpoint, startsegno, wal_segment_size);
+    if (startsegno > 1)
+    {
+        char         xlogfname[MAXFNAMELEN];
+        TimeLineID    prevsegTLI;
+        XLogRecPtr    prevsegEndRecPtr;
+        List       *tles;
+
+        /*
+         * The previous segment may be in the previous timeline.  Track
+         * timelines to find the segment on the correct timeline.
+         */
+        tles = readTimeLineHistory(startpointTLI);
+        prevsegEndRecPtr =
+            startpoint - XLogSegmentOffset(startpoint, wal_segment_size) - 1;
+        prevsegTLI = tliOfPointInHistory(prevsegEndRecPtr, tles);
+        XLogFileName(xlogfname, prevsegTLI, startsegno - 1,
+                     wal_segment_size);
+        XLogArchiveCheckDone(xlogfname);
+    }
+        
     first_stream = true;
     for (;;)
     {

Re: Possible missing segments in archiving on standby

From
Fujii Masao
Date:

On 2021/09/01 12:12, Kyotaro Horiguchi wrote:
> Putting aside the issue C, it would work as far as recovery is not
> paused or delayed.  Although simply doing that means we run additional
> and a bit) wasteful XLogArchiveCheckDone() in most cases, It's hard to
> imagine moving the responsibility to notify a finished segment from
> walsender (writer side) to startup (reader side).

You mean walreceiver, not walsender?

I was thinking to apply my latest patch, to address the issue A and C.
So walreceiver is still basically responsible to create .ready file.
Also regarding the issue B, I was thinking to make the startup process
call XLogArchiveCheckDone() or something only when it finds
XLOG_SWITCH record. Thought?


> In the first place A and B happens only at termination or crash of
> walsender so there's no fragility in checking only the previous
> segment at start of walsender.  After a bit thought I noticed that we
> don't need to do that in the wal-writing loop. And I noticed that we
> need to consider timeline transitions while calculating the previous
> segment.  Even though missing-notification at a timeline-switch
> doesn't happen unless walsender is killed hard for example by a
> sigkill or a power cut, though.

What happens if the server is promoted before that walreceiver is invoked?

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION



Re: Possible missing segments in archiving on standby

From
Kyotaro Horiguchi
Date:
At Wed, 1 Sep 2021 14:37:43 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2021/09/01 12:12, Kyotaro Horiguchi wrote:
> > Putting aside the issue C, it would work as far as recovery is not
> > paused or delayed.  Although simply doing that means we run additional
> > and a bit) wasteful XLogArchiveCheckDone() in most cases, It's hard to
> > imagine moving the responsibility to notify a finished segment from
> > walsender (writer side) to startup (reader side).
> 
> You mean walreceiver, not walsender?

Sorry, it's walreceiver.

> I was thinking to apply my latest patch, to address the issue A and C.
> So walreceiver is still basically responsible to create .ready file.

Considering the following discussion, I don't object to the patch.

> Also regarding the issue B, I was thinking to make the startup process
> call XLogArchiveCheckDone() or something only when it finds
> XLOG_SWITCH record. Thought?

Sounds workable.  I came to agree to the reader-side amendment as
below. But I might prefer to do that at every segment-switch in case
of a crash.

> What happens if the server is promoted before that walreceiver is
> invoked?

Hmmmmm.  A partial segment is not created if a server promotes just at
a segment boundary, then the previous segment won't get archived until
the next checkpoint runs.

Ok, I agree that the reader-side needs an amendment.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Possible missing segments in archiving on standby

From
Fujii Masao
Date:

On 2021/09/02 10:16, Kyotaro Horiguchi wrote:
> Ok, I agree that the reader-side needs an amendment.

Thanks for the review! Attached is the updated version of the patch.
Based on my latest patch, I changed the startup process so that
it creates an archive notification file of the streamed WAL segment
including XLOG_SWITCH record if the notification file has not been created yet.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment

Re: Possible missing segments in archiving on standby

From
Kyotaro Horiguchi
Date:
At Fri, 3 Sep 2021 02:06:45 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2021/09/02 10:16, Kyotaro Horiguchi wrote:
> > Ok, I agree that the reader-side needs an amendment.
> 
> Thanks for the review! Attached is the updated version of the patch.
> Based on my latest patch, I changed the startup process so that
> it creates an archive notification file of the streamed WAL segment
> including XLOG_SWITCH record if the notification file has not been
> created yet.

+                if (readSource == XLOG_FROM_STREAM &&
+                    record->xl_rmid == RM_XLOG_ID &&
+                    (record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)

readSource is the source at the time startup reads it and it could be
different from the source at the time the record was written. We
cannot know where the record came from there, so does the readSource
condition work as expected?  If we had some trouble streaming just
before, readSource at the time is likely to be XLOG_FROM_PG_WAL.

+                        if (XLogArchivingAlways())
+                            XLogArchiveNotify(xlogfilename, true);
+                        else
+                            XLogArchiveForceDone(xlogfilename);

The path is used both for crash and archive recovery. If we pass there
while crash recovery on a primary with archive_mode=on, the file could
be marked .done before actually archived. On the other hand when
archive_mode=always, the file could be re-marked .ready even after it
has been already archived.  Why isn't it XLogArchiveCheckDone?

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Possible missing segments in archiving on standby

From
Fujii Masao
Date:

On 2021/09/03 14:56, Kyotaro Horiguchi wrote:
> +                if (readSource == XLOG_FROM_STREAM &&
> +                    record->xl_rmid == RM_XLOG_ID &&
> +                    (record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH)
> 
> readSource is the source at the time startup reads it and it could be
> different from the source at the time the record was written. We
> cannot know where the record came from there, so does the readSource
> condition work as expected?  If we had some trouble streaming just
> before, readSource at the time is likely to be XLOG_FROM_PG_WAL.

Yes.


> +                        if (XLogArchivingAlways())
> +                            XLogArchiveNotify(xlogfilename, true);
> +                        else
> +                            XLogArchiveForceDone(xlogfilename);
> 
> The path is used both for crash and archive recovery. If we pass there
> while crash recovery on a primary with archive_mode=on, the file could
> be marked .done before actually archived. On the other hand when
> archive_mode=always, the file could be re-marked .ready even after it
> has been already archived.  Why isn't it XLogArchiveCheckDone?

Yeah, you're right. ISTM what we should do is to just call
XLogArchiveCheckDone() for the segment including XLOG_SWITCH record,
i.e., to create .ready file if the segment has no archive notification file yet
and archive_mode is set to 'always'. Even if we don't do this when we reach
XLOG_SWITCH record, subsequent restartpoints eventually will call
XLogArchiveCheckDone() for such segments.

One issue of this approach is that the WAL segment including XLOG_SWITCH
record may be archived before its previous segments are. That is,
the notification file of current segment is created when it's replayed
because it includes XLOG_SWIATCH, but the notification files of
its previous segments will be created by subsequent restartpoints
because they don't have XLOG_SWITCH. Probably we should avoid this?

If yes, one approach for this issue is to call XLogArchiveCheckDone() for
not only the segment including XLOG_SWITCH but also all the segments
older than that. Thought?


Anyway, I extracted the changes in walreceiver from the patch,
because it's self-contained and can be applied separately.
Patch attached.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment

Re: Possible missing segments in archiving on standby

From
Kyotaro Horiguchi
Date:
At Tue, 7 Sep 2021 17:03:06 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> > + if (XLogArchivingAlways())
> > + XLogArchiveNotify(xlogfilename, true);
> > +                        else
> > + XLogArchiveForceDone(xlogfilename);
> > The path is used both for crash and archive recovery. If we pass there
> > while crash recovery on a primary with archive_mode=on, the file could
> > be marked .done before actually archived. On the other hand when
> > archive_mode=always, the file could be re-marked .ready even after it
> > has been already archived.  Why isn't it XLogArchiveCheckDone?
> 
> Yeah, you're right. ISTM what we should do is to just call
> XLogArchiveCheckDone() for the segment including XLOG_SWITCH record,
> i.e., to create .ready file if the segment has no archive notification
> file yet
> and archive_mode is set to 'always'. Even if we don't do this when we
> reach
> XLOG_SWITCH record, subsequent restartpoints eventually will call
> XLogArchiveCheckDone() for such segments.
> 
> One issue of this approach is that the WAL segment including
> XLOG_SWITCH
> record may be archived before its previous segments are. That is,
> the notification file of current segment is created when it's replayed
> because it includes XLOG_SWIATCH, but the notification files of
> its previous segments will be created by subsequent restartpoints
> because they don't have XLOG_SWITCH. Probably we should avoid this?

Anyway there's no guarantee on the archive ordering. As discussed in
the nearby thread [1], newer segment is often archived earlier. I
agree that that happens mainly on busy servers, though. The archiver
is designed to handle such "gaps" and/or out-of-order segment
notifications.  We could impose a strict ordering on archiving but I
think we would take total performance than such strictness.

In short, no.

> If yes, one approach for this issue is to call XLogArchiveCheckDone()
> for
> not only the segment including XLOG_SWITCH but also all the segments
> older than that. Thought?

At least currently, recovery fetches segments by a single process and
every file is marked immediately after being filled-up, so all files
other than the latest one in pg_wal including history files should
have been marked for sure unless file system gets into a trouble. So I
think we don't need to do that even if we want the strictness.

Addition to that that takes too long time when many segments reside in
pg_wal so I think we never want to run such a scan at every segment
end that recovery passes.  If I remember correctly, the reason we
don't fix archive status at start up but at checkpoint is we avoided
extra startup time.

> Anyway, I extracted the changes in walreceiver from the patch,
> because it's self-contained and can be applied separately.
> Patch attached.

I'm not sure I like that XLogWalRcvClose hides the segment-switch
condition.  If we do that check in the function, I'd like to name the
function XLogWalRcvCloseIfSwitched or something indicates the
condition.  I'd like to invert the condition to reduce indentation,
too.

Why don't we call it just after writing data, as my first proposal
did? There's no difference in functionality between doing that and the
patch.  If we do so, recvFile>=0 is always true and that condition can
be removed but that would be optional.  Anyway, by doing that, no
longer need to call the function twice or we can even live without the
new function.

[1] https://www.postgresql.org/message-id/20210504042755.ehuaoz5blcnjw5yk%40alap3.anarazel.de

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Possible missing segments in archiving on standby

From
Fujii Masao
Date:

On 2021/09/08 10:45, Kyotaro Horiguchi wrote:
> Anyway there's no guarantee on the archive ordering. As discussed in
> the nearby thread [1], newer segment is often archived earlier. I
> agree that that happens mainly on busy servers, though. The archiver
> is designed to handle such "gaps" and/or out-of-order segment
> notifications.  We could impose a strict ordering on archiving but I
> think we would take total performance than such strictness.

Yes, there are other cases causing newer WAL file to be archived eariler.
The issue can happen if XLogArchiveNotify() fails to create .ready file,
for example. Fixing only the case that we're discussing here is not enough.
If *general* fix is discussed at the thread you told, it's better to
do nothing here for the issue and to just make the startup process call
XLogArchiveCheckDone() if it finds the WAL file including XLOG_SWITCH record.


> At least currently, recovery fetches segments by a single process and
> every file is marked immediately after being filled-up, so all files
> other than the latest one in pg_wal including history files should
> have been marked for sure unless file system gets into a trouble.

You can reproduce that situation easily by starting the server with
archive_mode=off, generating WAL files, sometimes running pg_switch_wal(),
causing the server to crash, and then restarting the server with
archive_mode=on. At the beginning of recovery, all the WAL files in pg_wal
don't have their archive notification files at all. Then, with the patch,
only WAL files including XLOG_SWITCH are notified for WAL archiving
during recovery. The other WAL files will be notified at the subsequent
checkpoint.


> I'm not sure I like that XLogWalRcvClose hides the segment-switch
> condition.  If we do that check in the function, I'd like to name the
> function XLogWalRcvCloseIfSwitched or something indicates the
> condition.  I'd like to invert the condition to reduce indentation,
> too.

We can move the condition-check out of the function like the attached patch.


> Why don't we call it just after writing data, as my first proposal
> did? There's no difference in functionality between doing that and the
> patch.  If we do so, recvFile>=0 is always true and that condition can
> be removed but that would be optional.  Anyway, by doing that, no
> longer need to call the function twice or we can even live without the
> new function.

I think that it's better and more robust to confirm that the currently-opened
WAL file is valid target one to write WAL *before* actually writing any data
into it.

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment

Re: Possible missing segments in archiving on standby

From
Kyotaro Horiguchi
Date:
At Wed, 8 Sep 2021 16:01:22 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2021/09/08 10:45, Kyotaro Horiguchi wrote:
> > Anyway there's no guarantee on the archive ordering. As discussed in
> > the nearby thread [1], newer segment is often archived earlier. I
> > agree that that happens mainly on busy servers, though. The archiver
> > is designed to handle such "gaps" and/or out-of-order segment
> > notifications.  We could impose a strict ordering on archiving but I
> > think we would take total performance than such strictness.
> 
> Yes, there are other cases causing newer WAL file to be archived
> eariler.
> The issue can happen if XLogArchiveNotify() fails to create .ready
> file,
> for example. Fixing only the case that we're discussing here is not
> enough.
> If *general* fix is discussed at the thread you told, it's better to
> do nothing here for the issue and to just make the startup process
> call
> XLogArchiveCheckDone() if it finds the WAL file including XLOG_SWITCH
> record.

No. The discussion taken there is not about permanently missing .ready
files, but about .ready files created out-of-order.  So I don't think
the outcome from the thread does *fix* this issue.

> > At least currently, recovery fetches segments by a single process and
> > every file is marked immediately after being filled-up, so all files
> > other than the latest one in pg_wal including history files should
> > have been marked for sure unless file system gets into a trouble.
> 
> You can reproduce that situation easily by starting the server with
> archive_mode=off, generating WAL files, sometimes running
> pg_switch_wal(),
> causing the server to crash, and then restarting the server with
> archive_mode=on. At the beginning of recovery, all the WAL files in
> pg_wal
> don't have their archive notification files at all. Then, with the
> patch,
> only WAL files including XLOG_SWITCH are notified for WAL archiving
> during recovery. The other WAL files will be notified at the
> subsequent
> checkpoint.

I don't think we want such extent of perfectness at all for the case
where some archive-related parameters are changed after a
crash. Anyway we need to take a backup after that and at least all
segments required for the backup will be properly archived.  The
segments up to the XLOG_SWITCH segment are harmless garbage, or a bit
of food for disk.

> > I'm not sure I like that XLogWalRcvClose hides the segment-switch
> > condition.  If we do that check in the function, I'd like to name the
> > function XLogWalRcvCloseIfSwitched or something indicates the
> > condition.  I'd like to invert the condition to reduce indentation,
> > too.
> 
> We can move the condition-check out of the function like the attached
> patch.

Thanks!

> > Why don't we call it just after writing data, as my first proposal
> > did? There's no difference in functionality between doing that and the
> > patch.  If we do so, recvFile>=0 is always true and that condition can
> > be removed but that would be optional.  Anyway, by doing that, no
> > longer need to call the function twice or we can even live without the
> > new function.
> 
> I think that it's better and more robust to confirm that the
> currently-opened
> WAL file is valid target one to write WAL *before* actually writing
> any data
> into it.

Sounds convincing.  Ok, I agree to that.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Possible missing segments in archiving on standby

From
Fujii Masao
Date:

On 2021/09/08 16:40, Kyotaro Horiguchi wrote:
> No. The discussion taken there is not about permanently missing .ready
> files, but about .ready files created out-of-order.  So I don't think
> the outcome from the thread does *fix* this issue.

Hmm...

> I don't think we want such extent of perfectness at all for the case
> where some archive-related parameters are changed after a
> crash. Anyway we need to take a backup after that and at least all
> segments required for the backup will be properly archived.  The
> segments up to the XLOG_SWITCH segment are harmless garbage, or a bit
> of food for disk.

So probably we reached the consensus that something like the attached patch
(XLogArchiveCheckDone_walfile_xlog_switch.patch) is enough for the corner
case where walreceiver fails to create .ready file of WAL file including
XLOG_SWITCH?

> Sounds convincing.  Ok, I agree to that.

So barring any objection, I will commit the patch
and back-patch it to all supported version.

walreceiver_notify_archive_soon_v5.patch
walreceiver_notify_archive_soon_v5_pg14-13.patch
walreceiver_notify_archive_soon_v5_pg12-11.patch
walreceiver_notify_archive_soon_v5_pg10.patch
walreceiver_notify_archive_soon_v5_pg96.patch

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment

Re: Possible missing segments in archiving on standby

From
Fujii Masao
Date:

On 2021/09/08 22:41, Fujii Masao wrote:
> So probably we reached the consensus that something like the attached patch
> (XLogArchiveCheckDone_walfile_xlog_switch.patch) is enough for the corner
> case where walreceiver fails to create .ready file of WAL file including
> XLOG_SWITCH?

I attached the patch again, just in the case.


>> Sounds convincing.  Ok, I agree to that.
> 
> So barring any objection, I will commit the patch
> and back-patch it to all supported version.

Pushed. Thanks!

Regards,

-- 
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

Attachment