Re: Possible missing segments in archiving on standby - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Possible missing segments in archiving on standby
Date
Msg-id 20210831.163536.2047557124269492272.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: Possible missing segments in archiving on standby  (Fujii Masao <masao.fujii@oss.nttdata.com>)
Responses Re: Possible missing segments in archiving on standby  (Fujii Masao <masao.fujii@oss.nttdata.com>)
List pgsql-hackers
At Tue, 31 Aug 2021 01:54:36 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in 
> 
> 
> On 2020/06/30 16:55, Kyotaro Horiguchi wrote:
> > Hello.
> > While looking a patch, I found that a standby with archive_mode=always
> > fails to archive segments under certain conditions.
> 
> I encountered this issue, too.
> 
> 
> > 1. v1-0001-Make-sure-standby-archives-all-segments.patch:
> >    Fix for A and B.
> > 2. v1-0001-Make-sure-standby-archives-all-segments-immediate.patch:
> >    Fix for A, B and C.
> 
> You proposed two patches, but this patch should be reviewed
> preferentially
> because this addresses all the issues (i.e., A, B and C) that you
> reported?

Maybe. The point here was whether we regard C as an issue, but now I
think it is an issue.

> + * If we are starting streaming at the beginning of a segment,
> + * there may be the case where the previous segment have not been
> +             * archived yet.  Make sure it is archived.
> 
> Could you clarify why the archive notification file of the previous
> WAL segment needs to be checked?
> 
> As far as I read the code, the cause of the issue seems to be that
> XLogWalRcvWrite() exits without creating an archive notification file
> even if the current WAL segment is fully written up in the last cycle
> of
> XLogWalRcvWrite()'s loop. That is, creation of the notification file
> and WAL archiving of that completed segment will be delayed
> until any data in the next segment is received and written (by next
> call
> to XLogWalRcvWrite()). Furthermore, in that case, if walreceiver exits
> before receiving such next segment, the completed current segment
> fails to be archived as Horiguchi-san reported.

Right.  Finally such segments are archived when a future checkpoint
removes them. In that sense the patch works to just let archiving
happens faster, but on the other hand I came to think we are supposed
to archive a segment as soon as it is completed. (That is, I think C
is a problem.)

> Therefore, IMO that the simple approach to fix the issue is to create
> an archive notification file if possible at the end of
> XLogWalRcvWrite().
> I implemented this idea. Patch attached.

I'm not sure which is simpler, but it works except for B, the case of
a long-jump by a segment switch.  When a segment switch happens,
walsender sends filling zero-pages but even if walreceiver is
terminated before the segment is completed, walsender restarts from
the next segment at the next startup. Concretely like the following.

- pg_switch_wal() invoked at 6003228 (for example)
- walreceiver terminates at 6500000 (or a bit later).
- walrecever rstarts from 7000000

In this case the segment 6 is not notified even with the patch, and my
old patches works the same way. (In other words, the call to
XLogWalRcvClose() at the end of XLogWalRcvWrite doens't work for the
case as you might expect.) If we think it ok that we don't notify the
segment earlier than a future checkpoint removes it, yours or only the
last half of my one is sufficient, but do we really think so?
Furthermore, your patch or only the last half of my second patch
doesn't save the case of a crash unlike the case of a graceful
termination.


The attached files are:

v2wip-0001-Make-sure... : a rebased patch of the old second patch
repro_helper.diff       : reproducer helper patch, used by the script below.
repro.sh                : reproducer script.

(The second diff conflicts with the first patch. Since the second just
inserts a single code block, it is easily applicable manually:p)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 60de3be92c..a44d84f0a7 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -883,46 +883,32 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
     {
         int            segbytes;
 
-        if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+        /* Open the segment if not yet */
+        if (recvFile < 0)
         {
+            recvFileTLI = ThisTimeLineID;
+
             /*
-             * fsync() and close current file before we switch to next one. We
-             * would otherwise have to reopen this file to fsync it later
+             * If we are starting streaming at the beginning of a segment,
+             * there may be the case where the previous segment have not been
+             * archived yet.  Make sure it is archived.
              */
-            if (recvFile >= 0)
+            if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS && recvSegNo == 0)
             {
                 char        xlogfname[MAXFNAMELEN];
+                XLogSegNo    prevseg;
 
                 XLogWalRcvFlush(false);
+                XLByteToPrevSeg(recptr, prevseg, wal_segment_size);
+                XLogFileName(xlogfname, recvFileTLI, prevseg, wal_segment_size);
 
-                XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-
-                /*
-                 * XLOG segment files will be re-read by recovery in startup
-                 * process soon, so we don't advise the OS to release cache
-                 * pages associated with the file like XLogFileClose() does.
-                 */
-                if (close(recvFile) != 0)
-                    ereport(PANIC,
-                            (errcode_for_file_access(),
-                             errmsg("could not close log segment %s: %m",
-                                    xlogfname)));
-
-                /*
-                 * Create .done file forcibly to prevent the streamed segment
-                 * from being archived later.
-                 */
-                if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
-                    XLogArchiveForceDone(xlogfname);
-                else
-                    XLogArchiveNotify(xlogfname, true);
+                /* Mark as ".ready" if not yet */
+                XLogArchiveCheckDone(xlogfname);
             }
-            recvFile = -1;
 
             /* Create/use new log file */
             XLByteToSeg(recptr, recvSegNo, wal_segment_size);
             recvFile = XLogFileInit(recvSegNo);
-            recvFileTLI = ThisTimeLineID;
         }
 
         /* Calculate the start offset of the received logs */
@@ -963,6 +949,47 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
         buf += byteswritten;
 
         LogstreamResult.Write = recptr;
+
+        /*
+         * Close the current WAL segment if it is completed then let the file
+         * be archived if needed.
+         */
+        if (!XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+        {
+            char        xlogfname[MAXFNAMELEN];
+
+            Assert (recvFile >= 0);
+
+            /*
+             * fsync() and close current file before we switch to next one. We
+             * would otherwise have to reopen this file to fsync it later
+             */
+            XLogWalRcvFlush(false);
+
+            XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+
+            /*
+             * XLOG segment files will be re-read by recovery in startup
+             * process soon, so we don't advise the OS to release cache
+             * pages associated with the file like XLogFileClose() does.
+             */
+            if (close(recvFile) != 0)
+                ereport(PANIC,
+                        (errcode_for_file_access(),
+                         errmsg("could not close log segment %s: %m",
+                                xlogfname)));
+
+            /*
+             * Create .done file forcibly to prevent the streamed segment
+             * from being archived later.
+             */
+            if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS)
+                XLogArchiveForceDone(xlogfname);
+            else
+                XLogArchiveNotify(xlogfname, true);
+
+            recvFile = -1;
+        }
     }
 
     /* Update shared-memory status */
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 60de3be92c..396510a81d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -963,6 +963,29 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
         buf += byteswritten;
 
         LogstreamResult.Write = recptr;
+
+        {
+            /* fake oneshot SIGTERM just at segment end */
+            struct stat b;
+            char *sigfile1 = "/tmp/hoge1";
+            char *sigfile2 = "/tmp/hoge2";
+
+            if (LogstreamResult.Write % wal_segment_size == 0 &&
+                stat(sigfile1, &b) == 0)
+            {
+                unlink(sigfile1);
+                ShutdownRequestPending = true;
+                ereport(LOG,(errmsg("STOP BY trig1@%lX", LogstreamResult.Write)));
+            }
+
+            if (LogstreamResult.Write % wal_segment_size == 0x500000 &&
+                stat(sigfile2, &b) == 0)
+            {
+                unlink(sigfile2);
+                ShutdownRequestPending = true;
+                ereport(LOG,(errmsg("STOP BY trig2@%lX", LogstreamResult.Write)));
+            }
+        }
     }
 
     /* Update shared-memory status */
#! /bin/bash

ROOT=`pwd`
LOGFILE="repro.log"
PGPORT1=15432
PGPORT2=15433

PGDATA1=$ROOT/reprodata1
ARCHDIR1=$ROOT/reproarc1
PGDATA2=$ROOT/reprodata2
ARCHDIR2=$ROOT/reproarc2


function cleanup {
    echo -n "Killing servers..."
    pg_ctl -D $PGDATA1 -m i stop
    pg_ctl -D $PGDATA2 -m i stop
    echo "done."
    exit 1
}

rm -r $PGDATA1 $PGDATA2 $ARCHDIR1 $ARCHDIR2
mkdir $ARCHDIR1 $ARCHDIR2

# Create primary
echo "# Creating primary"
initdb -D $PGDATA1 &>$LOGFILE
cat >> $PGDATA1/postgresql.conf <<EOF
wal_keep_size=160
archive_mode=always
archive_command='cp %p $ARCHDIR1/%f'
EOF

# Start primary
echo "# Starting primary"
pg_ctl -D $PGDATA1 -o"-p $PGPORT1" start &>>$LOGFILE

# Create standby
echo "# Creating standby"
pg_basebackup -D $PGDATA2 -h /tmp -p $PGPORT1 &>>$LOGFILE
cat >> $PGDATA2/postgresql.conf <<EOF
archive_command='cp %p $ARCHDIR2/%f'
primary_conninfo='host=/tmp port=$PGPORT1'
EOF
touch $PGDATA2/standby.signal

trap cleanup ERR 2 3 15

# Start primary
echo "# Starting standby"
pg_ctl -D $PGDATA2 -o"-p $PGPORT2" start &>>$LOGFILE
sleep 3

echo "Start:"
echo -n "Primary location: "
psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()"
echo -n "Standby location: "
psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()"

# Delocate from boundary..
psql -p $PGPORT1 -c "create table t(); drop table t" &>>$LOGFILE
sleep 1

# TEST 1: walreceiver stops just after a segment is completed
echo "Before test 1:"
echo -n "Primary location: " 
psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()"
echo -n "Standby location: "
psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()"
echo "# test 1" >> $LOGFILE
touch /tmp/hoge1
# the number of records should be adjusted so that the LSN at "before
# end test 1" below is locaed in the last page of the segment 3.
psql -p $PGPORT1 -c "create table t(a int); insert into t (select a from generate_series(0, 259500) a); drop table t;"
&>>$LOGFILE
echo "# test 1 end" >> $LOGFILE
psql -p $PGPORT1 -c "create table t(); drop table t" &>>$LOGFILE

echo "before end test 1:"
echo -n "Primary location: " 
psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()"
echo -n "Standby location: "
psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()"

psql -p $PGPORT1 -c "select pg_switch_wal()" &>>$LOGFILE
sleep 2

echo "After test 1:"
echo -n "Primary location: "
psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()"
echo -n "Standby location: "
psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()"

psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE
psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE
sleep 2

# TEST 2: walreceiver stops while receiving filling chunks after a wal switch.
echo "# test 2" >> $LOGFILE
touch /tmp/hoge2
psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE
echo "# test 2 end" >> $LOGFILE

sleep 2

echo "After test 2:"
echo -n "Primary location: "
psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()"
echo -n "Standby location: "
psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()"

# stop servers
pg_ctl -D $PGDATA1 stop &>>$LOGFILE
pg_ctl -D $PGDATA2 stop &>>$LOGFILE

#show last several archived segments
echo "# primary archive"
ls $ARCHDIR1 | egrep '[3-9]$'
echo "# standby archive"
ls $ARCHDIR2 | egrep '[3-9]$'

pgsql-hackers by date:

Previous
From: "kuroda.hayato@fujitsu.com"
Date:
Subject: RE: Allow escape in application_name (was: [postgres_fdw] add local pid to fallback_application_name)
Next
From: Andres Freund
Date:
Subject: Re: archive status ".ready" files may be created too early