Thread: Possible missing segments in archiving on standby
Hello. While looking a patch, I found that a standby with archive_mode=always fails to archive segments under certain conditions. A. Walreceiver is gracefully terminated just after a segment is finished. B. Walreceiver is gracefully terminated while receiving filling chunks for a segment switch. The two above are reprodusible (without distinction between the two) using a helper patch. See below. There's one more issue here. C. Standby doesn't archive a segment until walreceiver receives any data for the next segment. I'm not sure wehther we assume C as an issue. The first attached patch fixes A and B. A side-effect of that is that standby archives the previous segment of the streaming start location. Concretely 00..0100..2 gets to be archived in the above case (recovery starts at 0/3000000). That behavior doesn't seem to be a proble since the segment is a part of the standby's data anyway. The second attached patch fixes all of A to C, but seems somewhat redundant. Any opnions and/or suggestions are welcome. The attached files are: 1. v1-0001-Make-sure-standby-archives-all-segments.patch: Fix for A and B. 2. v1-0001-Make-sure-standby-archives-all-segments-immediate.patch: Fix for A, B and C. 3. repro.sh The reproducer shell script used below. 4. repro_helper.patch Helper patch for repro.sh for master and patch 1 above. 5. repro_helper2.patch Helper patch for repro.sh for patch 2 above. ===== ** REPRODUCER The failure is reproducible with some code tweak. 1. Create a primary server with archive_mode=always then start it. 2. Create and start a standby. 3. touch /tmp/hoge 4. psql -c "create table t(); drop table t; select pg_switch_wal(); select pg_sleep(1); create table t(); drop table t; selectpg_switch_wal();" 5. look into the archive directory of the standby. If no missing segments found in archive, repeat from 3. The third attached shell script is a reproducer for the problem, needing the aid of the fourth patch attached. $ mkdir testdir $ cd testdir $ bash ..../repro.sh .... After test 2: Primary location: 0/8000310 Standby location: 0/8000310 # primary archive 000000010000000000000003 000000010000000000000004 000000010000000000000005 000000010000000000000006 000000010000000000000007 000000010000000000000008 # standby archive 000000010000000000000003 000000010000000000000005 000000010000000000000006 000000010000000000000008 The segment 4 is skipped by the issue A and 7 is skipped by the issue B. regards. -- Kyotaro Horiguchi NTT Open Source Software Center From afa907bca7db8ea6335d47bd02761f567591d553 Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Tue, 30 Jun 2020 14:21:30 +0900 Subject: [PATCH v1] Make sure standby archives all segments Standby fails to archive a segment if standby is stopped just after a segment is finished or stopped just after a segment swtich. Make sure that walreceiver archives all segments by rechecking at start. --- src/backend/replication/walreceiver.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index d1ad75da87..680154365d 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -938,6 +938,23 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr) else XLogArchiveNotify(xlogfname); } + else if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS) + { + /* + * If we are starting streaming at the beginning of a segment, + * there may be casees where the previous segment have not been + * archived yet. Make sure it is archived. + */ + char xlogfname[MAXFNAMELEN]; + XLogSegNo prevseg; + + XLByteToPrevSeg(recptr, prevseg, wal_segment_size); + XLogFileName(xlogfname, ThisTimeLineID, prevseg, + wal_segment_size); + + /* Mark as ".ready" of not yet */ + XLogArchiveCheckDone(xlogfname); + } recvFile = -1; /* Create/use new log file */ -- 2.18.4 From 7af716134dceb3bafce421dfeaffebf1e1e3e17d Mon Sep 17 00:00:00 2001 From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com> Date: Mon, 29 Jun 2020 16:12:01 +0900 Subject: [PATCH v1] Make sure standby archives all segments immediately Standby may get a bit being late in archive, since walsender doesn't archive a segment until it receives any data for the next segment, Fix that by archiving just after a segment is finished. Also, standby fails to archive a segment if standby is stopped just after a segment is finished or stopped just after a segment swtich. Make sure that walreceiver archives all segments by rechecking at start. --- src/backend/replication/walreceiver.c | 82 ++++++++++++++++++--------- 1 file changed, 54 insertions(+), 28 deletions(-) diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index d1ad75da87..831718c859 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -902,49 +902,34 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr) { int segbytes; - if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size)) + /* Open the segment if not yet */ + if (recvFile < 0) { bool use_existent; + recvFileTLI = ThisTimeLineID; + /* - * fsync() and close current file before we switch to next one. We - * would otherwise have to reopen this file to fsync it later + * If we are starting streaming at the beginning of a segment, + * there may be the case where the previous segment have not been + * archived yet. Make sure it is archived. */ - if (recvFile >= 0) + if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS && recvSegNo == 0) { char xlogfname[MAXFNAMELEN]; + XLogSegNo prevseg; - XLogWalRcvFlush(false); + XLByteToPrevSeg(recptr, prevseg, wal_segment_size); + XLogFileName(xlogfname, recvFileTLI, prevseg, wal_segment_size); - XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size); - - /* - * XLOG segment files will be re-read by recovery in startup - * process soon, so we don't advise the OS to release cache - * pages associated with the file like XLogFileClose() does. - */ - if (close(recvFile) != 0) - ereport(PANIC, - (errcode_for_file_access(), - errmsg("could not close log segment %s: %m", - xlogfname))); - - /* - * Create .done file forcibly to prevent the streamed segment - * from being archived later. - */ - if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS) - XLogArchiveForceDone(xlogfname); - else - XLogArchiveNotify(xlogfname); + /* Mark as ".ready" of not yet */ + XLogArchiveCheckDone(xlogfname); } - recvFile = -1; /* Create/use new log file */ XLByteToSeg(recptr, recvSegNo, wal_segment_size); use_existent = true; recvFile = XLogFileInit(recvSegNo, &use_existent, true); - recvFileTLI = ThisTimeLineID; } /* Calculate the start offset of the received logs */ @@ -985,6 +970,47 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr) buf += byteswritten; LogstreamResult.Write = recptr; + + /* + * Close the current WAL segment if it is completed then let the file + * be archived if needed. + */ + if (!XLByteInSeg(recptr, recvSegNo, wal_segment_size)) + { + char xlogfname[MAXFNAMELEN]; + + Assert (recvFile >= 0); + + /* + * fsync() and close current file before we switch to next one. We + * would otherwise have to reopen this file to fsync it later + */ + XLogWalRcvFlush(false); + + XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size); + + /* + * XLOG segment files will be re-read by recovery in startup + * process soon, so we don't advise the OS to release cache + * pages associated with the file like XLogFileClose() does. + */ + if (close(recvFile) != 0) + ereport(PANIC, + (errcode_for_file_access(), + errmsg("could not close log segment %s: %m", + xlogfname))); + + /* + * Create .done file forcibly to prevent the streamed segment + * from being archived later. + */ + if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS) + XLogArchiveForceDone(xlogfname); + else + XLogArchiveNotify(xlogfname); + + recvFile = -1; + } } /* Update shared-memory status */ -- 2.18.4 #! /bin/bash ROOT=`pwd` LOGFILE="repro.log" PGPORT1=15432 PGPORT2=15433 PGDATA1=$ROOT/reprodata1 ARCHDIR1=$ROOT/reproarc1 PGDATA2=$ROOT/reprodata2 ARCHDIR2=$ROOT/reproarc2 function cleanup { echo -n "Killing servers..." pg_ctl -D $PGDATA1 -m i stop pg_ctl -D $PGDATA2 -m i stop echo "done." exit 1 } rm -r $PGDATA1 $PGDATA2 $ARCHDIR1 $ARCHDIR2 mkdir $ARCHDIR1 $ARCHDIR2 # Create primary echo "# Creating primary" initdb -D $PGDATA1 &>$LOGFILE cat >> $PGDATA1/postgresql.conf <<EOF wal_keep_segments=10 archive_mode=always archive_command='cp %p $ARCHDIR1/%f' EOF # Start primary echo "# Starting primary" pg_ctl -D $PGDATA1 -o"-p $PGPORT1" start &>>$LOGFILE # Create standby echo "# Creating standby" pg_basebackup -D $PGDATA2 -h /tmp -p $PGPORT1 &>>$LOGFILE cat >> $PGDATA2/postgresql.conf <<EOF archive_command='cp %p $ARCHDIR2/%f' primary_conninfo='host=/tmp port=$PGPORT1' EOF touch $PGDATA2/standby.signal trap cleanup ERR 2 3 15 # Start primary echo "# Starting standby" pg_ctl -D $PGDATA2 -o"-p $PGPORT2" start &>>$LOGFILE sleep 3 echo "Start:" echo -n "Primary location: " psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()" echo -n "Standby location: " psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()" # Delocate from boundary.. psql -p $PGPORT1 -c "create table t(); drop table t" &>>$LOGFILE sleep 1 # TEST 1: walreceiver stops just after a segment is completed echo "# test 1" >> $LOGFILE touch /tmp/hoge1 psql -p $PGPORT1 -c "create table t(a int); insert into t (select a from generate_series(0, 260000) a); drop table t;" &>>$LOGFILE echo "# test 1 end" >> $LOGFILE psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE sleep 2 echo "After test 1:" echo -n "Primary location: " psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()" echo -n "Standby location: " psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()" psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE sleep 2 # TEST 2: walreceiver stops while receiving filling chunks after a wal switch. echo "# test 2" >> $LOGFILE touch /tmp/hoge2 psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE echo "# test 2 end" >> $LOGFILE sleep 2 echo "After test 2:" echo -n "Primary location: " psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()" echo -n "Standby location: " psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()" # stop servers pg_ctl -D $PGDATA1 stop &>>$LOGFILE pg_ctl -D $PGDATA2 stop &>>$LOGFILE #show last three archived segments echo "# primary archive" ls $ARCHDIR1 | egrep '[3-9]$' echo "# standby archive" ls $ARCHDIR2 | egrep '[3-9]$' diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index d1ad75da87..dcccef151d 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -985,6 +985,29 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr) buf += byteswritten; LogstreamResult.Write = recptr; + + { + /* fake oneshot SIGTERM just at segment end */ + struct stat b; + char *sigfile1 = "/tmp/hoge1"; + char *sigfile2 = "/tmp/hoge2"; + + if (LogstreamResult.Write % wal_segment_size == 0 && + stat(sigfile1, &b) == 0) + { + unlink(sigfile1); + got_SIGTERM = true; + ereport(LOG,(errmsg("STOP BY trig1@%lX", LogstreamResult.Write))); + } + + if (LogstreamResult.Write % wal_segment_size == 0x500000 && + stat(sigfile2, &b) == 0) + { + unlink(sigfile2); + got_SIGTERM = true; + ereport(LOG,(errmsg("STOP BY trig2@%lX", LogstreamResult.Write))); + } + } } /* Update shared-memory status */ diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 831718c859..b9a7c73ed7 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -1011,6 +1011,29 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr) recvFile = -1; } + + { + /* fake oneshot SIGTERM just at segment end */ + struct stat b; + char *sigfile1 = "/tmp/hoge1"; + char *sigfile2 = "/tmp/hoge2"; + + if (LogstreamResult.Write % wal_segment_size == 0 && + stat(sigfile1, &b) == 0) + { + unlink(sigfile1); + got_SIGTERM = true; + ereport(LOG,(errmsg("STOP BY trig1@%lX", LogstreamResult.Write))); + } + + if (LogstreamResult.Write % wal_segment_size == 0x500000 && + stat(sigfile2, &b) == 0) + { + unlink(sigfile2); + got_SIGTERM = true; + ereport(LOG,(errmsg("STOP BY trig2@%lX", LogstreamResult.Write))); + } + } } /* Update shared-memory status */
On 2020/06/30 16:55, Kyotaro Horiguchi wrote: > Hello. > > While looking a patch, I found that a standby with archive_mode=always > fails to archive segments under certain conditions. I encountered this issue, too. > 1. v1-0001-Make-sure-standby-archives-all-segments.patch: > Fix for A and B. > > 2. v1-0001-Make-sure-standby-archives-all-segments-immediate.patch: > Fix for A, B and C. You proposed two patches, but this patch should be reviewed preferentially because this addresses all the issues (i.e., A, B and C) that you reported? + * If we are starting streaming at the beginning of a segment, + * there may be the case where the previous segment have not been + * archived yet. Make sure it is archived. Could you clarify why the archive notification file of the previous WAL segment needs to be checked? As far as I read the code, the cause of the issue seems to be that XLogWalRcvWrite() exits without creating an archive notification file even if the current WAL segment is fully written up in the last cycle of XLogWalRcvWrite()'s loop. That is, creation of the notification file and WAL archiving of that completed segment will be delayed until any data in the next segment is received and written (by next call to XLogWalRcvWrite()). Furthermore, in that case, if walreceiver exits before receiving such next segment, the completed current segment fails to be archived as Horiguchi-san reported. Therefore, IMO that the simple approach to fix the issue is to create an archive notification file if possible at the end of XLogWalRcvWrite(). I implemented this idea. Patch attached. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachment
At Tue, 31 Aug 2021 01:54:36 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2020/06/30 16:55, Kyotaro Horiguchi wrote: > > Hello. > > While looking a patch, I found that a standby with archive_mode=always > > fails to archive segments under certain conditions. > > I encountered this issue, too. > > > > 1. v1-0001-Make-sure-standby-archives-all-segments.patch: > > Fix for A and B. > > 2. v1-0001-Make-sure-standby-archives-all-segments-immediate.patch: > > Fix for A, B and C. > > You proposed two patches, but this patch should be reviewed > preferentially > because this addresses all the issues (i.e., A, B and C) that you > reported? Maybe. The point here was whether we regard C as an issue, but now I think it is an issue. > + * If we are starting streaming at the beginning of a segment, > + * there may be the case where the previous segment have not been > + * archived yet. Make sure it is archived. > > Could you clarify why the archive notification file of the previous > WAL segment needs to be checked? > > As far as I read the code, the cause of the issue seems to be that > XLogWalRcvWrite() exits without creating an archive notification file > even if the current WAL segment is fully written up in the last cycle > of > XLogWalRcvWrite()'s loop. That is, creation of the notification file > and WAL archiving of that completed segment will be delayed > until any data in the next segment is received and written (by next > call > to XLogWalRcvWrite()). Furthermore, in that case, if walreceiver exits > before receiving such next segment, the completed current segment > fails to be archived as Horiguchi-san reported. Right. Finally such segments are archived when a future checkpoint removes them. In that sense the patch works to just let archiving happens faster, but on the other hand I came to think we are supposed to archive a segment as soon as it is completed. (That is, I think C is a problem.) > Therefore, IMO that the simple approach to fix the issue is to create > an archive notification file if possible at the end of > XLogWalRcvWrite(). > I implemented this idea. Patch attached. I'm not sure which is simpler, but it works except for B, the case of a long-jump by a segment switch. When a segment switch happens, walsender sends filling zero-pages but even if walreceiver is terminated before the segment is completed, walsender restarts from the next segment at the next startup. Concretely like the following. - pg_switch_wal() invoked at 6003228 (for example) - walreceiver terminates at 6500000 (or a bit later). - walrecever rstarts from 7000000 In this case the segment 6 is not notified even with the patch, and my old patches works the same way. (In other words, the call to XLogWalRcvClose() at the end of XLogWalRcvWrite doens't work for the case as you might expect.) If we think it ok that we don't notify the segment earlier than a future checkpoint removes it, yours or only the last half of my one is sufficient, but do we really think so? Furthermore, your patch or only the last half of my second patch doesn't save the case of a crash unlike the case of a graceful termination. The attached files are: v2wip-0001-Make-sure... : a rebased patch of the old second patch repro_helper.diff : reproducer helper patch, used by the script below. repro.sh : reproducer script. (The second diff conflicts with the first patch. Since the second just inserts a single code block, it is easily applicable manually:p) regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 60de3be92c..a44d84f0a7 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -883,46 +883,32 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr) { int segbytes; - if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size)) + /* Open the segment if not yet */ + if (recvFile < 0) { + recvFileTLI = ThisTimeLineID; + /* - * fsync() and close current file before we switch to next one. We - * would otherwise have to reopen this file to fsync it later + * If we are starting streaming at the beginning of a segment, + * there may be the case where the previous segment have not been + * archived yet. Make sure it is archived. */ - if (recvFile >= 0) + if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS && recvSegNo == 0) { char xlogfname[MAXFNAMELEN]; + XLogSegNo prevseg; XLogWalRcvFlush(false); + XLByteToPrevSeg(recptr, prevseg, wal_segment_size); + XLogFileName(xlogfname, recvFileTLI, prevseg, wal_segment_size); - XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size); - - /* - * XLOG segment files will be re-read by recovery in startup - * process soon, so we don't advise the OS to release cache - * pages associated with the file like XLogFileClose() does. - */ - if (close(recvFile) != 0) - ereport(PANIC, - (errcode_for_file_access(), - errmsg("could not close log segment %s: %m", - xlogfname))); - - /* - * Create .done file forcibly to prevent the streamed segment - * from being archived later. - */ - if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS) - XLogArchiveForceDone(xlogfname); - else - XLogArchiveNotify(xlogfname, true); + /* Mark as ".ready" if not yet */ + XLogArchiveCheckDone(xlogfname); } - recvFile = -1; /* Create/use new log file */ XLByteToSeg(recptr, recvSegNo, wal_segment_size); recvFile = XLogFileInit(recvSegNo); - recvFileTLI = ThisTimeLineID; } /* Calculate the start offset of the received logs */ @@ -963,6 +949,47 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr) buf += byteswritten; LogstreamResult.Write = recptr; + + /* + * Close the current WAL segment if it is completed then let the file + * be archived if needed. + */ + if (!XLByteInSeg(recptr, recvSegNo, wal_segment_size)) + { + char xlogfname[MAXFNAMELEN]; + + Assert (recvFile >= 0); + + /* + * fsync() and close current file before we switch to next one. We + * would otherwise have to reopen this file to fsync it later + */ + XLogWalRcvFlush(false); + + XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size); + + /* + * XLOG segment files will be re-read by recovery in startup + * process soon, so we don't advise the OS to release cache + * pages associated with the file like XLogFileClose() does. + */ + if (close(recvFile) != 0) + ereport(PANIC, + (errcode_for_file_access(), + errmsg("could not close log segment %s: %m", + xlogfname))); + + /* + * Create .done file forcibly to prevent the streamed segment + * from being archived later. + */ + if (XLogArchiveMode != ARCHIVE_MODE_ALWAYS) + XLogArchiveForceDone(xlogfname); + else + XLogArchiveNotify(xlogfname, true); + + recvFile = -1; + } } /* Update shared-memory status */ diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 60de3be92c..396510a81d 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -963,6 +963,29 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr) buf += byteswritten; LogstreamResult.Write = recptr; + + { + /* fake oneshot SIGTERM just at segment end */ + struct stat b; + char *sigfile1 = "/tmp/hoge1"; + char *sigfile2 = "/tmp/hoge2"; + + if (LogstreamResult.Write % wal_segment_size == 0 && + stat(sigfile1, &b) == 0) + { + unlink(sigfile1); + ShutdownRequestPending = true; + ereport(LOG,(errmsg("STOP BY trig1@%lX", LogstreamResult.Write))); + } + + if (LogstreamResult.Write % wal_segment_size == 0x500000 && + stat(sigfile2, &b) == 0) + { + unlink(sigfile2); + ShutdownRequestPending = true; + ereport(LOG,(errmsg("STOP BY trig2@%lX", LogstreamResult.Write))); + } + } } /* Update shared-memory status */ #! /bin/bash ROOT=`pwd` LOGFILE="repro.log" PGPORT1=15432 PGPORT2=15433 PGDATA1=$ROOT/reprodata1 ARCHDIR1=$ROOT/reproarc1 PGDATA2=$ROOT/reprodata2 ARCHDIR2=$ROOT/reproarc2 function cleanup { echo -n "Killing servers..." pg_ctl -D $PGDATA1 -m i stop pg_ctl -D $PGDATA2 -m i stop echo "done." exit 1 } rm -r $PGDATA1 $PGDATA2 $ARCHDIR1 $ARCHDIR2 mkdir $ARCHDIR1 $ARCHDIR2 # Create primary echo "# Creating primary" initdb -D $PGDATA1 &>$LOGFILE cat >> $PGDATA1/postgresql.conf <<EOF wal_keep_size=160 archive_mode=always archive_command='cp %p $ARCHDIR1/%f' EOF # Start primary echo "# Starting primary" pg_ctl -D $PGDATA1 -o"-p $PGPORT1" start &>>$LOGFILE # Create standby echo "# Creating standby" pg_basebackup -D $PGDATA2 -h /tmp -p $PGPORT1 &>>$LOGFILE cat >> $PGDATA2/postgresql.conf <<EOF archive_command='cp %p $ARCHDIR2/%f' primary_conninfo='host=/tmp port=$PGPORT1' EOF touch $PGDATA2/standby.signal trap cleanup ERR 2 3 15 # Start primary echo "# Starting standby" pg_ctl -D $PGDATA2 -o"-p $PGPORT2" start &>>$LOGFILE sleep 3 echo "Start:" echo -n "Primary location: " psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()" echo -n "Standby location: " psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()" # Delocate from boundary.. psql -p $PGPORT1 -c "create table t(); drop table t" &>>$LOGFILE sleep 1 # TEST 1: walreceiver stops just after a segment is completed echo "Before test 1:" echo -n "Primary location: " psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()" echo -n "Standby location: " psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()" echo "# test 1" >> $LOGFILE touch /tmp/hoge1 # the number of records should be adjusted so that the LSN at "before # end test 1" below is locaed in the last page of the segment 3. psql -p $PGPORT1 -c "create table t(a int); insert into t (select a from generate_series(0, 259500) a); drop table t;" &>>$LOGFILE echo "# test 1 end" >> $LOGFILE psql -p $PGPORT1 -c "create table t(); drop table t" &>>$LOGFILE echo "before end test 1:" echo -n "Primary location: " psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()" echo -n "Standby location: " psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()" psql -p $PGPORT1 -c "select pg_switch_wal()" &>>$LOGFILE sleep 2 echo "After test 1:" echo -n "Primary location: " psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()" echo -n "Standby location: " psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()" psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE sleep 2 # TEST 2: walreceiver stops while receiving filling chunks after a wal switch. echo "# test 2" >> $LOGFILE touch /tmp/hoge2 psql -p $PGPORT1 -c "create table t(); drop table t; select pg_switch_wal()" &>>$LOGFILE echo "# test 2 end" >> $LOGFILE sleep 2 echo "After test 2:" echo -n "Primary location: " psql -tAp $PGPORT1 -c "select pg_current_wal_lsn()" echo -n "Standby location: " psql -tAp $PGPORT2 -c "select pg_last_wal_receive_lsn()" # stop servers pg_ctl -D $PGDATA1 stop &>>$LOGFILE pg_ctl -D $PGDATA2 stop &>>$LOGFILE #show last several archived segments echo "# primary archive" ls $ARCHDIR1 | egrep '[3-9]$' echo "# standby archive" ls $ARCHDIR2 | egrep '[3-9]$'
On 2021/08/31 16:35, Kyotaro Horiguchi wrote: > I'm not sure which is simpler, but it works except for B, the case of > a long-jump by a segment switch. When a segment switch happens, > walsender sends filling zero-pages but even if walreceiver is > terminated before the segment is completed, walsender restarts from > the next segment at the next startup. Concretely like the following. > > - pg_switch_wal() invoked at 6003228 (for example) > - walreceiver terminates at 6500000 (or a bit later). > - walrecever rstarts from 7000000 > > In this case the segment 6 is not notified even with the patch, and my > old patches works the same way. (In other words, the call to > XLogWalRcvClose() at the end of XLogWalRcvWrite doens't work for the > case as you might expect.) If we think it ok that we don't notify the > segment earlier than a future checkpoint removes it, yours or only the > last half of my one is sufficient, but do we really think so? > Furthermore, your patch or only the last half of my second patch > doesn't save the case of a crash unlike the case of a graceful > termination. Thanks for the clarification! Please let me check my understanding about the issue. The issue happens when walreceiver exits after it receives XLOG_SWITCH record but before receives the remaining bytes of the segment including that XLOG_SWITCH record. In this case, the startup process tries to replay that "half-received" segment, finds XLOG_SWITCH record in it, moves to the next segment and then starts new walreceiver from that next segment. Therefore, even with my patch, the segment including that XLOG_SWITCH record is not archived soon. Is my understanding right? I agree that we should address also this issue. ISTM, to address the issue, it's simpler and less fragile to make the startup process call XLogArchiveCheckDone() or something whenever it moves the next segment, rather than make walreceiver do that. Thought? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
At Tue, 31 Aug 2021 23:23:27 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2021/08/31 16:35, Kyotaro Horiguchi wrote: > > I'm not sure which is simpler, but it works except for B, the case of > > a long-jump by a segment switch. When a segment switch happens, > > walsender sends filling zero-pages but even if walreceiver is > > terminated before the segment is completed, walsender restarts from > > the next segment at the next startup. Concretely like the following. > > - pg_switch_wal() invoked at 6003228 (for example) > > - walreceiver terminates at 6500000 (or a bit later). > > - walrecever rstarts from 7000000 > > In this case the segment 6 is not notified even with the patch, and my > > old patches works the same way. (In other words, the call to > > XLogWalRcvClose() at the end of XLogWalRcvWrite doens't work for the > > case as you might expect.) If we think it ok that we don't notify the > > segment earlier than a future checkpoint removes it, yours or only the > > last half of my one is sufficient, but do we really think so? > > Furthermore, your patch or only the last half of my second patch > > doesn't save the case of a crash unlike the case of a graceful > > termination. > > Thanks for the clarification! > Please let me check my understanding about the issue. > > The issue happens when walreceiver exits after it receives XLOG_SWITCH > record > but before receives the remaining bytes of the segment including that > XLOG_SWITCH record. In this case, the startup process tries to replay > that > "half-received" segment, finds XLOG_SWITCH record in it, moves to the > next > segment and then starts new walreceiver from that next > segment. Therefore, > even with my patch, the segment including that XLOG_SWITCH record is > not > archived soon. Is my understanding right? I agree that we should > address also > this issue. Right. > ISTM, to address the issue, it's simpler and less fragile to make the > startup > process call XLogArchiveCheckDone() or something whenever it moves > the next segment, rather than make walreceiver do that. Thought? Putting aside the issue C, it would work as far as recovery is not paused or delayed. Although simply doing that means we run additional and a bit) wasteful XLogArchiveCheckDone() in most cases, It's hard to imagine moving the responsibility to notify a finished segment from walsender (writer side) to startup (reader side). In the first place A and B happens only at termination or crash of walsender so there's no fragility in checking only the previous segment at start of walsender. After a bit thought I noticed that we don't need to do that in the wal-writing loop. And I noticed that we need to consider timeline transitions while calculating the previous segment. Even though missing-notification at a timeline-switch doesn't happen unless walsender is killed hard for example by a sigkill or a power cut, though. So the attached is a new version of the patch to fix only A and B. - Moved the check code out of the replication loop. - Track timeline transition while calculating the previous segment. If we don't do that, we would need another means to avoid notifying non-existent segment instead of the correct one. regards. -- Kyotaro Horiguchi NTT Open Source Software Center diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 60de3be92c..81dde27372 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -173,6 +173,7 @@ WalReceiverMain(void) XLogRecPtr startpoint; TimeLineID startpointTLI; TimeLineID primaryTLI; + XLogSegNo startsegno; bool first_stream; WalRcvData *walrcv = WalRcv; TimestampTz last_recv_timestamp; @@ -313,6 +314,32 @@ WalReceiverMain(void) if (sender_host) pfree(sender_host); + /* + * There's a case walreceiver terminated before notifying the last + * finished segment. Make sure the last finished segment is archived + * immediately. + */ + XLByteToSeg(startpoint, startsegno, wal_segment_size); + if (startsegno > 1) + { + char xlogfname[MAXFNAMELEN]; + TimeLineID prevsegTLI; + XLogRecPtr prevsegEndRecPtr; + List *tles; + + /* + * The previous segment may be in the previous timeline. Track + * timelines to find the segment on the correct timeline. + */ + tles = readTimeLineHistory(startpointTLI); + prevsegEndRecPtr = + startpoint - XLogSegmentOffset(startpoint, wal_segment_size) - 1; + prevsegTLI = tliOfPointInHistory(prevsegEndRecPtr, tles); + XLogFileName(xlogfname, prevsegTLI, startsegno - 1, + wal_segment_size); + XLogArchiveCheckDone(xlogfname); + } + first_stream = true; for (;;) {
On 2021/09/01 12:12, Kyotaro Horiguchi wrote: > Putting aside the issue C, it would work as far as recovery is not > paused or delayed. Although simply doing that means we run additional > and a bit) wasteful XLogArchiveCheckDone() in most cases, It's hard to > imagine moving the responsibility to notify a finished segment from > walsender (writer side) to startup (reader side). You mean walreceiver, not walsender? I was thinking to apply my latest patch, to address the issue A and C. So walreceiver is still basically responsible to create .ready file. Also regarding the issue B, I was thinking to make the startup process call XLogArchiveCheckDone() or something only when it finds XLOG_SWITCH record. Thought? > In the first place A and B happens only at termination or crash of > walsender so there's no fragility in checking only the previous > segment at start of walsender. After a bit thought I noticed that we > don't need to do that in the wal-writing loop. And I noticed that we > need to consider timeline transitions while calculating the previous > segment. Even though missing-notification at a timeline-switch > doesn't happen unless walsender is killed hard for example by a > sigkill or a power cut, though. What happens if the server is promoted before that walreceiver is invoked? Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
At Wed, 1 Sep 2021 14:37:43 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2021/09/01 12:12, Kyotaro Horiguchi wrote: > > Putting aside the issue C, it would work as far as recovery is not > > paused or delayed. Although simply doing that means we run additional > > and a bit) wasteful XLogArchiveCheckDone() in most cases, It's hard to > > imagine moving the responsibility to notify a finished segment from > > walsender (writer side) to startup (reader side). > > You mean walreceiver, not walsender? Sorry, it's walreceiver. > I was thinking to apply my latest patch, to address the issue A and C. > So walreceiver is still basically responsible to create .ready file. Considering the following discussion, I don't object to the patch. > Also regarding the issue B, I was thinking to make the startup process > call XLogArchiveCheckDone() or something only when it finds > XLOG_SWITCH record. Thought? Sounds workable. I came to agree to the reader-side amendment as below. But I might prefer to do that at every segment-switch in case of a crash. > What happens if the server is promoted before that walreceiver is > invoked? Hmmmmm. A partial segment is not created if a server promotes just at a segment boundary, then the previous segment won't get archived until the next checkpoint runs. Ok, I agree that the reader-side needs an amendment. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2021/09/02 10:16, Kyotaro Horiguchi wrote: > Ok, I agree that the reader-side needs an amendment. Thanks for the review! Attached is the updated version of the patch. Based on my latest patch, I changed the startup process so that it creates an archive notification file of the streamed WAL segment including XLOG_SWITCH record if the notification file has not been created yet. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachment
At Fri, 3 Sep 2021 02:06:45 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2021/09/02 10:16, Kyotaro Horiguchi wrote: > > Ok, I agree that the reader-side needs an amendment. > > Thanks for the review! Attached is the updated version of the patch. > Based on my latest patch, I changed the startup process so that > it creates an archive notification file of the streamed WAL segment > including XLOG_SWITCH record if the notification file has not been > created yet. + if (readSource == XLOG_FROM_STREAM && + record->xl_rmid == RM_XLOG_ID && + (record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH) readSource is the source at the time startup reads it and it could be different from the source at the time the record was written. We cannot know where the record came from there, so does the readSource condition work as expected? If we had some trouble streaming just before, readSource at the time is likely to be XLOG_FROM_PG_WAL. + if (XLogArchivingAlways()) + XLogArchiveNotify(xlogfilename, true); + else + XLogArchiveForceDone(xlogfilename); The path is used both for crash and archive recovery. If we pass there while crash recovery on a primary with archive_mode=on, the file could be marked .done before actually archived. On the other hand when archive_mode=always, the file could be re-marked .ready even after it has been already archived. Why isn't it XLogArchiveCheckDone? regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2021/09/03 14:56, Kyotaro Horiguchi wrote: > + if (readSource == XLOG_FROM_STREAM && > + record->xl_rmid == RM_XLOG_ID && > + (record->xl_info & ~XLR_INFO_MASK) == XLOG_SWITCH) > > readSource is the source at the time startup reads it and it could be > different from the source at the time the record was written. We > cannot know where the record came from there, so does the readSource > condition work as expected? If we had some trouble streaming just > before, readSource at the time is likely to be XLOG_FROM_PG_WAL. Yes. > + if (XLogArchivingAlways()) > + XLogArchiveNotify(xlogfilename, true); > + else > + XLogArchiveForceDone(xlogfilename); > > The path is used both for crash and archive recovery. If we pass there > while crash recovery on a primary with archive_mode=on, the file could > be marked .done before actually archived. On the other hand when > archive_mode=always, the file could be re-marked .ready even after it > has been already archived. Why isn't it XLogArchiveCheckDone? Yeah, you're right. ISTM what we should do is to just call XLogArchiveCheckDone() for the segment including XLOG_SWITCH record, i.e., to create .ready file if the segment has no archive notification file yet and archive_mode is set to 'always'. Even if we don't do this when we reach XLOG_SWITCH record, subsequent restartpoints eventually will call XLogArchiveCheckDone() for such segments. One issue of this approach is that the WAL segment including XLOG_SWITCH record may be archived before its previous segments are. That is, the notification file of current segment is created when it's replayed because it includes XLOG_SWIATCH, but the notification files of its previous segments will be created by subsequent restartpoints because they don't have XLOG_SWITCH. Probably we should avoid this? If yes, one approach for this issue is to call XLogArchiveCheckDone() for not only the segment including XLOG_SWITCH but also all the segments older than that. Thought? Anyway, I extracted the changes in walreceiver from the patch, because it's self-contained and can be applied separately. Patch attached. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachment
At Tue, 7 Sep 2021 17:03:06 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > + if (XLogArchivingAlways()) > > + XLogArchiveNotify(xlogfilename, true); > > + else > > + XLogArchiveForceDone(xlogfilename); > > The path is used both for crash and archive recovery. If we pass there > > while crash recovery on a primary with archive_mode=on, the file could > > be marked .done before actually archived. On the other hand when > > archive_mode=always, the file could be re-marked .ready even after it > > has been already archived. Why isn't it XLogArchiveCheckDone? > > Yeah, you're right. ISTM what we should do is to just call > XLogArchiveCheckDone() for the segment including XLOG_SWITCH record, > i.e., to create .ready file if the segment has no archive notification > file yet > and archive_mode is set to 'always'. Even if we don't do this when we > reach > XLOG_SWITCH record, subsequent restartpoints eventually will call > XLogArchiveCheckDone() for such segments. > > One issue of this approach is that the WAL segment including > XLOG_SWITCH > record may be archived before its previous segments are. That is, > the notification file of current segment is created when it's replayed > because it includes XLOG_SWIATCH, but the notification files of > its previous segments will be created by subsequent restartpoints > because they don't have XLOG_SWITCH. Probably we should avoid this? Anyway there's no guarantee on the archive ordering. As discussed in the nearby thread [1], newer segment is often archived earlier. I agree that that happens mainly on busy servers, though. The archiver is designed to handle such "gaps" and/or out-of-order segment notifications. We could impose a strict ordering on archiving but I think we would take total performance than such strictness. In short, no. > If yes, one approach for this issue is to call XLogArchiveCheckDone() > for > not only the segment including XLOG_SWITCH but also all the segments > older than that. Thought? At least currently, recovery fetches segments by a single process and every file is marked immediately after being filled-up, so all files other than the latest one in pg_wal including history files should have been marked for sure unless file system gets into a trouble. So I think we don't need to do that even if we want the strictness. Addition to that that takes too long time when many segments reside in pg_wal so I think we never want to run such a scan at every segment end that recovery passes. If I remember correctly, the reason we don't fix archive status at start up but at checkpoint is we avoided extra startup time. > Anyway, I extracted the changes in walreceiver from the patch, > because it's self-contained and can be applied separately. > Patch attached. I'm not sure I like that XLogWalRcvClose hides the segment-switch condition. If we do that check in the function, I'd like to name the function XLogWalRcvCloseIfSwitched or something indicates the condition. I'd like to invert the condition to reduce indentation, too. Why don't we call it just after writing data, as my first proposal did? There's no difference in functionality between doing that and the patch. If we do so, recvFile>=0 is always true and that condition can be removed but that would be optional. Anyway, by doing that, no longer need to call the function twice or we can even live without the new function. [1] https://www.postgresql.org/message-id/20210504042755.ehuaoz5blcnjw5yk%40alap3.anarazel.de regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2021/09/08 10:45, Kyotaro Horiguchi wrote: > Anyway there's no guarantee on the archive ordering. As discussed in > the nearby thread [1], newer segment is often archived earlier. I > agree that that happens mainly on busy servers, though. The archiver > is designed to handle such "gaps" and/or out-of-order segment > notifications. We could impose a strict ordering on archiving but I > think we would take total performance than such strictness. Yes, there are other cases causing newer WAL file to be archived eariler. The issue can happen if XLogArchiveNotify() fails to create .ready file, for example. Fixing only the case that we're discussing here is not enough. If *general* fix is discussed at the thread you told, it's better to do nothing here for the issue and to just make the startup process call XLogArchiveCheckDone() if it finds the WAL file including XLOG_SWITCH record. > At least currently, recovery fetches segments by a single process and > every file is marked immediately after being filled-up, so all files > other than the latest one in pg_wal including history files should > have been marked for sure unless file system gets into a trouble. You can reproduce that situation easily by starting the server with archive_mode=off, generating WAL files, sometimes running pg_switch_wal(), causing the server to crash, and then restarting the server with archive_mode=on. At the beginning of recovery, all the WAL files in pg_wal don't have their archive notification files at all. Then, with the patch, only WAL files including XLOG_SWITCH are notified for WAL archiving during recovery. The other WAL files will be notified at the subsequent checkpoint. > I'm not sure I like that XLogWalRcvClose hides the segment-switch > condition. If we do that check in the function, I'd like to name the > function XLogWalRcvCloseIfSwitched or something indicates the > condition. I'd like to invert the condition to reduce indentation, > too. We can move the condition-check out of the function like the attached patch. > Why don't we call it just after writing data, as my first proposal > did? There's no difference in functionality between doing that and the > patch. If we do so, recvFile>=0 is always true and that condition can > be removed but that would be optional. Anyway, by doing that, no > longer need to call the function twice or we can even live without the > new function. I think that it's better and more robust to confirm that the currently-opened WAL file is valid target one to write WAL *before* actually writing any data into it. Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachment
At Wed, 8 Sep 2021 16:01:22 +0900, Fujii Masao <masao.fujii@oss.nttdata.com> wrote in > > > On 2021/09/08 10:45, Kyotaro Horiguchi wrote: > > Anyway there's no guarantee on the archive ordering. As discussed in > > the nearby thread [1], newer segment is often archived earlier. I > > agree that that happens mainly on busy servers, though. The archiver > > is designed to handle such "gaps" and/or out-of-order segment > > notifications. We could impose a strict ordering on archiving but I > > think we would take total performance than such strictness. > > Yes, there are other cases causing newer WAL file to be archived > eariler. > The issue can happen if XLogArchiveNotify() fails to create .ready > file, > for example. Fixing only the case that we're discussing here is not > enough. > If *general* fix is discussed at the thread you told, it's better to > do nothing here for the issue and to just make the startup process > call > XLogArchiveCheckDone() if it finds the WAL file including XLOG_SWITCH > record. No. The discussion taken there is not about permanently missing .ready files, but about .ready files created out-of-order. So I don't think the outcome from the thread does *fix* this issue. > > At least currently, recovery fetches segments by a single process and > > every file is marked immediately after being filled-up, so all files > > other than the latest one in pg_wal including history files should > > have been marked for sure unless file system gets into a trouble. > > You can reproduce that situation easily by starting the server with > archive_mode=off, generating WAL files, sometimes running > pg_switch_wal(), > causing the server to crash, and then restarting the server with > archive_mode=on. At the beginning of recovery, all the WAL files in > pg_wal > don't have their archive notification files at all. Then, with the > patch, > only WAL files including XLOG_SWITCH are notified for WAL archiving > during recovery. The other WAL files will be notified at the > subsequent > checkpoint. I don't think we want such extent of perfectness at all for the case where some archive-related parameters are changed after a crash. Anyway we need to take a backup after that and at least all segments required for the backup will be properly archived. The segments up to the XLOG_SWITCH segment are harmless garbage, or a bit of food for disk. > > I'm not sure I like that XLogWalRcvClose hides the segment-switch > > condition. If we do that check in the function, I'd like to name the > > function XLogWalRcvCloseIfSwitched or something indicates the > > condition. I'd like to invert the condition to reduce indentation, > > too. > > We can move the condition-check out of the function like the attached > patch. Thanks! > > Why don't we call it just after writing data, as my first proposal > > did? There's no difference in functionality between doing that and the > > patch. If we do so, recvFile>=0 is always true and that condition can > > be removed but that would be optional. Anyway, by doing that, no > > longer need to call the function twice or we can even live without the > > new function. > > I think that it's better and more robust to confirm that the > currently-opened > WAL file is valid target one to write WAL *before* actually writing > any data > into it. Sounds convincing. Ok, I agree to that. regards. -- Kyotaro Horiguchi NTT Open Source Software Center
On 2021/09/08 16:40, Kyotaro Horiguchi wrote: > No. The discussion taken there is not about permanently missing .ready > files, but about .ready files created out-of-order. So I don't think > the outcome from the thread does *fix* this issue. Hmm... > I don't think we want such extent of perfectness at all for the case > where some archive-related parameters are changed after a > crash. Anyway we need to take a backup after that and at least all > segments required for the backup will be properly archived. The > segments up to the XLOG_SWITCH segment are harmless garbage, or a bit > of food for disk. So probably we reached the consensus that something like the attached patch (XLogArchiveCheckDone_walfile_xlog_switch.patch) is enough for the corner case where walreceiver fails to create .ready file of WAL file including XLOG_SWITCH? > Sounds convincing. Ok, I agree to that. So barring any objection, I will commit the patch and back-patch it to all supported version. walreceiver_notify_archive_soon_v5.patch walreceiver_notify_archive_soon_v5_pg14-13.patch walreceiver_notify_archive_soon_v5_pg12-11.patch walreceiver_notify_archive_soon_v5_pg10.patch walreceiver_notify_archive_soon_v5_pg96.patch Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION
Attachment
On 2021/09/08 22:41, Fujii Masao wrote: > So probably we reached the consensus that something like the attached patch > (XLogArchiveCheckDone_walfile_xlog_switch.patch) is enough for the corner > case where walreceiver fails to create .ready file of WAL file including > XLOG_SWITCH? I attached the patch again, just in the case. >> Sounds convincing. Ok, I agree to that. > > So barring any objection, I will commit the patch > and back-patch it to all supported version. Pushed. Thanks! Regards, -- Fujii Masao Advanced Computing Technology Center Research and Development Headquarters NTT DATA CORPORATION