Thread: WAL shipping replica failed to recover database with error: invalid contrecord length 1956 at FED/38FFE208

Hello,

We ran out of disk space on our production primary database, on the file system dedicated to WAL (one of our logical replicas died, so WAL files accumulated).
As expected, the primary instance shut down:


Primary instance:
PostgreSQL 11.3 (Ubuntu 11.3-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0, 64-bit

2019-09-29 19:24:28 UTC 2241 5d36488f.8c1 1    0    [53100]:PANIC:  could not write to file "pg_wal/xlogtemp.2241": No space left on device
2019-09-29 19:24:30 UTC 1706 5d364871.6aa 3387    0    [00000]:LOG:  WAL writer process (PID 2241) was terminated by signal 6: Aborted
2019-09-29 19:24:30 UTC 1706 5d364871.6aa 3388    0    [00000]:LOG:  terminating any other active server processes

2019-09-29 19:24:30 UTC 1706 5d364871.6aa 3390    0    [00000]:LOG:  all server processes terminated; reinitializing
2019-09-29 19:24:30 UTC 21499 5d9104ee.53fb 1    0    [00000]:LOG:  database system was interrupted; last known up at 2019-09-29 19:23:47 UTC

2019-09-29 19:24:33 UTC 21499 5d9104ee.53fb 2    0    [00000]:LOG:  recovered replication state of node 1 to CF3/442D0758
2019-09-29 19:24:33 UTC 21499 5d9104ee.53fb 3    0    [00000]:LOG:  database system was not properly shut down; automatic recovery in progress
2019-09-29 19:24:33 UTC 21499 5d9104ee.53fb 4    0    [00000]:LOG:  redo starts at FE9/A49830D8

2019-09-29 19:25:55 UTC 21499 5d9104ee.53fb 5    0    [00000]:LOG:  redo done at FED/38FFC540
2019-09-29 19:25:55 UTC 21499 5d9104ee.53fb 6    0    [00000]:LOG:  last completed transaction was at log time 2019-09-29 19:22:06.597333+00
2019-09-29 19:25:55 UTC 21499 5d9104ee.53fb 7    0    [00000]:LOG:  checkpoint starting: end-of-recovery immediate

2019-09-29 19:26:59 UTC 21499 5d9104ee.53fb 8    0    [00000]:LOG:  checkpoint complete: wrote 1046940 buffers (99.8%); 0 WAL file(s) added, 0 removed, 0 recycled; write=64.166 s, sync=0.008 s, total=64.212 s; sync files=184, longest=0.001 s, average=0.000 s; distance=15014380 kB, estimate=15014380 kB
2019-09-29 19:26:59 UTC 21499 5d9104ee.53fb 9    0    [53100]:FATAL:  could not write to file "pg_wal/xlogtemp.21499": No space left on device
2019-09-29 19:26:59 UTC 1706 5d364871.6aa 3391    0    [00000]:LOG:  startup process (PID 21499) exited with exit code 1
2019-09-29 19:26:59 UTC 1706 5d364871.6aa 3392    0    [00000]:LOG:  aborting startup due to startup process failure
2019-09-29 19:26:59 UTC 1706 5d364871.6aa 3393    0    [00000]:LOG:  database system is shut down

Expected behavior: free space on the WAL file system was 14 MB, not enough for the next WAL file to be written.

Once the failed logical replica was back up and running, we extended the file system for WAL and started the primary instance:

2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 1    0    [00000]:LOG:  Auto detecting pg_stat_kcache.linux_hz parameter...
2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 2    0    [00000]:LOG:  pg_stat_kcache.linux_hz is set to 1000000
2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 3    0    [00000]:LOG:  listening on IPv4 address "0.0.0.0", port 5432
2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 4    0    [00000]:LOG:  listening on IPv6 address "::", port 5432
2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 5    0    [00000]:LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2019-09-30 08:21:44 UTC 13637 5d91bb18.3545 1    0    [00000]:LOG:  database system was shut down at 2019-09-29 19:26:59 UTC
2019-09-30 08:21:44 UTC 13637 5d91bb18.3545 2    0    [00000]:LOG:  recovered replication state of node 1 to CF3/442D0758
2019-09-30 08:21:44 UTC 13646 5d91bb18.354e 1    0    [00000]:LOG:  pg_partman master background worker master process initialized with role postgres
2019-09-30 08:21:44 UTC 13635 5d91bb18.3543 6    0    [00000]:LOG:  database system is ready to accept connections
2019-09-30 08:21:44 UTC 13645 5d91bb18.354d 1    0 2/0   [00000]:LOG:  pg_cron scheduler started
2019-09-30 08:21:44 UTC 13647 5d91bb18.354f 1    0 3/0   [00000]:LOG:  POWA connected to database powa

So the primary instance started, all logical replication streams connected, and as soon as logical replication for the failed subscriber processed the accumulated WAL, unnecessary WAL files were deleted as expected. Up to this point, everything worked like a charm from the primary instance's point of view.


From this primary instance we are running a WAL-shipping replica (pgBackRest is used), and this is where the issues started.

WAL recovery replica:
PostgreSQL 11.3 (Ubuntu 11.3-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0, 64-bit

WAL files were restored from the backup repository up to the last one archived before the primary instance shut down.

2019-09-29 19:24:31.390 P00   INFO: found 0000000100000FED00000037 in the archive
2019-09-29 19:24:31.391 P00   INFO: archive-get command end: completed successfully (43ms)
2019-09-29 19:24:31 UTC 3062 5d76145f.bf6 181012    0 1/0   [00000]:LOG:  restored log file "0000000100000FED00000037" from archive
2019-09-29 19:24:31.919 P00   INFO: archive-get command begin 2.16: [0000000100000FED00000038, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-29 19:24:31.958 P00   INFO: found 0000000100000FED00000038 in the archive
2019-09-29 19:24:31.959 P00   INFO: archive-get command end: completed successfully (41ms)
2019-09-29 19:24:31 UTC 3062 5d76145f.bf6 181013    0 1/0   [00000]:LOG:  restored log file "0000000100000FED00000038" from archive
2019-09-29 19:24:32.505 P00   INFO: archive-get command begin 2.16: [0000000100000FED00000039, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-29 19:24:32.505 P00   INFO: unable to find 0000000100000FED00000039 in the archive
2019-09-29 19:24:32.505 P00   INFO: archive-get command end: completed successfully (2ms)
2019-09-29 19:24:34.247 P00   INFO: archive-get command begin 2.16: [0000000100000FED00000039, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-29 19:24:34.448 P00   INFO: unable to find 0000000100000FED00000039 in the archive
2019-09-29 19:24:34.448 P00   INFO: archive-get command end: completed successfully (204ms)

So the replica instance was painfully waiting, and waiting, and waiting for the next WAL file: 0000000100000FED00000039

And yes, after the long outage pgBackRest found and restored WAL file 0000000100000FED00000039:

2019-09-30 08:22:03.673 P00   INFO: archive-get command begin 2.16: [0000000100000FED00000039, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-30 08:22:03.874 P00   INFO: unable to find 0000000100000FED00000039 in the archive
2019-09-30 08:22:03.875 P00   INFO: archive-get command end: completed successfully (203ms)
2019-09-30 08:22:08.678 P00   INFO: archive-get command begin 2.16: [0000000100000FED00000039, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-30 08:22:09.121 P00   INFO: found 0000000100000FED00000039 in the archive
2019-09-30 08:22:09.123 P00   INFO: archive-get command end: completed successfully (446ms)
2019-09-30 08:22:09 UTC 3062 5d76145f.bf6 181014    0 1/0   [00000]:LOG:  restored log file "0000000100000FED00000039" from archive
2019-09-30 08:22:09 UTC 3062 5d76145f.bf6 181015    0 1/0   [00000]:LOG:  invalid contrecord length 1956 at FED/38FFE208
2019-09-30 08:22:13.683 P00   INFO: archive-get command begin 2.16: [0000000100000FED00000038, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-30 08:22:14.132 P00   INFO: found 0000000100000FED00000038 in the archive
2019-09-30 08:22:14.135 P00   INFO: archive-get command end: completed successfully (453ms)
2019-09-30 08:22:14 UTC 3062 5d76145f.bf6 181016    0 1/0   [00000]:LOG:  restored log file "0000000100000FED00000038" from archive
2019-09-30 08:22:14.195 P00   INFO: archive-get command begin 2.16: [0000000100000FED00000039, pg_wal/RECOVERYXLOG] --archive-async --buffer-size=16777216 --compress-level=3 --log-level-console=info --pg1-path=/pgsql/anl_master_rep/pgcluster/11/pgdata --process-max=2 --repo1-cipher-type=none --repo1-path=/pgbackrest --repo1-s3-bucket=finmason-db-backups --repo1-s3-endpoint=s3.us-east-1.amazonaws.com --repo1-s3-key=<redacted> --repo1-s3-key-secret=<redacted> --repo1-s3-region=us-east-1 --repo1-type=s3 --spool-path=/var/spool/pgbackrest --stanza=anl_master_prod
2019-09-30 08:22:14.236 P00   INFO: found 0000000100000FED00000039 in the archive
2019-09-30 08:22:14.238 P00   INFO: archive-get command end: completed successfully (44ms)
2019-09-30 08:22:14 UTC 3062 5d76145f.bf6 181017    0 1/0   [00000]:LOG:  restored log file "0000000100000FED00000039" from archive
2019-09-30 08:22:14 UTC 3062 5d76145f.bf6 181018    0 1/0   [00000]:LOG:  invalid contrecord length 1956 at FED/38FFE208

But recovery on the replica failed to process WAL file 0000000100000FED00000039, with log message: "invalid contrecord length 1956 at FED/38FFE208".

Since WAL file 0000000100000FED00000039 still existed on the primary instance, I copied it to the WAL replica, but it was identical to the file restored by pgBackRest, so I assume that the WAL file backup/restore was not the root cause of this issue.
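
For illustration, such a comparison (and a closer look at the failing record) can be done along these lines; the paths are examples only, and pg_waldump ships with PostgreSQL 11:

# checksum the segment on the primary and the copy restored on the replica
md5sum /path/on/primary/pg_wal/0000000100000FED00000039
md5sum /path/on/replica/pg_wal/0000000100000FED00000039

# dump the records around the reported LSN; both segments are needed
# because a WAL record can continue across the segment boundary
pg_waldump --start=FED/38FFE208 0000000100000FED00000038 0000000100000FED00000039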

The recovery.conf on the replica instance is simple:

standby_mode = 'on'
restore_command = 'pgbackrest --log-level-console=info --process-max=6 --stanza=anl_master_prod archive-get %f "%p"'

After a planned, properly performed shutdown of the primary instance, recovery on the WAL replica resumes smoothly.

And indeed, after restoring the database from a backup taken after the primary instance issue, the WAL replica works perfectly, as usual.

Apart from the obvious need to reconfigure monitoring to warn about WAL free space sooner and avoid such a case, my expectation was that the WAL replica would be able to resume recovery, but it was not.
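
For example, simple checks along these lines (paths, names and thresholds are placeholders) can give an earlier warning about both the WAL file system filling up and an inactive logical slot retaining WAL:

# free space on the file system dedicated to pg_wal
df -h /path/to/wal_filesystem

# WAL retained by replication slots; an inactive logical slot keeps WAL forever
psql -At -c "SELECT slot_name, active, pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes FROM pg_replication_slots;"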

We have stored copies of the 0000000100000FED0000003[89] WAL files for analysis. Is there anything that could help us find the root cause and mitigate this issue?

I've found BUG #15412, which seems to be the same issue, but it remains open.

Thanks for any hints on how to recover from this issue or analyze it more deeply to find the root cause.

Kind regards, Ales Zeleny
Greetings,

* Aleš Zelený (zeleny.ales@gmail.com) wrote:
> But recovery on replica failed to proceed WAL file
> 0000000100000FED00000039  with log message: " invalid contrecord length
> 1956 at FED/38FFE208".

Err- you've drawn the wrong conclusion from that message (and you're
certainly not alone- it's a terrible message and we should really have a
HINT there or something).  That's an INFO-level message, not an error,
and basically just means "oh, look, there's an invalid WAL record, guess
we got to the end of the WAL available from this source."  If you had
had primary_conninfo configured in your recovery.conf, PG would likely
have connected to the primary and started replication.  One other point
is that if you actually did a promotion in this process somewhere, then
you might want to set recovery_target_timeline=latest, to make sure the
replica follows along on the timeline switch that happens when a
promotion happens.
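
For illustration, a recovery.conf along those lines might look like this (the connection details are placeholders; the restore_command is the one you already use):

standby_mode = 'on'
restore_command = 'pgbackrest --log-level-console=info --process-max=6 --stanza=anl_master_prod archive-get %f "%p"'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
recovery_target_timeline = 'latest'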

Thanks,

Stephen

Stephen Frost <sfrost@snowman.net> writes:
> * Aleš Zelený (zeleny.ales@gmail.com) wrote:
>> But recovery on replica failed to proceed WAL file
>> 0000000100000FED00000039  with log message: " invalid contrecord length
>> 1956 at FED/38FFE208".

> Err- you've drawn the wrong conclusion from that message (and you're
> certainly not alone- it's a terrible message and we should really have a
> HINT there or something).

Yeah, those messages are all pretty ancient, from when WAL was new and not
to be trusted much.  Perhaps the thing to do is move the existing info
into DETAIL and make the primary message be something like "reached
apparent end of WAL stream".

            regards, tom lane



Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > * Aleš Zelený (zeleny.ales@gmail.com) wrote:
> >> But recovery on replica failed to proceed WAL file
> >> 0000000100000FED00000039  with log message: " invalid contrecord length
> >> 1956 at FED/38FFE208".
>
> > Err- you've drawn the wrong conclusion from that message (and you're
> > certainly not alone- it's a terrible message and we should really have a
> > HINT there or something).
>
> Yeah, those messages are all pretty ancient, from when WAL was new and not
> to be trusted much.  Perhaps the thing to do is move the existing info
> into DETAIL and make the primary message be something like "reached
> apparent end of WAL stream".

Yes, +1 on that.

Thanks,

Stephen

Hello,

On Thu, 3 Oct 2019 at 00:09, Stephen Frost <sfrost@snowman.net> wrote:
> Greetings,
>
> * Aleš Zelený (zeleny.ales@gmail.com) wrote:
> > But recovery on replica failed to proceed WAL file
> > 0000000100000FED00000039  with log message: " invalid contrecord length
> > 1956 at FED/38FFE208".
>
> Err- you've drawn the wrong conclusion from that message (and you're
> certainly not alone- it's a terrible message and we should really have a
> HINT there or something).  That's an INFO-level message, not an error,
> and basically just means "oh, look, there's an invalid WAL record, guess
> we got to the end of the WAL available from this source."  If you had
> had primary_conninfo configured in your recovery.conf, PG would likely
> have connected to the primary and started replication.  One other point
> is that if you actually did a promotion in this process somewhere, then
> you might want to set recovery_target_timeline=latest, to make sure the
> replica follows along on the timeline switch that happens when a
> promotion happens.

Thanks (for all the comments from others as well) for this explanation. I failed to describe our case properly. We are recovering the replica instance from WAL files only, since the replica is in a separate network (it is used as a source for ZFS cloning for development), so there is no primary_conninfo (the replica cannot influence the primary instance in any way; it just consumes the primary instance's WAL), and we did not perform a replica promotion.

I'd guess that with the out-of-disk-space issue the last WAL file might be incomplete, but it had the expected size of 16777216 bytes on the primary instance's disk, and it was binary identical to the file restored on the replica from the backup. The issue was that the replica emitted this INFO message but was not able to move to the next WAL file and started falling behind the primary instance.

If the WAL file was incomplete when we ran out of space, it might have been appended to during instance start (though I doubt that archive_command would be invoked on an incomplete WAL file); that is why I compared the file on the primary (after it was back up and running) with the one restored on the replica instance.

In other words, if this log message were emitted only once and recovery continued restoring subsequent WAL files, I'd be fine with that; but since recovery is stuck at this WAL file, I'm in doubt whether I did something wrong (e.g. an improper recovery.conf ...) or what a possible workaround is to let the replica (if possible) get past this WAL file and continue recovery. The database size is almost 2 TB, which is why I'd like to avoid full restores to create DEV environments and use ZFS clones instead.

Thanks for any hints on how to let the replica continue applying WAL files.

Kind regards, Ales Zeleny
Hello.

At Wed, 2 Oct 2019 19:24:02 -0400, Stephen Frost <sfrost@snowman.net> wrote in
> Greetings,
>
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> > Stephen Frost <sfrost@snowman.net> writes:
> > > * Aleš Zelený (zeleny.ales@gmail.com) wrote:
> > >> But recovery on replica failed to proceed WAL file
> > >> 0000000100000FED00000039  with log message: " invalid contrecord length
> > >> 1956 at FED/38FFE208".
> >
> > > Err- you've drawn the wrong conclusion from that message (and you're
> > > certainly not alone- it's a terrible message and we should really have a
> > > HINT there or something).
> >
> > Yeah, those messages are all pretty ancient, from when WAL was new and not
> > to be trusted much.  Perhaps the thing to do is move the existing info
> > into DETAIL and make the primary message be something like "reached
> > apparent end of WAL stream".
>
> Yes, +1 on that.

The attached patches do something like that. AFAICS we can assume that an
invalid record means the end of WAL as long as we are fetching
successive records and XLogPageRead() has not logged anything.

As the current comment in ReadRecord says, we don't have a message if
standby has been triggered, but that's not always the case. Still we
may have a message if a new segment is available after triggering.

I used a kind-of-tricky way to handle optional errdetail but it may be
better to have some additional feature as an ereport
subfunction. (Maybe named "errif" or something?)

I think it works fine as expected but I find one arguable behavior.

In the first patch, the "end of WAL" message is accompanied by source,
LSN and timeline.

LOG:  reached end of WAL in archive at 0/A3000060 on timeline 8
DETAIL:  invalid record length at 0/A3000060: wanted 24, got 0

The last two seem good, but the first one shows "archive" even on
promotion. It is right that we scan the archive after promotion, but that seems
confusing. So I attached another patch that shows the operation instead of
the WAL source.

LOG:  reached end of WAL during streaming replication at 0/A5000060 on timeline 8
DETAIL:  invalid record length at 0/A5000060: wanted 24, got 0

What do you think about this?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2bd3d0e5e5..70fd34659a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -852,6 +852,9 @@ static bool bgwriterLaunched = false;
 static int    MyLockNo = 0;
 static bool holdingAllLocks = false;
 
+/* Have we complained about the record at the location? */
+static XLogRecPtr lastComplaint = 0;
+
 #ifdef WAL_DEBUG
 static MemoryContext walDebugCxt = NULL;
 #endif
@@ -889,6 +892,7 @@ static int    XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                                         bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int    emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
+static bool have_complained_at(XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
 static void RemoveTempXlogFiles(void);
@@ -4264,8 +4268,11 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
         record = XLogReadRecord(xlogreader, RecPtr, &errormsg);
         ReadRecPtr = xlogreader->ReadRecPtr;
         EndRecPtr = xlogreader->EndRecPtr;
+
         if (record == NULL)
         {
+            XLogRecPtr reportptr = RecPtr ? RecPtr : EndRecPtr;
+
             if (readFile >= 0)
             {
                 close(readFile);
@@ -4273,13 +4280,24 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something.
+             * When we end up here while reading successive records, we reached
+             * the end of WAL unless XLogPageRead() has logged something. We
+             * may or may not have a message here depending on the situation.
+             * Otherwise we just report the received message if any and if
+             * needed.
+             *
+             * Note: errormsg is already translated.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode,
-                                                 RecPtr ? RecPtr : EndRecPtr),
-                        (errmsg_internal("%s", errormsg) /* already translated */ ));
+            if (RecPtr == InvalidXLogRecPtr && !have_complained_at(EndRecPtr))
+                ereport(emode,
+                        (errmsg("reached end of WAL in %s at %X/%X on timeline %u",
+                                xlogSourceNames[currentSource],
+                                (uint32) (reportptr >> 32), (uint32) reportptr,
+                                ThisTimeLineID),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            else if (errormsg)
+                ereport(emode_for_corrupt_record(emode, reportptr),
+                        (errmsg_internal("%s", errormsg)));
         }
 
         /*
@@ -4331,7 +4349,7 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
                 !fetching_ckpt)
             {
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("entering archive recovery")));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -12160,8 +12178,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 static int
 emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
-    static XLogRecPtr lastComplaint = 0;
-
     if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
     {
         if (RecPtr == lastComplaint)
@@ -12172,6 +12188,13 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
     return emode;
 }
 
+/* Have we complained about the record at the location? */
+static bool
+have_complained_at(XLogRecPtr RecPtr)
+{
+    return lastComplaint == RecPtr;
+}
+
 /*
  * Check to see whether the user-specified trigger file exists and whether a
  * promote request has arrived.  If either condition holds, return true.
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2bd3d0e5e5..5f1f7e5d7e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -852,6 +852,9 @@ static bool bgwriterLaunched = false;
 static int    MyLockNo = 0;
 static bool holdingAllLocks = false;
 
+/* Have we complained about the record at the location? */
+static XLogRecPtr lastComplaint = 0;
+
 #ifdef WAL_DEBUG
 static MemoryContext walDebugCxt = NULL;
 #endif
@@ -889,6 +892,7 @@ static int    XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
                                         bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int    emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
+static bool have_complained_at(XLogRecPtr RecPtr);
 static void XLogFileClose(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
 static void RemoveTempXlogFiles(void);
@@ -4266,6 +4270,8 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
         EndRecPtr = xlogreader->EndRecPtr;
         if (record == NULL)
         {
+            XLogRecPtr reportptr = RecPtr ? RecPtr : EndRecPtr;
+
             if (readFile >= 0)
             {
                 close(readFile);
@@ -4273,13 +4279,33 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something.
+             * When we end up here while reading successive records, we reached
+             * the end of WAL unless XLogPageRead() has logged something. We
+             * may or may not have a message here depending on the situation.
+             * Otherwise we just report the received message if any and if
+             * needed.
+             *
+             * Note: errormsg is already translated.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode,
-                                                 RecPtr ? RecPtr : EndRecPtr),
-                        (errmsg_internal("%s", errormsg) /* already translated */ ));
+            if (RecPtr == InvalidXLogRecPtr && !have_complained_at(EndRecPtr))
+            {
+                char *operation = "crash recovery";
+
+                if (StandbyMode)
+                    operation = "streaming replication";
+                else if (InArchiveRecovery)
+                    operation = "archive recovery";
+
+                ereport(emode,
+                        (errmsg("reached end of WAL during %s at %X/%X on timeline %u",
+                                operation,
+                                (uint32) (reportptr >> 32), (uint32) reportptr,
+                                ThisTimeLineID),
+                         (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            }
+            else if (errormsg)
+                ereport(emode_for_corrupt_record(emode, reportptr),
+                        (errmsg_internal("%s", errormsg)));
         }
 
         /*
@@ -4331,7 +4357,7 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
                 !fetching_ckpt)
             {
                 ereport(DEBUG1,
-                        (errmsg_internal("reached end of WAL in pg_wal, entering archive recovery")));
+                        (errmsg_internal("entering archive recovery")));
                 InArchiveRecovery = true;
                 if (StandbyModeRequested)
                     StandbyMode = true;
@@ -12160,8 +12186,6 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 static int
 emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
 {
-    static XLogRecPtr lastComplaint = 0;
-
     if (readSource == XLOG_FROM_PG_WAL && emode == LOG)
     {
         if (RecPtr == lastComplaint)
@@ -12172,6 +12196,13 @@ emode_for_corrupt_record(int emode, XLogRecPtr RecPtr)
     return emode;
 }
 
+/* Have we complained about the record at the location? */
+static bool
+have_complained_at(XLogRecPtr RecPtr)
+{
+    return lastComplaint == RecPtr;
+}
+
 /*
  * Check to see whether the user-specified trigger file exists and whether a
  * promote request has arrived.  If either condition holds, return true.

Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> At Wed, 2 Oct 2019 19:24:02 -0400, Stephen Frost <sfrost@snowman.net> wrote in
>> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>>> Yeah, those messages are all pretty ancient, from when WAL was new and not
>>> to be trusted much.  Perhaps the thing to do is move the existing info
>>> into DETAIL and make the primary message be something like "reached
>>> apparent end of WAL stream".

>> Yes, +1 on that.

> What do you think about this?

It seems overcomplicated.  Why do you need to reach into
emode_for_corrupt_record's private state?  I think we could just
change the message text without that, and leave the emode
machinations as-is.

I don't understand the restriction to "if (RecPtr == InvalidXLogRecPtr)"
either?  Maybe that's fine but the comment fails to explain it.

Another point is that, as the comment for emode_for_corrupt_record
notes, we don't really want to consider that we've hit end-of-WAL
when reading any source except XLOG_FROM_PG_WAL.  So I think the
change of message ought to include a test of the source, but this
doesn't.

Also, the business with an "operation" string violates the message
translatability guideline about "don't assemble a message out of
phrases".  If we want to have that extra detail, it's better just
to make three separate ereport() calls with separately translatable
messages.

Also, it seems wrong that the page TLI check, just below, is where
it is and isn't part of the main set of page header sanity checks.
That's sort of unrelated to this patch, except not really, because
shouldn't a failure of that test also be treated as an "end of WAL"
condition?

            regards, tom lane



Thank you, and sorry for overlooking your comment.

At Thu, 14 Nov 2019 12:28:13 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in 
> Kyotaro Horiguchi <horikyota.ntt@gmail.com> writes:
> > At Wed, 2 Oct 2019 19:24:02 -0400, Stephen Frost <sfrost@snowman.net> wrote in 
> >> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
> >>> Yeah, those messages are all pretty ancient, from when WAL was new and not
> >>> to be trusted much.  Perhaps the thing to do is move the existing info
> >>> into DETAIL and make the primary message be something like "reached
> >>> apparent end of WAL stream".
> 
> >> Yes, +1 on that.
> 
> > What do you think about this?
> 
> It seems overcomplicated.  Why do you need to reach into
> emode_for_corrupt_record's private state?  I think we could just
> change the message text without that, and leave the emode
> machinations as-is.

I tried to avoid repeated messages at the same LSN, but that is already
done by emode_for_corrupt_record by reducing the error level. I reverted
that part.

> I don't understand the restriction to "if (RecPtr == InvalidXLogRecPtr)"
> either?  Maybe that's fine but the comment fails to explain it.

"When we end up here while reading successive records" meant that, but
it is not explicit. If RecPtr was not invalid, it means that the
caller is requesting a specific record that should exist. That is
not end-of-WAL at all. I rewrote the comment.

> Another point is that, as the comment for emode_for_corrupt_record
> notes, we don't really want to consider that we've hit end-of-WAL
> when reading any source except XLOG_FROM_PG_WAL.  So I think the
> change of message ought to include a test of the source, but this
> doesn't.

Maybe not. The function just mutes messages for a repeated error on
XLOG_FROM_PG_WAL. We do consider end-of-WAL while reading from
XLOG_FROM_ARCHIVE. In this version the "reached end of WAL" message is
not emitted when emode_for_corrupt_record decides not to show the message.

> Also, the business with an "operation" string violates the message
> translatability guideline about "don't assemble a message out of
> phrases".  If we want to have that extra detail, it's better just
> to make three separate ereport() calls with separately translatable
> messages.

Though I thought that the operation is just a noun and could safely be
handled separately, I followed your comment. And this version takes
the more verbose of the two previous approaches.

> Also, it seems wrong that the page TLI check, just below, is where
> it is and isn't part of the main set of page header sanity checks.
> That's sort of unrelated to this patch, except not really, because
> shouldn't a failure of that test also be treated as an "end of WAL"
> condition?

It seems to be checking whether the xlogreader did something wrong, since
it is pluggable. I'm not sure there is any concrete reason for that,
though. As for recovery, it is actually redundant because
XLogFileReadAnyTLI has already checked that for the record, but it doesn't
matter.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center
From 94ee9c982afc5b39bb062c59a95ac2323a4109c8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyota.ntt@gmail.com>
Date: Fri, 17 Jan 2020 17:15:58 +0900
Subject: [PATCH] Make the end-of-WAL message less scary

At the end of WAL during recovery, users see just an "invalid record"
message in the logs, which is too scary. This patch changes the message in
that case to "reached end of WAL" to let users know that they are not
in trouble.
---
 src/backend/access/transam/xlog.c | 48 +++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 8 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7f4f784c0e..a7757ea4bf 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4279,6 +4279,10 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
         EndRecPtr = xlogreader->EndRecPtr;
         if (record == NULL)
         {
+            int actual_emode =
+                emode_for_corrupt_record(emode,
+                                         RecPtr ? RecPtr : EndRecPtr);
+
             if (readFile >= 0)
             {
                 close(readFile);
@@ -4286,15 +4290,43 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
             }
 
             /*
-             * We only end up here without a message when XLogPageRead()
-             * failed - in that case we already logged something. In
-             * StandbyMode that only happens if we have been triggered, so we
-             * shouldn't loop anymore in that case.
+             * Invalid RecPtr here means we are reading successive records
+             * during recovery. If we get here during recovery, we assume that
+             * we reached the end of WAL.  Otherwise something's really wrong
+             * and we report only the errormsg if any. If we don't receive
+             * errormsg here, we already logged something.  We are going to
+             * emit duplicate message at the same LSN if
+             * emode_for_corrupt_record decided to mute it.  We don't repeat
+             * "reached end of WAL" in the muted messages.
+             *
+             * Note: errormsg is already translated.
              */
-            if (errormsg)
-                ereport(emode_for_corrupt_record(emode,
-                                                 RecPtr ? RecPtr : EndRecPtr),
-                        (errmsg_internal("%s", errormsg) /* already translated */ ));
+            if (RecPtr == InvalidXLogRecPtr && actual_emode == emode)
+            {
+                if (StandbyMode)
+                    ereport(actual_emode,
+                            (errmsg ("reached end of WAL at %X/%X on timeline %u in %s during streaming replication",
+                                     (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr,
+                                     ThisTimeLineID,
+                                     xlogSourceNames[currentSource]),
+                             (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+                else if (InArchiveRecovery)
+                    ereport(actual_emode,
+                            (errmsg ("reached end of WAL at %X/%X on timeline %u in %s during archive recovery",
+                                     (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr,
+                                     ThisTimeLineID,
+                                     xlogSourceNames[currentSource]),
+                             (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+                else
+                    ereport(actual_emode,
+                            (errmsg ("reached end of WAL at %X/%X on timeline %u in %s during crash recovery",
+                                     (uint32) (EndRecPtr >> 32), (uint32) EndRecPtr,
+                                     ThisTimeLineID,
+                                     xlogSourceNames[currentSource]),
+                             (errormsg ? errdetail_internal("%s", errormsg) : 0)));
+            }
+            else if (errormsg)
+                ereport(actual_emode, (errmsg_internal("%s", errormsg)));
         }
 
         /*
-- 
2.23.0