Hi,
I've run into a strange issue after restoring a backup that I created
using pg_basebackup on a standby instance. The command I use to create
the backup is this:
pg_basebackup -v --write-recovery-conf -h$BACKUP_HOST -p5432
-U$BACKUP_USER --format tar --wal-method stream --compress=2 -D
"$BACKUP_DIR"
This backup runs fine and produces a "base.tar.gz" and "pg_wal.tar.gz"
file. The server version is 11.7.
To restore the backup on another system I unpack the "base.tar.gz" file
into the data directory and "pg_wal.tar.gz" into the "pg_wal"
sub-directory. I then comment out the "primary_conninfo" directive in
the "recovery.conf" file since I don't want to replicate from the
primary on this system I just want to restore the state from the backup.
All of this works fine and the logs report that the db reaches a
consistent recovery state but as last entry it reports an "invalid
record length":
2020-11-13 12:25:34.266 UTC [1] LOG: listening on IPv4 address
"0.0.0.0", port 5432
2020-11-13 12:25:34.266 UTC [1] LOG: listening on IPv6 address "::",
port 5432
2020-11-13 12:25:34.269 UTC [1] LOG: listening on Unix socket
"/var/run/postgresql/.s.PGSQL.5432"
2020-11-13 12:25:34.437 UTC [20] LOG: database system was interrupted
while in recovery at log time 2020-11-03 14:36:00 UTC
2020-11-13 12:25:34.437 UTC [20] HINT: If this has occurred more than
once some data might be corrupted and you might need to choose an
earlier recovery target.
2020-11-13 12:25:35.359 UTC [20] WARNING: recovery command file
"recovery.conf" specified neither primary_conninfo nor restore_command
2020-11-13 12:25:35.359 UTC [20] HINT: The database server will
regularly poll the pg_wal subdirectory to check for files placed there.
2020-11-13 12:25:35.359 UTC [20] LOG: entering standby mode
2020-11-13 12:25:35.372 UTC [20] LOG: redo starts at 932/9D123E70
2020-11-13 12:26:45.435 UTC [20] LOG: consistent recovery state reached
at 933/AA63F600
2020-11-13 12:26:45.435 UTC [1] LOG: database system is ready to accept
read only connections
2020-11-13 12:26:45.444 UTC [20] LOG: invalid record length at
933/AA6DA660: wanted 24, got 0
Looking at the WAL data the last file is "0000000100000933000000AA" and
there is a file "archive_status/0000000100000933000000A9.done" but no
corresponding ".done" file for the "...AA" WAL file.
Running pg_waldump on the "...AA" file shows the following at the end:
$ pg_waldump 0000000100000933000000AA
...
rmgr: Heap len (rec/tot): 54/ 54, tx: 1743220, lsn:
933/AA6DA510, prev 933/AA6D8F08, desc: LOCK off 12: xid 1743220: flags 0
LOCK_ONLY KEYSHR_LOCK , blkref #0: rel 1663/13117/874105 blk 0
rmgr: Heap len (rec/tot): 81/ 81, tx: 1743220, lsn:
933/AA6DA548, prev 933/AA6DA510, desc: INSERT off 19, blkref #0: rel
1663/13117/868296 blk 0
rmgr: Btree len (rec/tot): 80/ 80, tx: 1743220, lsn:
933/AA6DA5A0, prev 933/AA6DA548, desc: INSERT_LEAF off 24, blkref #0:
rel 1663/13117/868299 blk 1
rmgr: Btree len (rec/tot): 72/ 72, tx: 1743220, lsn:
933/AA6DA5F0, prev 933/AA6DA5A0, desc: INSERT_LEAF off 23, blkref #0:
rel 1663/13117/915254 blk 2
rmgr: Transaction len (rec/tot): 34/ 34, tx: 1743220, lsn:
933/AA6DA638, prev 933/AA6DA5F0, desc: COMMIT 2020-11-03 14:49:29.205562 UTC
pg_waldump: FATAL: error in WAL record at 933/AA6DA638: invalid record
length at 933/AA6DA660: wanted 24, got 0
It looks like the WAL in the backup is inconsistent at the end but I'd
expect pg_basebackup to create a clean backup. Is this something I
should expect or is this a problem with pg_basebackup?
Regards,
Dennis