The following bug has been logged on the website:
Bug reference: 15938
Logged by: Alexander Kukushkin
Email address: cyberdemn@gmail.com
PostgreSQL version: 10.9
Operating system: Ubuntu 18.04.2 LTS
Description:
On one of our clusters, one of the postgres backend processes was killed by
the kernel OOM killer.
In the postgres log this is visible as:
2019-08-01 12:40:58.550 UTC,,,56,,5d3096d1.38,26,,2019-07-18 15:57:05
UTC,,0,LOG,00000,"server process (PID 6637) was terminated by signal 9:
Killed","Failed process was running: select ""code"", ""purchase_order"",
""po_event_id"", ""po_delivery_event_id"", ""created"", ""last_modified""
from ""zpofrog_data"".""purchase_order"" where
(((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((((",,,,,,,,""
As expected, the postmaster started terminating the other processes:
2019-08-01 12:40:58.550 UTC,,,56,,5d3096d1.38,27,,2019-07-18 15:57:05
UTC,,0,LOG,00000,"terminating any other active server
processes",,,,,,,,,""
2019-08-01 12:40:58.550
UTC,"read_write_user","purchase_orders_frontend_gateway",19603,"10.2.26.20:35944",5d42dbf7.4c93,3,"idle",2019-08-01
12:32:55 UTC,106/0,0,WARNING,57P02,"terminating connection because of crash
of another server process","The postmaster has commanded this server process
to roll back the current transaction and exit, because another server
process exited abnormally and possibly corrupted shared memory.","In a
moment you should be able to reconnect to the database and repeat your
command.",,,,,,,"PostgreSQL JDBC Driver"
... a bunch of similar lines ...
2019-08-01 12:40:59.165 UTC,,,5439,,5d30e37b.153f,1,,2019-07-18 21:24:11
UTC,,0,FATAL,XX000,"archive command was terminated by signal 3: Quit","The
failed archive command was: envdir ""/home/postgres/etc/wal-e.d/env"" wal-e
wal-push ""pg_wal/0000005000000B8D00000044""",,,,,,,,""
2019-08-01 12:40:59.179 UTC,,,56,,5d3096d1.38,28,,2019-07-18 15:57:05
UTC,,0,LOG,00000,"archiver process (PID 5439) exited with exit code
1",,,,,,,,,""
And crash recovery started:
2019-08-01 12:40:59.179 UTC,,,56,,5d3096d1.38,29,,2019-07-18 15:57:05
UTC,,0,LOG,00000,"all server processes terminated;
reinitializing",,,,,,,,,""
2019-08-01 12:40:59.423
UTC,,,22698,"10.2.73.6:47474",5d42dddb.58aa,1,"",2019-08-01 12:40:59
UTC,,0,LOG,00000,"connection received: host=10.2.73.6
port=47474",,,,,,,,,""
2019-08-01 12:40:59.423 UTC,,,22697,,5d42dddb.58a9,1,,2019-08-01 12:40:59
UTC,,0,LOG,00000,"database system was interrupted; last known up at
2019-08-01 12:37:59 UTC",,,,,,,,,""
2019-08-01 12:40:59.423
UTC,,,22699,"10.2.72.9:33244",5d42dddb.58ab,1,"",2019-08-01 12:40:59
UTC,,0,LOG,00000,"connection received: host=10.2.72.9
port=33244",,,,,,,,,""
2019-08-01 12:40:59.470
UTC,"standby","",22698,"10.2.73.6:47474",5d42dddb.58aa,2,"",2019-08-01
12:40:59 UTC,,0,FATAL,57P03,"the database system is in recovery
mode",,,,,,,,,""
2019-08-01 12:40:59.470
UTC,"standby","",22699,"10.2.72.9:33244",5d42dddb.58ab,2,"",2019-08-01
12:40:59 UTC,,0,FATAL,57P03,"the database system is in recovery
mode",,,,,,,,,""
2019-08-01 12:40:59.473
UTC,,,22700,"10.2.73.6:47478",5d42dddb.58ac,1,"",2019-08-01 12:40:59
UTC,,0,LOG,00000,"connection received: host=10.2.73.6
port=47478",,,,,,,,,""
2019-08-01 12:40:59.473
UTC,"standby","",22700,"10.2.73.6:47478",5d42dddb.58ac,2,"",2019-08-01
12:40:59 UTC,,0,FATAL,57P03,"the database system is in recovery
mode",,,,,,,,,""
2019-08-01 12:40:59.473
UTC,,,22701,"10.2.72.9:33246",5d42dddb.58ad,1,"",2019-08-01 12:40:59
UTC,,0,LOG,00000,"connection received: host=10.2.72.9
port=33246",,,,,,,,,""
2019-08-01 12:40:59.473
UTC,"standby","",22701,"10.2.72.9:33246",5d42dddb.58ad,2,"",2019-08-01
12:40:59 UTC,,0,FATAL,57P03,"the database system is in recovery
mode",,,,,,,,,""
2019-08-01 12:40:59.493 UTC,,,22697,,5d42dddb.58a9,2,,2019-08-01 12:40:59
UTC,,0,LOG,00000,"database system was not properly shut down; automatic
recovery in progress",,,,,,,,,""
2019-08-01 12:40:59.497 UTC,,,22697,,5d42dddb.58a9,3,,2019-08-01 12:40:59
UTC,,0,LOG,00000,"redo starts at B8D/193A7850",,,,,,,,,""
... a bunch of connection attempts ...
2019-08-01 12:41:06.376 UTC,,,22697,,5d42dddb.58a9,4,,2019-08-01 12:40:59
UTC,,0,LOG,00000,"redo done at B8D/44FFE570",,,,,,,,,""
2019-08-01 12:41:06.376 UTC,,,22697,,5d42dddb.58a9,5,,2019-08-01 12:40:59
UTC,,0,LOG,00000,"last completed transaction was at log time 2019-08-01
12:40:54.782747+00",,,,,,,,,""
2019-08-01 12:41:06.381 UTC,,,22697,,5d42dddb.58a9,6,,2019-08-01 12:40:59
UTC,,0,LOG,00000,"checkpoint starting: end-of-recovery
immediate",,,,,,,,,""
... a bunch of connection attempts ...
2019-08-01 12:41:15.780 UTC,,,22697,,5d42dddb.58a9,7,,2019-08-01 12:40:59
UTC,,0,LOG,00000,"checkpoint complete: wrote 94580 buffers (36.1%); 0 WAL
file(s) added, 6 removed, 20 recycled; write=9.349 s, sync=0.007 s,
total=9.400 s; sync files=45, longest=0.007 s, average=0.000 s;
distance=717151 kB, estimate=717151 kB",,,,,,,,,""
2019-08-01 12:41:15.824 UTC,,,56,,5d3096d1.38,30,,2019-07-18 15:57:05
UTC,,0,LOG,00000,"database system is ready to accept
connections",,,,,,,,,""
At 12:41:17 UTC (after recovery was done), the WAL segment
0000005000000B8D00000044 was archived.
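The archive time can be cross-checked on the primary with something along
these lines (shown only for illustration, not taken from the original logs):
    psql -c "select last_archived_wal, last_archived_time from pg_stat_archiver;"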
After that, both replicas started failing to replay WAL (logs from one of
the replicas):
2019-08-01 12:37:58.436 UTC,,,285,,5d30e6c4.11d,15294,,2019-07-18 21:38:12
UTC,,0,LOG,00000,"recovery restart point at B8C/FF1FB760","last completed
transaction was at log time 2019-08-01 12:37:58.161596+00",,,,,,,,""
2019-08-01 12:38:01.971 UTC,,,285,,5d30e6c4.11d,15295,,2019-07-18 21:38:12
UTC,,0,LOG,00000,"restartpoint starting: xlog",,,,,,,,,""
2019-08-01 12:40:58.589 UTC,,,2369,,5d30e6ca.941,2,,2019-07-18 21:38:18
UTC,,0,FATAL,XX000,"could not receive data from WAL stream: SSL SYSCALL
error: EOF detected",,,,,,,,,""
2019-08-01 12:40:59.473 UTC,,,7967,,5d42dddb.1f1f,1,,2019-08-01 12:40:59
UTC,,0,FATAL,XX000,"could not connect to the primary server: FATAL: the
database system is in recovery mode
FATAL: the database system is in recovery mode",,,,,,,,,""
2019-08-01 12:41:04.311 UTC,,,8134,,5d42dde0.1fc6,1,,2019-08-01 12:41:04
UTC,,0,FATAL,XX000,"could not connect to the primary server: FATAL: the
database system is in recovery mode
FATAL: the database system is in recovery mode",,,,,,,,,""
2019-08-01 12:41:09.292 UTC,,,8301,,5d42dde5.206d,1,,2019-08-01 12:41:09
UTC,,0,FATAL,XX000,"could not connect to the primary server: FATAL: the
database system is in recovery mode
FATAL: the database system is in recovery mode",,,,,,,,,""
2019-08-01 12:41:14.296 UTC,,,8468,,5d42ddea.2114,1,,2019-08-01 12:41:14
UTC,,0,FATAL,XX000,"could not connect to the primary server: FATAL: the
database system is in recovery mode
FATAL: the database system is in recovery mode",,,,,,,,,""
2019-08-01 12:41:19.309 UTC,,,8637,,5d42ddef.21bd,1,,2019-08-01 12:41:19
UTC,,0,LOG,00000,"started streaming WAL from primary at B8D/45000000 on
timeline 80",,,,,,,,,""
2019-08-01 12:41:19.351 UTC,,,57,,5d30e6c3.39,37,,2019-07-18 21:38:11
UTC,1/0,0,LOG,00000,"invalid contrecord length 3641 at
B8D/44FFF6D0",,,,,,,,,""
2019-08-01 12:41:19.359 UTC,,,8637,,5d42ddef.21bd,2,,2019-08-01 12:41:19
UTC,,0,FATAL,57P01,"terminating walreceiver process due to administrator
command",,,,,,,,,""
2019-08-01 12:41:19.837 UTC,,,57,,5d30e6c3.39,38,,2019-07-18 21:38:11
UTC,1/0,0,LOG,00000,"restored log file ""0000005000000B8D00000044"" from
archive",,,,,,,,,""
2019-08-01 12:41:19.866 UTC,,,57,,5d30e6c3.39,39,,2019-07-18 21:38:11
UTC,1/0,0,LOG,00000,"invalid record length at B8D/44FFF740: wanted 24, got
0",,,,,,,,,""
2019-08-01 12:41:19.866 UTC,,,57,,5d30e6c3.39,40,,2019-07-18 21:38:11
UTC,1/0,0,LOG,00000,"invalid record length at B8D/44FFF740: wanted 24, got
0",,,,,,,,,""
2019-08-01 12:41:24.841 UTC,,,57,,5d30e6c3.39,41,,2019-07-18 21:38:11
UTC,1/0,0,LOG,00000,"restored log file ""0000005000000B8D00000044"" from
archive",,,,,,,,,""
2019-08-01 12:41:24.989 UTC,,,57,,5d30e6c3.39,42,,2019-07-18 21:38:11
UTC,1/0,0,LOG,00000,"invalid record length at B8D/44FFF740: wanted 24, got
0",,,,,,,,,""
2019-08-01 12:41:24.989 UTC,,,57,,5d30e6c3.39,43,,2019-07-18 21:38:11
UTC,1/0,0,LOG,00000,"invalid record length at B8D/44FFF740: wanted 24, got
0",,,,,,,,,""
And so on.
Unfortunately, I can't compare the file in the archive with the one on the
primary, because the primary's copy was already recycled despite the use of
replication slots.
The pg_replication_slots view reports restart_lsn as B8D/451EB540 (already in
the next segment).
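The slot position can be read with something along these lines
(pg_walfile_name() is included only to show which segment restart_lsn falls
into):
    psql -c "select slot_name, restart_lsn, pg_walfile_name(restart_lsn) from pg_replication_slots;"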
But I can run pg_waldump, and the last record in this file is
CHECKPOINT_SHUTDOWN:
rmgr: XLOG len (rec/tot): 106/ 106, tx: 0, lsn:
B8D/44FFF6D0, prev B8D/44FFE570, desc: CHECKPOINT_SHUTDOWN redo
B8D/44FFF6D0; tli 80; prev tli 80; fpw true; xid 0:42293547; oid 10741249;
multi 1; offset 0; oldest xid 549 in DB 1; oldest multi 1 in DB 1;
oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
pg_waldump: FATAL: error in WAL record at B8D/44FFF6D0: invalid record
length at B8D/44FFF740: wanted 24, got 0
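For completeness, the output above corresponds to an invocation along these
lines (run from the directory containing the archived segment; exact paths
omitted, and the precise setup is my assumption):
    pg_waldump 0000005000000B8D00000044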