BUG #15346: Replica fails to start after the crash - Mailing list pgsql-bugs
From: PG Bug reporting form
Subject: BUG #15346: Replica fails to start after the crash
Msg-id: 153492341830.1368.3936905691758473953@wrigleys.postgresql.org
Responses: Re: BUG #15346: Replica fails to start after the crash
List: pgsql-bugs
The following bug has been logged on the website:

Bug reference:      15346
Logged by:          Alexander Kukushkin
Email address:      cyberdemn@gmail.com
PostgreSQL version: 9.6.10
Operating system:   Ubuntu

Description:

We run quite a lot of clusters, and from time to time EC2 instances die, but until now we have never had the problem described here. We first hit it last week while upgrading to the next minor version: a few replicas failed to start with very similar symptoms, and I simply reinitialized them because I was in a hurry. But this morning one of the EC2 instances running a replica crashed. After the crash, a new instance was spawned and the existing EBS volume (with the existing PGDATA) was attached to it.

Postgres logs before the crash:

2018-08-22 05:14:43.036 UTC,,,23838,,5b78e9a0.5d1e,2636,,2018-08-19 03:53:04 UTC,,0,LOG,00000,"restartpoint complete: wrote 82957 buffers (17.0%); 0 transaction log file(s) added, 3 removed, 7 recycled; write=269.887 s, sync=0.012 s, total=269.926 s; sync files=105, longest=0.002 s, average=0.000 s; distance=168429 kB, estimate=173142 kB",,,,,,,,,""
2018-08-22 05:14:43.036 UTC,,,23838,,5b78e9a0.5d1e,2637,,2018-08-19 03:53:04 UTC,,0,LOG,00000,"recovery restart point at AB3/37BD3C60","last completed transaction was at log time 2018-08-22 05:14:43.022752+00",,,,,,,,""
2018-08-22 05:15:13.065 UTC,,,23838,,5b78e9a0.5d1e,2638,,2018-08-19 03:53:04 UTC,,0,LOG,00000,"restartpoint starting: time",,,,,,,,,""
2018-08-22 05:19:43.093 UTC,,,23838,,5b78e9a0.5d1e,2639,,2018-08-19 03:53:04 UTC,,0,LOG,00000,"restartpoint complete: wrote 86588 buffers (17.7%); 0 transaction log file(s) added, 2 removed, 8 recycled; write=269.983 s, sync=0.017 s, total=270.028 s; sync files=112, longest=0.002 s, average=0.000 s; distance=156445 kB, estimate=171472 kB",,,,,,,,,""
2018-08-22 05:19:43.093 UTC,,,23838,,5b78e9a0.5d1e,2640,,2018-08-19 03:53:04 UTC,,0,LOG,00000,"recovery restart point at AB3/4149B340","last completed transaction was at log time 2018-08-22 05:19:43.077943+00",,,,,,,,""
2018-08-22 05:20:13.121 UTC,,,23838,,5b78e9a0.5d1e,2641,,2018-08-19 03:53:04 UTC,,0,LOG,00000,"restartpoint starting: time",,,,,,,,,""

Here it crashed. Now postgres fails to start with the following logs:

2018-08-22 06:22:12.994 UTC,,,51,,5b7d0114.33,1,,2018-08-22 06:22:12 UTC,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,""
2018-08-22 06:22:12.996 UTC,,,54,,5b7d0114.36,1,,2018-08-22 06:22:12 UTC,,0,LOG,00000,"database system was interrupted while in recovery at log time 2018-08-22 05:10:29 UTC",,"If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.",,,,,,,""
2018-08-22 06:22:13.232 UTC,,,54,,5b7d0114.36,2,,2018-08-22 06:22:12 UTC,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2018-08-22 06:22:13.420 UTC,,,54,,5b7d0114.36,3,,2018-08-22 06:22:12 UTC,,0,LOG,00000,"restored log file ""00000005.history"" from archive",,,,,,,,,""
2018-08-22 06:22:13.430 UTC,,,147,"[local]",5b7d0115.93,1,"",2018-08-22 06:22:13 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-08-22 06:22:13.430 UTC,"postgres","postgres",147,"[local]",5b7d0115.93,2,"",2018-08-22 06:22:13 UTC,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2018-08-22 06:22:13.435 UTC,,,149,"[local]",5b7d0115.95,1,"",2018-08-22 06:22:13 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-08-22 06:22:13.435 UTC,"postgres","postgres",149,"[local]",5b7d0115.95,2,"",2018-08-22 06:22:13 UTC,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2018-08-22 06:22:13.912 UTC,,,54,,5b7d0114.36,4,,2018-08-22 06:22:12 UTC,,0,LOG,00000,"restored log file ""0000000500000AB30000004A"" from archive",,,,,,,,,""
...
(restores WAL files from S3)

2018-08-22 06:22:23.489 UTC,,,54,,5b7d0114.36,22,,2018-08-22 06:22:12 UTC,1/0,0,LOG,00000,"restored log file ""0000000500000AB300000050"" from archive",,,,,,,,,""
2018-08-22 06:22:23.633 UTC,,,54,,5b7d0114.36,23,,2018-08-22 06:22:12 UTC,1/0,0,WARNING,01000,"page 179503104 of relation base/18055/212875 does not exist",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182 items",,,,""
2018-08-22 06:22:23.634 UTC,,,54,,5b7d0114.36,24,,2018-08-22 06:22:12 UTC,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182 items",,,,""
2018-08-22 06:22:23.846 UTC,,,51,,5b7d0114.33,3,,2018-08-22 06:22:12 UTC,,0,LOG,00000,"startup process (PID 54) was terminated by signal 6: Aborted",,,,,,,,,""
2018-08-22 06:22:23.846 UTC,,,51,,5b7d0114.33,4,,2018-08-22 06:22:12 UTC,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""
2018-08-22 06:22:23.847 UTC,"postgres","postgres",842,"[local]",5b7d011c.34a,3,"idle",2018-08-22 06:22:20 UTC,3/0,0,WARNING,57P02,"terminating connection because of crash of another server process","The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.","In a moment you should be able to reconnect to the database and repeat your command.",,,,,,,"Patroni"
2018-08-22 06:22:23.945 UTC,,,51,,5b7d0114.33,5,,2018-08-22 06:22:12 UTC,,0,LOG,00000,"database system is shut down",,,,,,,,,""
2018-08-22 06:22:35.294 UTC,,,1135,,5b7d012b.46f,1,,2018-08-22 06:22:35 UTC,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,""
2018-08-22 06:22:35.295 UTC,,,1138,,5b7d012b.472,1,,2018-08-22 06:22:35 UTC,,0,LOG,00000,"database system was interrupted while in recovery at log time 2018-08-22 05:10:29 UTC",,"If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.",,,,,,,""

Here are a few lines from pg_xlogdump 0000000500000AB300000050 around LSN AB3/50323E78:

rmgr: Heap        len (rec/tot):  54/  54, tx: 518902809, lsn: AB3/50323D30, prev AB3/50323CE8, desc: LOCK off 18: xid 518902809: flags 0 LOCK_ONLY KEYSHR_LOCK , blkref #0: rel 1663/18053/213284 blk 137366
rmgr: Btree       len (rec/tot):  88/  88, tx: 518902810, lsn: AB3/50323D68, prev AB3/50323D30, desc: INSERT_LEAF off 66, blkref #0: rel 1663/18055/212936 blk 32947
rmgr: Btree       len (rec/tot):  96/  96, tx: 518902810, lsn: AB3/50323DC0, prev AB3/50323D68, desc: INSERT_LEAF off 3, blkref #0: rel 1663/18055/212935 blk 6
rmgr: Btree       len (rec/tot):  88/  88, tx: 518902810, lsn: AB3/50323E20, prev AB3/50323DC0, desc: INSERT_LEAF off 12, blkref #0: rel 1663/18055/212930 blk 25152
rmgr: Btree       len (rec/tot): 436/ 547, tx: 518902810, lsn: AB3/50323E78, prev AB3/50323E20, desc: DELETE 182 items, blkref #0: rel 1663/18055/212907 blk 72478 FPW
rmgr: Btree       len (rec/tot):  88/  88, tx: 518902810, lsn: AB3/503240B8, prev AB3/50323E78, desc: INSERT_LEAF off 2, blkref #0: rel 1663/18055/212907 blk 72478
rmgr: Gist        len (rec/tot):  55/ 163, tx: 518902810, lsn: AB3/50324110, prev AB3/503240B8, desc: PAGE_UPDATE , blkref #0: rel 1663/18055/212906 blk 273288 FPW
rmgr: Btree       len (rec/tot): 160/ 160, tx: 518902810, lsn: AB3/503241B8, prev AB3/50324110, desc: INSERT_LEAF off 8, blkref #0: rel 1663/18055/212892 blk 85620
rmgr: Gist        len (rec/tot):  57/ 201, tx: 518902812, lsn: AB3/50324258, prev AB3/503241B8, desc: PAGE_UPDATE , blkref #0: rel 1663/18053/213364 blk 99265 FPW
rmgr: Btree       len (rec/tot):  55/2403, tx: 518902810, lsn: AB3/50324328, prev AB3/50324258, desc: INSERT_LEAF off 13, blkref #0: rel 1663/18055/212893 blk 53544 FPW
rmgr: Btree       len (rec/tot):  55/1464, tx: 518902812, lsn: AB3/50324C90, prev AB3/50324328, desc: INSERT_LEAF off 83, blkref #0: rel 1663/18053/213369 blk 41494 FPW
rmgr: Transaction len (rec/tot):  34/  34, tx: 518902810, lsn: AB3/50325248, prev AB3/50324C90, desc: COMMIT 2018-08-22 05:18:03.670434 UTC

And the list of files for the relation in question:

$ ls -al base/18055/212875*
-rw------- 1 postgres postgres 1073741824 Aug 22 05:20 base/18055/212875
-rw------- 1 postgres postgres 1014947840 Aug 22 05:21 base/18055/212875.1
-rw------- 1 postgres postgres     532480 Aug 22 05:21 base/18055/212875_fsm
-rw------- 1 postgres postgres      65536 Aug 22 05:21 base/18055/212875_vm

What I find really strange is the huge difference in page numbers between the postgres log and the pg_xlogdump output: "page 179503104 of relation base/18055/212875 does not exist" vs. "rel 1663/18055/212907 blk 72478 FPW". I will keep this instance around for further investigation and would be happy to provide more details if you need them.
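To put that page number in perspective: assuming the stock build defaults (8 kB BLCKSZ, 1 GB RELSEG_SIZE, 16 MB WAL segments), block 179503104 would live in segment file base/18055/212875.1369 and implies a relation of well over a terabyte, while only the first two segments (about 2 GB) exist on disk. A quick sketch of that arithmetic in plain Python, which also confirms the failing redo LSN maps to the WAL segment that was just restored:

```python
# Back-of-the-envelope checks, assuming default build parameters:
# BLCKSZ = 8 kB, RELSEG_SIZE = 1 GB, WAL segment size = 16 MB.
BLCKSZ = 8192
BLOCKS_PER_SEGMENT = (1 << 30) // BLCKSZ  # 131072 blocks per 1 GB segment file

def segment_for_block(blkno):
    """Map a relation block number to (segment file suffix, block offset)."""
    return divmod(blkno, BLOCKS_PER_SEGMENT)

def wal_segment_name(tli, lsn, wal_seg_size=16 * 1024 * 1024):
    """Map an LSN string like 'AB3/50323E78' to its WAL segment file name."""
    hi, lo = (int(part, 16) for part in lsn.split("/"))
    segno = ((hi << 32) | lo) // wal_seg_size
    return "%08X%08X%08X" % (tli, segno >> 8, segno & 0xFF)

# The page from the WARNING would sit in segment .1369 of the relation,
# but only base/18055/212875 and base/18055/212875.1 exist on disk:
print(segment_for_block(179503104))     # (1369, 65536)
print(179503104 * BLCKSZ // (1 << 30))  # implied relation size in GB: 1369

# The redo record that PANICs is in the segment restored just before it:
print(wal_segment_name(5, "AB3/50323E78"))  # 0000000500000AB300000050
```

So the complained-about block is not just different from the one in the Btree/DELETE record (blk 72478); it is orders of magnitude beyond the physical end of base/18055/212875.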