Thread: BUG #15346: Replica fails to start after the crash
The following bug has been logged on the website:

Bug reference:      15346
Logged by:          Alexander Kukushkin
Email address:      cyberdemn@gmail.com
PostgreSQL version: 9.6.10
Operating system:   Ubuntu
Description:

We run quite a lot of clusters, and from time to time EC2 instances die, but so far we have never had the problem I describe here. We first experienced it last week while upgrading to the next minor version: a few replicas failed to start with very similar symptoms, and I simply reinitialized them because I was in a hurry. But this morning one of the EC2 instances running a replica crashed. After the crash, a new instance was spawned and the existing EBS volume (with the existing PGDATA) was attached to it.

Postgres logs before the crash:

2018-08-22 05:14:43.036 UTC,,,23838,,5b78e9a0.5d1e,2636,,2018-08-19 03:53:04 UTC,,0,LOG,00000,"restartpoint complete: wrote 82957 buffers (17.0%); 0 transaction log file(s) added, 3 removed, 7 recycled; write=269.887 s, sync=0.012 s, total=269.926 s; sync files=105, longest=0.002 s, average=0.000 s; distance=168429 kB, estimate=173142 kB",,,,,,,,,""
2018-08-22 05:14:43.036 UTC,,,23838,,5b78e9a0.5d1e,2637,,2018-08-19 03:53:04 UTC,,0,LOG,00000,"recovery restart point at AB3/37BD3C60","last completed transaction was at log time 2018-08-22 05:14:43.022752+00",,,,,,,,""
2018-08-22 05:15:13.065 UTC,,,23838,,5b78e9a0.5d1e,2638,,2018-08-19 03:53:04 UTC,,0,LOG,00000,"restartpoint starting: time",,,,,,,,,""
2018-08-22 05:19:43.093 UTC,,,23838,,5b78e9a0.5d1e,2639,,2018-08-19 03:53:04 UTC,,0,LOG,00000,"restartpoint complete: wrote 86588 buffers (17.7%); 0 transaction log file(s) added, 2 removed, 8 recycled; write=269.983 s, sync=0.017 s, total=270.028 s; sync files=112, longest=0.002 s, average=0.000 s; distance=156445 kB, estimate=171472 kB",,,,,,,,,""
2018-08-22 05:19:43.093 UTC,,,23838,,5b78e9a0.5d1e,2640,,2018-08-19 03:53:04 UTC,,0,LOG,00000,"recovery restart point at AB3/4149B340","last completed transaction was at log time 2018-08-22 05:19:43.077943+00",,,,,,,,""
2018-08-22 05:20:13.121 UTC,,,23838,,5b78e9a0.5d1e,2641,,2018-08-19 03:53:04 UTC,,0,LOG,00000,"restartpoint starting: time",,,,,,,,,""

Here it crashed.
Now postgres fails to start, with the following logs:

2018-08-22 06:22:12.994 UTC,,,51,,5b7d0114.33,1,,2018-08-22 06:22:12 UTC,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,""
2018-08-22 06:22:12.996 UTC,,,54,,5b7d0114.36,1,,2018-08-22 06:22:12 UTC,,0,LOG,00000,"database system was interrupted while in recovery at log time 2018-08-22 05:10:29 UTC",,"If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.",,,,,,,""
2018-08-22 06:22:13.232 UTC,,,54,,5b7d0114.36,2,,2018-08-22 06:22:12 UTC,,0,LOG,00000,"entering standby mode",,,,,,,,,""
2018-08-22 06:22:13.420 UTC,,,54,,5b7d0114.36,3,,2018-08-22 06:22:12 UTC,,0,LOG,00000,"restored log file ""00000005.history"" from archive",,,,,,,,,""
2018-08-22 06:22:13.430 UTC,,,147,"[local]",5b7d0115.93,1,"",2018-08-22 06:22:13 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-08-22 06:22:13.430 UTC,"postgres","postgres",147,"[local]",5b7d0115.93,2,"",2018-08-22 06:22:13 UTC,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2018-08-22 06:22:13.435 UTC,,,149,"[local]",5b7d0115.95,1,"",2018-08-22 06:22:13 UTC,,0,LOG,00000,"connection received: host=[local]",,,,,,,,,""
2018-08-22 06:22:13.435 UTC,"postgres","postgres",149,"[local]",5b7d0115.95,2,"",2018-08-22 06:22:13 UTC,,0,FATAL,57P03,"the database system is starting up",,,,,,,,,""
2018-08-22 06:22:13.912 UTC,,,54,,5b7d0114.36,4,,2018-08-22 06:22:12 UTC,,0,LOG,00000,"restored log file ""0000000500000AB30000004A"" from archive",,,,,,,,,""
... restores WAL files from S3 ...
2018-08-22 06:22:23.489 UTC,,,54,,5b7d0114.36,22,,2018-08-22 06:22:12 UTC,1/0,0,LOG,00000,"restored log file ""0000000500000AB300000050"" from archive",,,,,,,,,""
2018-08-22 06:22:23.633 UTC,,,54,,5b7d0114.36,23,,2018-08-22 06:22:12 UTC,1/0,0,WARNING,01000,"page 179503104 of relation base/18055/212875 does not exist",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182 items",,,,""
2018-08-22 06:22:23.634 UTC,,,54,,5b7d0114.36,24,,2018-08-22 06:22:12 UTC,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182 items",,,,""
2018-08-22 06:22:23.846 UTC,,,51,,5b7d0114.33,3,,2018-08-22 06:22:12 UTC,,0,LOG,00000,"startup process (PID 54) was terminated by signal 6: Aborted",,,,,,,,,""
2018-08-22 06:22:23.846 UTC,,,51,,5b7d0114.33,4,,2018-08-22 06:22:12 UTC,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""
2018-08-22 06:22:23.847 UTC,"postgres","postgres",842,"[local]",5b7d011c.34a,3,"idle",2018-08-22 06:22:20 UTC,3/0,0,WARNING,57P02,"terminating connection because of crash of another server process","The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.","In a moment you should be able to reconnect to the database and repeat your command.",,,,,,,"Patroni"
2018-08-22 06:22:23.945 UTC,,,51,,5b7d0114.33,5,,2018-08-22 06:22:12 UTC,,0,LOG,00000,"database system is shut down",,,,,,,,,""
2018-08-22 06:22:35.294 UTC,,,1135,,5b7d012b.46f,1,,2018-08-22 06:22:35 UTC,,0,LOG,00000,"ending log output to stderr",,"Future log output will go to log destination ""csvlog"".",,,,,,,""
2018-08-22 06:22:35.295 UTC,,,1138,,5b7d012b.472,1,,2018-08-22 06:22:35 UTC,,0,LOG,00000,"database system was interrupted while in recovery at log time 2018-08-22 05:10:29 UTC",,"If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.",,,,,,,""
Here are a few log lines from pg_xlogdump of 0000000500000AB300000050 around LSN AB3/50323E78:

rmgr: Heap        len (rec/tot):  54/    54, tx:  518902809, lsn: AB3/50323D30, prev AB3/50323CE8, desc: LOCK off 18: xid 518902809: flags 0 LOCK_ONLY KEYSHR_LOCK , blkref #0: rel 1663/18053/213284 blk 137366
rmgr: Btree       len (rec/tot):  88/    88, tx:  518902810, lsn: AB3/50323D68, prev AB3/50323D30, desc: INSERT_LEAF off 66, blkref #0: rel 1663/18055/212936 blk 32947
rmgr: Btree       len (rec/tot):  96/    96, tx:  518902810, lsn: AB3/50323DC0, prev AB3/50323D68, desc: INSERT_LEAF off 3, blkref #0: rel 1663/18055/212935 blk 6
rmgr: Btree       len (rec/tot):  88/    88, tx:  518902810, lsn: AB3/50323E20, prev AB3/50323DC0, desc: INSERT_LEAF off 12, blkref #0: rel 1663/18055/212930 blk 25152
rmgr: Btree       len (rec/tot): 436/   547, tx:  518902810, lsn: AB3/50323E78, prev AB3/50323E20, desc: DELETE 182 items, blkref #0: rel 1663/18055/212907 blk 72478 FPW
rmgr: Btree       len (rec/tot):  88/    88, tx:  518902810, lsn: AB3/503240B8, prev AB3/50323E78, desc: INSERT_LEAF off 2, blkref #0: rel 1663/18055/212907 blk 72478
rmgr: Gist        len (rec/tot):  55/   163, tx:  518902810, lsn: AB3/50324110, prev AB3/503240B8, desc: PAGE_UPDATE , blkref #0: rel 1663/18055/212906 blk 273288 FPW
rmgr: Btree       len (rec/tot): 160/   160, tx:  518902810, lsn: AB3/503241B8, prev AB3/50324110, desc: INSERT_LEAF off 8, blkref #0: rel 1663/18055/212892 blk 85620
rmgr: Gist        len (rec/tot):  57/   201, tx:  518902812, lsn: AB3/50324258, prev AB3/503241B8, desc: PAGE_UPDATE , blkref #0: rel 1663/18053/213364 blk 99265 FPW
rmgr: Btree       len (rec/tot):  55/  2403, tx:  518902810, lsn: AB3/50324328, prev AB3/50324258, desc: INSERT_LEAF off 13, blkref #0: rel 1663/18055/212893 blk 53544 FPW
rmgr: Btree       len (rec/tot):  55/  1464, tx:  518902812, lsn: AB3/50324C90, prev AB3/50324328, desc: INSERT_LEAF off 83, blkref #0: rel 1663/18053/213369 blk 41494 FPW
rmgr: Transaction len (rec/tot):  34/    34, tx:  518902810, lsn: AB3/50325248, prev AB3/50324C90, desc: COMMIT 2018-08-22 05:18:03.670434 UTC

And the list of files for the relation mentioned in the log (base/18055/212875):

$ ls -al base/18055/212875*
-rw------- 1 postgres postgres 1073741824 Aug 22 05:20 base/18055/212875
-rw------- 1 postgres postgres 1014947840 Aug 22 05:21 base/18055/212875.1
-rw------- 1 postgres postgres     532480 Aug 22 05:21 base/18055/212875_fsm
-rw------- 1 postgres postgres      65536 Aug 22 05:21 base/18055/212875_vm

What I find really strange is the huge difference between the page number in the postgres log and the block number in the pg_xlogdump output: "page 179503104 of relation base/18055/212875 does not exist" vs. "rel 1663/18055/212907 blk 72478 FPW".

I will keep this instance around for further investigation and would be happy to provide more details if you need them.
On Wed, Aug 22, 2018 at 07:36:58AM +0000, PG Bug reporting form wrote:
> 2018-08-22 06:22:23.633 UTC,,,54,,5b7d0114.36,23,,2018-08-22 06:22:12
> UTC,1/0,0,WARNING,01000,"page 179503104 of relation base/18055/212875 does
> not exist",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182
> items",,,,""
> 2018-08-22 06:22:23.634 UTC,,,54,,5b7d0114.36,24,,2018-08-22 06:22:12
> UTC,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,"xlog
> redo at AB3/50323E78 for Btree/DELETE: 182 items",,,,""

Once recovery has reached a consistent state, the startup process checks whether any invalid pages are tracked in a dedicated hash table and complains loudly about them. What matters here is not the last record or its surroundings, but whether this page was referenced by any of the records replayed during recovery up to the consistent point. Do any of the records in the WAL segments you fetched reference this page?

A page is 8kB, and the page number is 179503104, which is definitely weird, as it would correspond to a relation file of more than 1000GB. If the record itself is in bad shape, this may be a corrupted segment. As far as I can see you only have one incorrect page reference (see XLogCheckInvalidPages in xlogutils.c).

> I will keep this instance around for further investigation and would be
> happy to provide some more details if you need.

That would be nice!
--
Michael
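For reference, the bookkeeping Michael describes lives in src/backend/access/transam/xlogutils.c. The following is an abridged sketch of it, not the exact 9.6 code; the hash-table plumbing and report_invalid_page() details are elided:

/* Abridged illustration of log_invalid_page() in xlogutils.c */
static void
log_invalid_page(RelFileNode node, ForkNumber forkno, BlockNumber blkno,
				 bool present)
{
	/*
	 * Before the consistent point, a reference to a missing page is not
	 * necessarily an error (the page may be created or truncated away by a
	 * later record), so it is only remembered in a backend-local hash
	 * table, invalid_page_tab.
	 */
	if (reachedConsistency)
	{
		/*
		 * Once consistency has been reached, such a reference cannot be
		 * legitimate any more: warn about the page and PANIC right away.
		 */
		report_invalid_page(WARNING, node, forkno, blkno, present);
		elog(PANIC, "WAL contains references to invalid pages");
	}

	/*
	 * ... otherwise enter (node, forkno, blkno) into invalid_page_tab.
	 * XLogCheckInvalidPages() re-checks that table at the end of recovery
	 * and PANICs if any entry is still unresolved.
	 */
}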
Hi,

I've figured out what is going on. On this server we have a background worker, which is started from shared_preload_libraries. In order to debug and reproduce the problem, I removed everything from the background worker code except _PG_init, worker_main, and a couple of signal-handler functions. Here is the code:

void worker_main(Datum main_arg)
{
	pqsignal(SIGHUP, bg_mon_sighup);
	pqsignal(SIGTERM, bg_mon_sigterm);
	if (signal(SIGPIPE, SIG_IGN) == SIG_ERR)
		proc_exit(1);

	BackgroundWorkerUnblockSignals();

	BackgroundWorkerInitializeConnection("postgres", NULL);

	while (!got_sigterm)
	{
		int rc = WaitLatch(MyLatch, WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
						   naptime*1000L);

		ResetLatch(MyLatch);

		if (rc & WL_POSTMASTER_DEATH)
			proc_exit(1);
	}

	proc_exit(1);
}

void _PG_init(void)
{
	BackgroundWorker worker;

	if (!process_shared_preload_libraries_in_progress)
		return;

	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
	worker.bgw_start_time = BgWorkerStart_ConsistentState;
	worker.bgw_restart_time = 1;
	worker.bgw_main = worker_main;
	worker.bgw_notify_pid = 0;
	snprintf(worker.bgw_name, BGW_MAXLEN, "my_worker");
	RegisterBackgroundWorker(&worker);
}

Most of this code is taken from "worker_spi.c". Basically, it just initializes a connection to the postgres database and then sleeps the whole time.

If I comment out the 'BackgroundWorkerInitializeConnection("postgres", NULL);' call, postgres starts without any problem. That is very strange, because the background worker itself is not doing anything...

And one more thing: if I add sleep(15) before calling BackgroundWorkerInitializeConnection, postgres manages to start successfully. Is there a very strange race condition here?

Regards,
--
Alexander Kukushkin
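For completeness, the excerpt above omits the includes, globals, and signal handlers it relies on (got_sigterm, naptime, bg_mon_sighup, bg_mon_sigterm). Following worker_spi, they would look roughly like this; the exact bodies in Alexander's module may differ:

#include "postgres.h"
#include "fmgr.h"
#include "miscadmin.h"
#include "postmaster/bgworker.h"
#include "storage/ipc.h"
#include "storage/latch.h"
#include "storage/proc.h"

PG_MODULE_MAGIC;

static volatile sig_atomic_t got_sigterm = false;
static volatile sig_atomic_t got_sighup = false;
static int naptime = 1;			/* seconds between wakeups */

/* SIGTERM: set a flag and poke the latch so worker_main wakes up */
static void
bg_mon_sigterm(SIGNAL_ARGS)
{
	int save_errno = errno;

	got_sigterm = true;
	SetLatch(MyLatch);
	errno = save_errno;
}

/* SIGHUP: same pattern, normally used to trigger a config reload */
static void
bg_mon_sighup(SIGNAL_ARGS)
{
	int save_errno = errno;

	got_sighup = true;
	SetLatch(MyLatch);
	errno = save_errno;
}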
On 2018-Aug-22, Alexander Kukushkin wrote:

> If I comment out the 'BackgroundWorkerInitializeConnection("postgres",
> NULL);' call, postgres starts without any problem. That is very strange,
> because the background worker itself is not doing anything...
>
> And one more thing: if I add sleep(15) before calling
> BackgroundWorkerInitializeConnection, postgres manages to start
> successfully. Is there a very strange race condition here?

Sounds likely. I suggest having a look at what's going on inside the postmaster process when it gets stuck.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
2018-08-22 16:44 GMT+02:00 Alvaro Herrera <alvherre@2ndquadrant.com>:
>
> Sounds likely. I suggest having a look at what's going on inside the
> postmaster process when it gets stuck.

Well, it doesn't get stuck; it aborts the start with the message:

2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,23,,2018-08-22 14:26:10 UTC,1/0,0,WARNING,01000,"page 179503104 of relation base/18055/212875 does not exist",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182 items",,,,""
2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,24,,2018-08-22 14:26:10 UTC,1/0,0,PANIC,XX000,"WAL contains references to invalid pages",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182 items",,,,""
2018-08-22 14:26:42.214 UTC,,,28483,,5b7d7282.6f43,3,,2018-08-22 14:26:10 UTC,,0,LOG,00000,"startup process (PID 28485) was terminated by signal 6: Aborted",,,,,,,,,""
2018-08-22 14:26:42.214 UTC,,,28483,,5b7d7282.6f43,4,,2018-08-22 14:26:10 UTC,,0,LOG,00000,"terminating any other active server processes",,,,,,,,,""
2018-08-22 14:26:42.290 UTC,,,28483,,5b7d7282.6f43,5,,2018-08-22 14:26:10 UTC,,0,LOG,00000,"database system is shut down",,,,,,,,,""

Regards,
--
Alexander Kukushkin
On 2018-Aug-22, Alexander Kukushkin wrote:

> Well, it doesn't get stuck; it aborts the start with the message:
>
> 2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,23,,2018-08-22
> 14:26:10 UTC,1/0,0,WARNING,01000,"page 179503104 of relation
> base/18055/212875 does not exist",,,,,"xlog redo at AB3/50323E78 for
> Btree/DELETE: 182 items",,,,""
> 2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,24,,2018-08-22
> 14:26:10 UTC,1/0,0,PANIC,XX000,"WAL contains references to invalid
> pages",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182
> items",,,,""
> 2018-08-22 14:26:42.214 UTC,,,28483,,5b7d7282.6f43,3,,2018-08-22
> 14:26:10 UTC,,0,LOG,00000,"startup process (PID 28485) was terminated
> by signal 6: Aborted",,,,,,,,,""

Oh, that's weird ... sounds like the fact that the bgworker starts somehow manages to corrupt the list of invalid pages in the startup process. That doesn't make any sense ... ENOTIME for a closer look ATM, though, sorry. Maybe you could try running under valgrind?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On Wed, 22 Aug 2018 at 17:08, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> > Well, it doesn't get stuck; it aborts the start with the message:
> > 2018-08-22 14:26:42.073 UTC,,,28485,,5b7d7282.6f45,24,,2018-08-22
> > 14:26:10 UTC,1/0,0,PANIC,XX000,"WAL contains references to invalid
> > pages",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182
> > items",,,,""
>
> Oh, that's weird ... sounds like the fact that the bgworker starts
> somehow manages to corrupt the list of invalid pages in the startup
> process. That doesn't make any sense ...

We can see that the crash itself happened because in XLogReadBufferExtended, at `if (PageIsNew(page))` (xlogutils.c:512), we got a page that apparently wasn't initialized yet, and, since we have reached a consistent state, log_invalid_page panics.

> ENOTIME for a closer look ATM, though, sorry. Maybe you could try
> running under valgrind?

Could you elaborate, please: what could we find using valgrind in this case, some memory leaks? In any case, there is a chance that everything will look fine, because even slow tracing under gdb makes this race condition disappear.
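There are two ways XLogReadBufferExtended() can end up in log_invalid_page(), and it is worth keeping them apart here; the following is abridged and lightly paraphrased from 9.6's xlogutils.c (most of the function is elided). The backtraces further down in the thread end up in the first branch, with present=false:

/* Abridged from XLogReadBufferExtended() in xlogutils.c (9.6) */
lastblock = smgrnblocks(smgr, forknum);

if (blkno < lastblock)
{
	/* page exists in the file */
	buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno, mode, NULL);
}
else
{
	/* page does not exist in the file at all */
	if (mode == RBM_NORMAL)
	{
		log_invalid_page(rnode, forknum, blkno, false);	/* "does not exist" */
		return InvalidBuffer;
	}
	/* ... RBM_NORMAL_NO_LOG and extension cases elided ... */
}

if (mode == RBM_NORMAL)
{
	/* check that the page has been initialized */
	Page	page = (Page) BufferGetPage(buffer);

	if (PageIsNew(page))
	{
		ReleaseBuffer(buffer);
		log_invalid_page(rnode, forknum, blkno, true);	/* all-zero page */
		return InvalidBuffer;
	}
}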
On 2018-Aug-22, Dmitry Dolgov wrote:

> > ENOTIME for a closer look ATM, though, sorry. Maybe you could try
> > running under valgrind?
>
> Could you elaborate, please: what could we find using valgrind in this
> case, some memory leaks? In any case, there is a chance that everything
> will look fine, because even slow tracing under gdb makes this race
> condition disappear.

I don't know. I was thinking it could detect some invalid write in shared memory.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Aug 22, 2018 at 02:50:34PM -0300, Alvaro Herrera wrote:
> On 2018-Aug-22, Dmitry Dolgov wrote:
> I don't know. I was thinking it could detect some invalid write in
> shared memory.

invalid_page_tab does not use HASH_SHARED_MEM. What did you have in mind?

On my side, I have been letting a primary/standby set run for a while with the following:
- a pgbench load on the primary;
- a background worker connecting to a database, just running in a loop and sleeping, using BgWorkerStart_ConsistentState as its start mode;
- the standby gets stopped in immediate mode, and restarted after some time so that some activity happens after the recovery point has been reached.

After close to an hour of repeated tests, I am not able to reproduce the problem. Maybe there is something specific to your schema that causes certain types of records to be created. Could you check whether the WAL segments replayed contain any references to the block the hash table is complaining about?
--
Michael
Hi Michael,

It aborts not right after reaching a consistent state, but a little bit later:
"consistent recovery state reached at AB3/4A1B3118"
"WAL contains references to invalid pages",,,,,"xlog redo at AB3/50323E78 for Btree/DELETE: 182 items"
These are in WAL segments 0000000500000AB30000004A and 0000000500000AB300000050, respectively.

Also, I managed to get a backtrace:

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fd3f1deb801 in __GI_abort () at abort.c:79
#2  0x00005644f7094aaf in errfinish (dummy=dummy@entry=0) at ./build/../src/backend/utils/error/elog.c:557
#3  0x00005644f7096a49 in elog_finish (elevel=elevel@entry=22, fmt=fmt@entry=0x5644f70e5998 "WAL contains references to invalid pages") at ./build/../src/backend/utils/error/elog.c:1378
#4  0x00005644f6d8ec18 in log_invalid_page (node=..., forkno=forkno@entry=MAIN_FORKNUM, blkno=blkno@entry=179503104, present=<optimized out>) at ./build/../src/backend/access/transam/xlogutils.c:95
#5  0x00005644f6d8f168 in XLogReadBufferExtended (rnode=..., forknum=forknum@entry=MAIN_FORKNUM, blkno=179503104, mode=mode@entry=RBM_NORMAL) at ./build/../src/backend/access/transam/xlogutils.c:515
#6  0x00005644f6d59448 in btree_xlog_delete_get_latestRemovedXid (record=0x5644f7d35758) at ./build/../src/backend/access/nbtree/nbtxlog.c:593
#7  btree_xlog_delete (record=0x5644f7d35758) at ./build/../src/backend/access/nbtree/nbtxlog.c:682
#8  btree_redo (record=0x5644f7d35758) at ./build/../src/backend/access/nbtree/nbtxlog.c:1012
#9  0x00005644f6d83c34 in StartupXLOG () at ./build/../src/backend/access/transam/xlog.c:6967
#10 0x00005644f6f27953 in StartupProcessMain () at ./build/../src/backend/postmaster/startup.c:216
#11 0x00005644f6d925a7 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7fff1f893a80) at ./build/../src/backend/bootstrap/bootstrap.c:419
#12 0x00005644f6f24983 in StartChildProcess (type=StartupProcess) at ./build/../src/backend/postmaster/postmaster.c:5300
#13 0x00005644f6f27437 in PostmasterMain (argc=16, argv=0x5644f7d07f10) at ./build/../src/backend/postmaster/postmaster.c:1321
#14 0x00005644f6d08b5b in main (argc=16, argv=0x5644f7d07f10) at ./build/../src/backend/main/main.c:228

From looking at xlogutils.c, I figured out that this backtrace is a bit misleading, because it can't actually get to line 515. So I checked out the REL9_6_10 tag and built it with -O0; now it makes more sense.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f9ae0777801 in __GI_abort () at abort.c:79
#2  0x000055c9f285f56d in errfinish (dummy=0) at elog.c:557
#3  0x000055c9f2861d7c in elog_finish (elevel=22, fmt=0x55c9f28cbff0 "WAL contains references to invalid pages") at elog.c:1378
#4  0x000055c9f2424e76 in log_invalid_page (node=..., forkno=MAIN_FORKNUM, blkno=179503104, present=0 '\000') at xlogutils.c:95
#5  0x000055c9f24258ce in XLogReadBufferExtended (rnode=..., forknum=MAIN_FORKNUM, blkno=179503104, mode=RBM_NORMAL) at xlogutils.c:470
#6  0x000055c9f23de8ce in btree_xlog_delete_get_latestRemovedXid (record=0x55c9f3cbf438) at nbtxlog.c:593
#7  0x000055c9f23dea88 in btree_xlog_delete (record=0x55c9f3cbf438) at nbtxlog.c:682
#8  0x000055c9f23df857 in btree_redo (record=0x55c9f3cbf438) at nbtxlog.c:1012
#9  0x000055c9f24160e8 in StartupXLOG () at xlog.c:6967
#10 0x000055c9f267d9f6 in StartupProcessMain () at startup.c:216
#11 0x000055c9f242a276 in AuxiliaryProcessMain (argc=2, argv=0x7ffea746c260) at bootstrap.c:419
#12 0x000055c9f267c93e in StartChildProcess (type=StartupProcess) at postmaster.c:5300
#13 0x000055c9f2677469 in PostmasterMain (argc=16, argv=0x55c9f3c91690) at postmaster.c:1321
#14 0x000055c9f25c1515 in main (argc=16, argv=0x55c9f3c91690) at main.c:228

> After close to an hour of repeated tests, I am not able to reproduce
> the problem. Maybe there is something specific to your schema that
> causes certain types of records to be created. Could you check whether
> the WAL segments replayed contain any references to the block the hash
> table is complaining about?

I am sorry, but I don't really know where to look :(

Regards,
--
Alexander Kukushkin
Hi,

I think I am getting closer. If I start postgres without the background worker, the following code path is taken:

// https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/backend/access/nbtree/nbtxlog.c;h=c536e224321dbc46574aa76876b9d49fa4b5e9a4;hb=REL9_6_10#l549
	if (CountDBBackends(InvalidOid) == 0)
		return latestRemovedXid;

and btree_xlog_delete_get_latestRemovedXid returns early. If we have a background worker, then CountDBBackends(InvalidOid) == 1, and as a result it aborts. The comment before the if statement clearly mentions a possible race condition:

	/*
	 * If there's nothing running on the standby we don't need to derive a
	 * full latestRemovedXid value, so use a fast path out of here. This
	 * returns InvalidTransactionId, and so will conflict with all HS
	 * transactions; but since we just worked out that that's zero people,
	 * it's OK.
	 *
	 * XXX There is a race condition here, which is that a new backend might
	 * start just after we look. If so, it cannot need to conflict, but this
	 * coding will result in throwing a conflict anyway.
	 */

Regards,
--
Alexander Kukushkin
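For orientation, here is the shape of the surrounding function, abridged and lightly paraphrased from 9.6's nbtxlog.c (declarations, locking, and error handling are elided); the loop is where the next message picks up with gdb:

static TransactionId
btree_xlog_delete_get_latestRemovedXid(XLogReaderState *record)
{
	xl_btree_delete *xlrec = (xl_btree_delete *) XLogRecGetData(record);
	TransactionId latestRemovedXid = InvalidTransactionId;

	/* Fast path: nobody is connected, so there is nothing to conflict with. */
	if (CountDBBackends(InvalidOid) == 0)
		return latestRemovedXid;

	/* Read the index page targeted by the DELETE record from disk. */
	XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno);
	ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL);
	if (!BufferIsValid(ibuffer))
		return InvalidTransactionId;
	ipage = (Page) BufferGetPage(ibuffer);

	/*
	 * Walk the nitems offsets stored in the WAL record, look up each index
	 * tuple on the on-disk page, and fetch the heap page it points to in
	 * order to derive latestRemovedXid.
	 */
	unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeDelete);
	for (i = 0; i < xlrec->nitems; i++)
	{
		iitemid = PageGetItemId(ipage, unused[i]);
		itup = (IndexTuple) PageGetItem(ipage, iitemid);

		hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
		hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM,
										 hblkno, RBM_NORMAL);
		/* ... inspect the heap tuple and update latestRemovedXid ... */
	}

	/* ... */
	return latestRemovedXid;
}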
Hi,

btree_xlog_delete_get_latestRemovedXid is called only when postgres is already running InHotStandby. What this function actually does is read the index page from disk and iterate over its IndexTuple entries to find the corresponding heap pages. The problem is that xlrec->nitems = 182 in my case, but the index page contains only 56 tuples, and the rest of the page is filled with zeros:

(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f07a2be2801 in __GI_abort () at abort.c:79
#2  0x000055938000f586 in errfinish (dummy=0) at elog.c:557
#3  0x0000559380011d95 in elog_finish (elevel=22, fmt=0x55938007c010 "WAL contains references to invalid pages") at elog.c:1378
#4  0x000055937fbd4e8f in log_invalid_page (node=..., forkno=MAIN_FORKNUM, blkno=179503104, present=0 '\000') at xlogutils.c:95
#5  0x000055937fbd58e7 in XLogReadBufferExtended (rnode=..., forknum=MAIN_FORKNUM, blkno=179503104, mode=RBM_NORMAL) at xlogutils.c:470
#6  0x000055937fb8e8e7 in btree_xlog_delete_get_latestRemovedXid (record=0x55938151b438) at nbtxlog.c:599
#7  0x000055937fb8eaa1 in btree_xlog_delete (record=0x55938151b438) at nbtxlog.c:688
#8  0x000055937fb8f870 in btree_redo (record=0x55938151b438) at nbtxlog.c:1018
#9  0x000055937fbc6101 in StartupXLOG () at xlog.c:6967
#10 0x000055937fe2da0f in StartupProcessMain () at startup.c:216
#11 0x000055937fbda28f in AuxiliaryProcessMain (argc=2, argv=0x7ffe88f10df0) at bootstrap.c:419
#12 0x000055937fe2c957 in StartChildProcess (type=StartupProcess) at postmaster.c:5300
#13 0x000055937fe27482 in PostmasterMain (argc=16, argv=0x5593814ed690) at postmaster.c:1321
#14 0x000055937fd7152e in main (argc=16, argv=0x5593814ed690) at main.c:228
(gdb) frame 6
#6  0x000055937fb8e8e7 in btree_xlog_delete_get_latestRemovedXid (record=0x55938151b438) at nbtxlog.c:599
599                     hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM, hblkno, RBM_NORMAL);
(gdb) l 587
582              * Loop through the deleted index items to obtain the TransactionId from
583              * the heap items they point to.
584              */
585             unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeDelete);
586
587             for (i = 0; i < xlrec->nitems; i++)
588             {
589                     /*
590                      * Identify the index tuple about to be deleted
591                      */
(gdb) l
592                     iitemid = PageGetItemId(ipage, unused[i]);
593                     itup = (IndexTuple) PageGetItem(ipage, iitemid);
594
595                     /*
596                      * Locate the heap page that the index tuple points at
597                      */
598                     hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
599                     hbuffer = XLogReadBufferExtended(xlrec->hnode, MAIN_FORKNUM, hblkno, RBM_NORMAL);
600                     if (!BufferIsValid(hbuffer))
601                     {
(gdb) p i
$45 = 53
(gdb) p unused[i]
$46 = 57
(gdb) p *iitemid
$47 = {lp_off = 0, lp_flags = 0, lp_len = 0}

So iitemid points to the beginning of the index page, which doesn't make sense.

Why is the number of tuples in the xlog record greater than the number of tuples on the index page? Because this page was already overwritten, and its LSN is HIGHER than the LSN currently being replayed!

(gdb) p *(PageHeader)ipage
$49 = {pd_lsn = {xlogid = 2739, xrecoff = 1455373160}, pd_checksum = 3042, pd_flags = 0, pd_lower = 248, pd_upper = 5936, pd_special = 8176, pd_pagesize_version = 8196, pd_prune_xid = 0, pd_linp = 0x7f06ed9e9598}

Index page LSN: AB3/56BF3B68
Current LSN:    AB3/50323E78

Is there a way to recover from such a situation? Should postgres in such a case compare the LSNs and, if the LSN on the page is higher than the current LSN, simply return InvalidTransactionId?
Apparently, if there are no connections open, postgres simply does not run this code, and it seems OK.

Regards,
--
Alexander Kukushkin
On Sat, Aug 25, 2018 at 09:54:39AM +0200, Alexander Kukushkin wrote:
> Why is the number of tuples in the xlog record greater than the number of
> tuples on the index page?
> Because this page was already overwritten, and its LSN is HIGHER than
> the LSN currently being replayed!

That's annoying, because it means that the control file of your server maps to a consistent point which is older than some of the relation pages. How was the base backup of this node created? Please remember that when taking a base backup from a standby, you should back up the control file last, as there is no end-of-backup record available in that case. So it seems to me that the origin of your problem is an incorrect base backup procedure?

> Is there a way to recover from such a situation? Should postgres
> in such a case compare the LSNs and, if the LSN on the page is
> higher than the current LSN, simply return InvalidTransactionId?
> Apparently, if there are no connections open, postgres simply does not
> run this code, and it seems OK.

One idea I have would be to copy all the WAL segments up to the point where the pages to-be-updated are, and let Postgres replay all the local WAL first. However, it is hard to say whether that would be enough, as you could have more references to pages even newer than the btree one you just found.
--
Michael
On August 28, 2018 11:44:09 AM GMT+09:00, Michael Paquier <michael@paquier.xyz> wrote:
> On Sat, Aug 25, 2018 at 09:54:39AM +0200, Alexander Kukushkin wrote:
> > Why is the number of tuples in the xlog record greater than the number of
> > tuples on the index page?
> > Because this page was already overwritten, and its LSN is HIGHER than
> > the LSN currently being replayed!
>
> That's annoying, because it means that the control file of your
> server maps to a consistent point which is older than some of the
> relation pages. How was the base backup of this node created? Please
> remember that when taking a base backup from a standby, you should
> back up the control file last, as there is no end-of-backup record
> available in that case. So it seems to me that the origin of your
> problem is an incorrect base backup procedure?

Uh, where is that "control file last" bit coming from?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
Hi Michael,

> That's annoying, because it means that the control file of your
> server maps to a consistent point which is older than some of the
> relation pages. How was the base backup of this node created? Please
> remember that when taking a base backup from a standby, you should
> back up the control file last, as there is no end-of-backup record
> available in that case. So it seems to me that the origin of your
> problem is an incorrect base backup procedure?

We are running a cluster of 3 nodes (m4.large + an EBS volume for PGDATA) on AWS. The replicas were initialized about a year ago with pg_basebackup and had been working absolutely fine. In the past year I did a few minor upgrades with switchover (first upgrade the replicas, switch over, then upgrade the former primary). The last switchover was done on August 19th. This instance was working as a replica for about three days until the sudden crash of the EC2 instance. On the new instance we attached the existing EBS volume with the existing PGDATA and tried to start postgres. The consequences you can see in the very first email.

> One idea I have would be to copy all the WAL segments up to the point
> where the pages to-be-updated are, and let Postgres replay all the local
> WAL first. However, it is hard to say whether that would be enough, as you
> could have more references to pages even newer than the btree one you
> just found.

Well, I did some experiments, among them the approach you suggest, i.e. I commented out restore_command in recovery.conf and copied quite a few WAL segments into pg_xlog. The results are the same: it aborts as long as there are connections open :(

Regards,
--
Alexander Kukushkin
Greetings,

* Michael Paquier (michael@paquier.xyz) wrote:
> On Sat, Aug 25, 2018 at 09:54:39AM +0200, Alexander Kukushkin wrote:
> > Why is the number of tuples in the xlog record greater than the number of
> > tuples on the index page?
> > Because this page was already overwritten, and its LSN is HIGHER than
> > the LSN currently being replayed!
>
> That's annoying, because it means that the control file of your
> server maps to a consistent point which is older than some of the
> relation pages. How was the base backup of this node created? Please
> remember that when taking a base backup from a standby, you should
> back up the control file last, as there is no end-of-backup record
> available in that case. So it seems to me that the origin of your
> problem is an incorrect base backup procedure?

Yeah, we really need to know how this system was built. If it was built using the low-level pg_start/stop_backup, then which API was used: the old one that creates a backup_label file, or the new API which requires the user to create the backup_label file?

I haven't followed this thread very closely, but I wonder if maybe the new API was used but no backup_label file was created, making PG think it was doing crash recovery instead of restoring from a file-level backup...

Thanks!

Stephen
Greetings,

* Andres Freund (andres@anarazel.de) wrote:
> On August 28, 2018 11:44:09 AM GMT+09:00, Michael Paquier <michael@paquier.xyz> wrote:
> > That's annoying, because it means that the control file of your
> > server maps to a consistent point which is older than some of the
> > relation pages. How was the base backup of this node created? Please
> > remember that when taking a base backup from a standby, you should
> > back up the control file last, as there is no end-of-backup record
> > available in that case. So it seems to me that the origin of your
> > problem is an incorrect base backup procedure?
>
> Uh, where is that "control file last" bit coming from?

pg_basebackup copies it last. The comments should probably be improved as to *why*, but my recollection is that it's, at least in part, to ensure the new cluster can't be used until it's actually a complete backup.

Thanks!

Stephen
On Tue, Aug 28, 2018 at 08:23:11AM -0400, Stephen Frost wrote:
> * Andres Freund (andres@anarazel.de) wrote:
> > Uh, where is that "control file last" bit coming from?
>
> pg_basebackup copies it last. The comments should probably be improved
> as to *why*, but my recollection is that it's, at least in part, to
> ensure the new cluster can't be used until it's actually a complete
> backup.

What we have now is mainly in basebackup.c; see 8366c780, which introduced that. Stephen has that right: as we cannot rely on an end-of-backup record when taking a backup from a standby, copying the control file last ensures that the consistent point is late enough that no other pages are inconsistent. Even with that, I think that there is still a small race condition, but I cannot put my finger on it now. I agree that the current comments do a bad job of explaining why this happens. That's actually something I discovered when discussing what resulted in f267c1c2.
--
Michael
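Conceptually, the ordering Stephen and Michael describe boils down to the following. This is a simplified sketch only, not the actual perform_base_backup() code, and the helper names are made up for illustration:

/* Simplified sketch -- not the real basebackup.c code. */
static void
base_backup_from_standby_sketch(void)
{
	/*
	 * 1. Stream every data file except pg_control.  Pages copied here can
	 *    carry LSNs newer than the checkpoint the backup started from,
	 *    because the standby keeps replaying WAL while the copy runs.
	 */
	send_everything_except_control_file();	/* hypothetical helper */

	/*
	 * 2. Stream pg_control last.  With no end-of-backup record available on
	 *    a standby, recovery takes its consistent point from the control
	 *    file's minRecoveryPoint; copying it last makes that value at least
	 *    as new as any page copied in step 1, so recovery cannot declare
	 *    consistency before those pages have been brought up to date.
	 */
	send_control_file();					/* hypothetical helper */
}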
Hi Michael,
it looks like you missed my answer: https://www.postgresql.org/message-id/CAFh8B%3Dm0Bht-BfKmyzfxcivzjcqRd7BbNHeWthDveWwZ%2BDrV2A%40mail.gmail.com
Regards,
--
Alexander
On Sat, Aug 25, 2018 at 09:54:39AM +0200, Alexander Kukushkin wrote:
> Is there a way to recover from such a situation? Should postgres
> in such a case compare the LSNs and, if the LSN on the page is
> higher than the current LSN, simply return InvalidTransactionId?

Hmm. That does not sound right to me. If the page has a header LSN higher than the one being replayed, we should not see it, as consistency has normally been reached. XLogReadBufferExtended() seems to complain in your case about a page which should not exist per the information in your backtrace. What is the length of the relation file at this point? If the relation is considered as having fewer blocks, shouldn't we just ignore it if we're trying to delete items on it, and return InvalidTransactionId? I have to admit that I am not the best specialist in this code.

hblkno also looks insanely high to me compared with the other blkno references you mention upthread.

> Apparently, if there are no connections open, postgres simply does not
> run this code, and it seems OK.

Yeah, that's used for standby conflicts.
--
Michael
Hi,

2018-08-29 8:17 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
> Hmm. That does not sound right to me. If the page has a header LSN
> higher than the one being replayed, we should not see it, as consistency
> has normally been reached. XLogReadBufferExtended() seems to complain in
> your case about a page which should not exist per the information in
> your backtrace. What is the length of the relation file at this point? If
> the relation is considered as having fewer blocks, shouldn't we just
> ignore it if we're trying to delete items on it, and return
> InvalidTransactionId? I have to admit that I am not the best specialist
> in this code.

This is what pg_controldata reports:

Latest checkpoint location:           AB3/4A1B30A8
Prior checkpoint location:            AB3/4A1B30A8
Latest checkpoint's REDO location:    AB3/4149B340
Latest checkpoint's REDO WAL file:    0000000500000AB300000041
Latest checkpoint's TimeLineID:       5
Latest checkpoint's PrevTimeLineID:   5
Latest checkpoint's full_page_writes: on
Minimum recovery ending location:     AB3/4A1B3118
Min recovery ending loc's timeline:   5
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    replica
wal_log_hints setting:                on

Therefore it reaches consistency at AB3/4A1B3118 and starts accepting connections. A few moments later it starts replaying 0000000500000AB300000050, where "AB3/50323E78" contains "Btree/DELETE: 182". This record in the pg_xlogdump output looks like:

rmgr: Btree       len (rec/tot): 436/   547, tx:  518902810, lsn: AB3/50323E78, prev AB3/50323E20, desc: DELETE 182 items, blkref #0: rel 1663/18055/212907 blk 72478 FPW

212907 is the btree index:

$ ls -al data/base/18055/212907*
-rw------- 1 akukushkin akukushkin 594812928 Aug 22 07:22 data/base/18055/212907
-rw------- 1 akukushkin akukushkin    163840 Aug 22 07:22 data/base/18055/212907_fsm

212875 is the table:

$ ls -al data/base/18055/212875*
-rw------- 1 akukushkin akukushkin 1073741824 Aug 22 07:20 data/base/18055/212875
-rw------- 1 akukushkin akukushkin 1014947840 Aug 22 07:21 data/base/18055/212875.1
-rw------- 1 akukushkin akukushkin     532480 Aug 22 07:21 data/base/18055/212875_fsm
-rw------- 1 akukushkin akukushkin      65536 Aug 22 07:21 data/base/18055/212875_vm

As you can see, its size is only about 2GB.

> hblkno also looks insanely high to me compared with the other blkno
> references you mention upthread.

hblkno is calculated from the data which postgres reads from the index file data/base/18055/212907, block 72478. That block contains only 56 index tuples, while at least 182 are expected, so starting from tuple 57 the item pointer looks like:

(gdb) p *iitemid
$47 = {lp_off = 0, lp_flags = 0, lp_len = 0}

Basically, it points to the beginning of the page, which doesn't make sense at all.

Why doesn't block 72478 of the index relfile meet our expectations, i.e. why does it contain so few tuples? The answer is in the page header: the LSN written in the index page header is AB3/56BF3B68. That can only mean that, while postgres was working before the crash, it had already applied the WAL stream to at least AB3/56BF3B68, which is far ahead of "Minimum recovery ending location: AB3/4A1B3118".

Regards,
--
Alexander Kukushkin
On Wed, Aug 29, 2018 at 08:59:16AM +0200, Alexander Kukushkin wrote:
> Why doesn't block 72478 of the index relfile meet our expectations,
> i.e. why does it contain so few tuples?
> The answer is in the page header: the LSN written in the index page
> header is AB3/56BF3B68.
> That can only mean that, while postgres was working before the crash,
> it had already applied the WAL stream to at least AB3/56BF3B68, which
> is far ahead of "Minimum recovery ending location: AB3/4A1B3118".

Yeah, that's the pinpoint. Do you know by chance what the content of the control file was for each standby you upgraded to 9.6.10, before starting them with the new binaries? You mentioned a cluster of three nodes, so I guess that you have two standbys, and that one of them did not see the symptoms discussed here, while the other did. Do you still have the logs of the recovery just after starting the other standby with 9.6.10, the one which did not see the symptom? All your standbys are using the background worker which would cause the btree deletion code to be scanned, right?

I am trying to work on a reproducer with a bgworker starting once a consistent state has been reached, without success yet. Does your cluster generate some XLOG_PARAMETER_CHANGE records? In some cases, 9.4.8 could have updated minRecoveryPoint to go backward, which is something that 8d68ee6 has been working on addressing.

Did you also try to use local WAL segments up to where AB3/56BF3B68 is applied, and also have a restore_command so that extra WAL segments would be fetched from the archive?
--
Michael
Hi,

2018-08-29 14:10 GMT+02:00 Michael Paquier <michael@paquier.xyz>:
> Yeah, that's the pinpoint. Do you know by chance what the content of
> the control file was for each standby you upgraded to 9.6.10, before
> starting them with the new binaries?

No, I don't. Right after the upgrade they started normally and worked fine for a few days. I believe the control file was overwritten a few hundred times before the instance crashed.

> You mentioned a cluster of three nodes, so I guess that you have two
> standbys, and that one of them did not see the symptoms discussed here,
> while the other did.

The other node didn't crash and is still working.

> Do you still have the logs of the recovery just after starting the other
> standby with 9.6.10, the one which did not see the symptom?

I don't think it is really related to the minor upgrade. After the upgrade the whole cluster was running for about 3 days. Every day it generates about 2000 WAL segments; the total volume of daily WAL is very close to the size of the cluster, which is 38GB.

> All your standbys are using the background worker which would cause the
> btree deletion code to be scanned, right?

Well, any open connection to the database will produce the same result. In our case we are using Patroni for automatic failover, which connects immediately after postgres has started and keeps this connection permanently open. The background worker just happened to be faster than anything else.

> I am trying to work on a reproducer with a bgworker starting once a
> consistent state has been reached, without success yet. Does your cluster
> generate some XLOG_PARAMETER_CHANGE records? In some cases, 9.4.8 could
> have updated minRecoveryPoint to go backward, which is something that
> 8d68ee6 has been working on addressing.

No, it doesn't.

> Did you also try to use local WAL segments up to where AB3/56BF3B68 is
> applied, and also have a restore_command so that extra WAL segments
> would be fetched from the archive?

If there are no connections open, it applies the necessary amount of WAL segments (with the help of restore_command, of course) and reaches real consistency. After that, it is possible to connect, and the startup process doesn't abort anymore.

BTW, I am thinking that we should return InvalidTransactionId from btree_xlog_delete_get_latestRemovedXid if the index page we read from disk is newer than the xlog record we are currently processing. Please see the patch attached.

--
Alexander Kukushkin
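The attached patch itself is not reproduced in the archive. A guard of the kind Alexander describes, placed after the index buffer has been read and locked in btree_xlog_delete_get_latestRemovedXid, would look roughly like the following hypothetical sketch, which is not necessarily what the attachment contains:

	/* Hypothetical sketch of the proposed guard -- not the attached patch. */
	ipage = (Page) BufferGetPage(ibuffer);

	/*
	 * If the on-disk index page is already newer than the record being
	 * replayed, its line pointers no longer match what the DELETE record
	 * describes, so stop here and fall back to the same InvalidTransactionId
	 * result that the CountDBBackends() fast path uses.
	 */
	if (PageGetLSN(ipage) >= record->EndRecPtr)
	{
		UnlockReleaseBuffer(ibuffer);
		return InvalidTransactionId;
	}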
On Wed, Aug 29, 2018 at 03:05:25PM +0200, Alexander Kukushkin wrote:
> BTW, I am thinking that we should return InvalidTransactionId from
> btree_xlog_delete_get_latestRemovedXid if the index page we read from
> disk is newer than the xlog record we are currently processing. Please
> see the patch attached.

This is not a solution in my opinion, as you could invalidate the activity of backends connected to the database when the incorrect consistent point allows connections to come in too early.

What happens if you replay with hot_standby = on up to the latest point, without any concurrent connections, then issue a checkpoint on the standby once you have got to a point newer than the complaint, and finally restart the standby with the bgworker?

Another idea I have would be to make the standby promote, issue a checkpoint on it, and then use pg_rewind as a trick to update the control file to a point newer than the inconsistency. As PG < 9.6.10 could make the minimum recovery point go backwards, applying the upgrade after the consistent point got into an incorrect state would trigger the failure.
--
Michael