Re: BUG #15346: Replica fails to start after the crash - Mailing list pgsql-bugs

From Alexander Kukushkin
Subject Re: BUG #15346: Replica fails to start after the crash
Date
Msg-id CAFh8B==4gpwor6ODercru0iifqwd0nQrdRvXzmorWkMRQa4q=g@mail.gmail.com
Whole thread Raw
In response to Re: BUG #15346: Replica fails to start after the crash  (Michael Paquier <michael@paquier.xyz>)
Responses Re: BUG #15346: Replica fails to start after the crash
List pgsql-bugs
Hi Michae,

It aborts not right after reaching a consistent state, but a little bit later:
"consistent recovery state reached at AB3/4A1B3118"
"WAL contains references to invalid pages",,,,,"xlog redo at
AB3/50323E78 for Btree/DELETE: 182 items"

Wal segments 0000000500000AB30000004A and 0000000500000AB300000050
correspondingly.

Also I managed to get a backtrace:

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
51      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007fd3f1deb801 in __GI_abort () at abort.c:79
#2  0x00005644f7094aaf in errfinish (dummy=dummy@entry=0) at
./build/../src/backend/utils/error/elog.c:557
#3  0x00005644f7096a49 in elog_finish (elevel=elevel@entry=22,
fmt=fmt@entry=0x5644f70e5998 "WAL contains references to invalid
pages") at ./build/../src/backend/utils/error/elog.c:1378
#4  0x00005644f6d8ec18 in log_invalid_page (node=...,
forkno=forkno@entry=MAIN_FORKNUM, blkno=blkno@entry=179503104,
present=<optimized out>) at
./build/../src/backend/access/transam/xlogutils.c:95
#5  0x00005644f6d8f168 in XLogReadBufferExtended (rnode=...,
forknum=forknum@entry=MAIN_FORKNUM, blkno=179503104,
mode=mode@entry=RBM_NORMAL) at
./build/../src/backend/access/transam/xlogutils.c:515
#6  0x00005644f6d59448 in btree_xlog_delete_get_latestRemovedXid
(record=0x5644f7d35758) at
./build/../src/backend/access/nbtree/nbtxlog.c:593
#7  btree_xlog_delete (record=0x5644f7d35758) at
./build/../src/backend/access/nbtree/nbtxlog.c:682
#8  btree_redo (record=0x5644f7d35758) at
./build/../src/backend/access/nbtree/nbtxlog.c:1012
#9  0x00005644f6d83c34 in StartupXLOG () at
./build/../src/backend/access/transam/xlog.c:6967
#10 0x00005644f6f27953 in StartupProcessMain () at
./build/../src/backend/postmaster/startup.c:216
#11 0x00005644f6d925a7 in AuxiliaryProcessMain (argc=argc@entry=2,
argv=argv@entry=0x7fff1f893a80) at
./build/../src/backend/bootstrap/bootstrap.c:419
#12 0x00005644f6f24983 in StartChildProcess (type=StartupProcess) at
./build/../src/backend/postmaster/postmaster.c:5300
#13 0x00005644f6f27437 in PostmasterMain (argc=16,
argv=0x5644f7d07f10) at
./build/../src/backend/postmaster/postmaster.c:1321
#14 0x00005644f6d08b5b in main (argc=16, argv=0x5644f7d07f10) at
./build/../src/backend/main/main.c:228


From looking on xlogutils.c I actually figured out that this backtrace
is a bit misleading because it can't get to the line 515
So I checked out REL9_6_10 tag and build it with -O0, now it makes more sense.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f9ae0777801 in __GI_abort () at abort.c:79
#2  0x000055c9f285f56d in errfinish (dummy=0) at elog.c:557
#3  0x000055c9f2861d7c in elog_finish (elevel=22, fmt=0x55c9f28cbff0
"WAL contains references to invalid pages") at elog.c:1378
#4  0x000055c9f2424e76 in log_invalid_page (node=...,
forkno=MAIN_FORKNUM, blkno=179503104, present=0 '\000') at
xlogutils.c:95
#5  0x000055c9f24258ce in XLogReadBufferExtended (rnode=...,
forknum=MAIN_FORKNUM, blkno=179503104, mode=RBM_NORMAL) at
xlogutils.c:470
#6  0x000055c9f23de8ce in btree_xlog_delete_get_latestRemovedXid
(record=0x55c9f3cbf438) at nbtxlog.c:593
#7  0x000055c9f23dea88 in btree_xlog_delete (record=0x55c9f3cbf438) at
nbtxlog.c:682
#8  0x000055c9f23df857 in btree_redo (record=0x55c9f3cbf438) at nbtxlog.c:1012
#9  0x000055c9f24160e8 in StartupXLOG () at xlog.c:6967
#10 0x000055c9f267d9f6 in StartupProcessMain () at startup.c:216
#11 0x000055c9f242a276 in AuxiliaryProcessMain (argc=2,
argv=0x7ffea746c260) at bootstrap.c:419
#12 0x000055c9f267c93e in StartChildProcess (type=StartupProcess) at
postmaster.c:5300
#13 0x000055c9f2677469 in PostmasterMain (argc=16,
argv=0x55c9f3c91690) at postmaster.c:1321
#14 0x000055c9f25c1515 in main (argc=16, argv=0x55c9f3c91690) at main.c:228


> After close to an hour of repeated tests, I am not able to reproduce
> that.  Maybe there are some specifics in your schema causing more
> certain types of records to be created.  Could you check if in the set
> of WAL segments replayed you have references to the block the hash table
> is complaining about?

I am sorry, but don't really know where to look :(


Regards,
--
Alexander Kukushkin


pgsql-bugs by date:

Previous
From: Michael Paquier
Date:
Subject: Re: BUG #15347: Unaccent for greek characters does not work
Next
From: PG Bug reporting form
Date:
Subject: BUG #15348: Postgres 8.4 accepte les connexions entrantes