Re: BUG #15346: Replica fails to start after the crash - Mailing list pgsql-bugs

From Alexander Kukushkin
Subject Re: BUG #15346: Replica fails to start after the crash
Date
Msg-id CAFh8B=mYGP1J8ueHKU40Dr4bNBUPJj1rTbZ6Wpg4gR30uCRunA@mail.gmail.com
Whole thread Raw
In response to Re: BUG #15346: Replica fails to start after the crash  (Alexander Kukushkin <cyberdemn@gmail.com>)
Responses Re: BUG #15346: Replica fails to start after the crash
Re: BUG #15346: Replica fails to start after the crash
List pgsql-bugs
Hi,


btree_xlog_delete_get_latestRemovedXid is callled only when the
postgres is already running InHotStandby
And what this function is actually doing, it reads index page from
disk and iterates over IndexTuple records to get corresponding
HeapPages.
The problem is that xlrec->nitems = 182 in my case, but on the index
page there are only 56 tuples and the rest of the page is filled with
zeros:
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00007f07a2be2801 in __GI_abort () at abort.c:79
#2  0x000055938000f586 in errfinish (dummy=0) at elog.c:557
#3  0x0000559380011d95 in elog_finish (elevel=22, fmt=0x55938007c010
"WAL contains references to invalid pages") at elog.c:1378
#4  0x000055937fbd4e8f in log_invalid_page (node=...,
forkno=MAIN_FORKNUM, blkno=179503104, present=0 '\000') at
xlogutils.c:95
#5  0x000055937fbd58e7 in XLogReadBufferExtended (rnode=...,
forknum=MAIN_FORKNUM, blkno=179503104, mode=RBM_NORMAL) at
xlogutils.c:470
#6  0x000055937fb8e8e7 in btree_xlog_delete_get_latestRemovedXid
(record=0x55938151b438) at nbtxlog.c:599
#7  0x000055937fb8eaa1 in btree_xlog_delete (record=0x55938151b438) at
nbtxlog.c:688
#8  0x000055937fb8f870 in btree_redo (record=0x55938151b438) at nbtxlog.c:1018
#9  0x000055937fbc6101 in StartupXLOG () at xlog.c:6967
#10 0x000055937fe2da0f in StartupProcessMain () at startup.c:216
#11 0x000055937fbda28f in AuxiliaryProcessMain (argc=2,
argv=0x7ffe88f10df0) at bootstrap.c:419
#12 0x000055937fe2c957 in StartChildProcess (type=StartupProcess) at
postmaster.c:5300
#13 0x000055937fe27482 in PostmasterMain (argc=16,
argv=0x5593814ed690) at postmaster.c:1321
#14 0x000055937fd7152e in main (argc=16, argv=0x5593814ed690) at main.c:228
(gdb) frame 6
#6  0x000055937fb8e8e7 in btree_xlog_delete_get_latestRemovedXid
(record=0x55938151b438) at nbtxlog.c:599
599                     hbuffer = XLogReadBufferExtended(xlrec->hnode,
MAIN_FORKNUM, hblkno, RBM_NORMAL);
(gdb) l 587
582              * Loop through the deleted index items to obtain the
TransactionId from
583              * the heap items they point to.
584              */
585             unused = (OffsetNumber *) ((char *) xlrec + SizeOfBtreeDelete);
586
587             for (i = 0; i < xlrec->nitems; i++)
588             {
589                     /*
590                      * Identify the index tuple about to be deleted
591                      */
(gdb) l
592                     iitemid = PageGetItemId(ipage, unused[i]);
593                     itup = (IndexTuple) PageGetItem(ipage, iitemid);
594
595                     /*
596                      * Locate the heap page that the index tuple points at
597                      */
598                     hblkno = ItemPointerGetBlockNumber(&(itup->t_tid));
599                     hbuffer = XLogReadBufferExtended(xlrec->hnode,
MAIN_FORKNUM, hblkno, RBM_NORMAL);
600                     if (!BufferIsValid(hbuffer))
601                     {
(gdb) p i
$45 = 53
(gdb) p unused[i]
$46 = 57
(gdb) p *iitemid
$47 = {lp_off = 0, lp_flags = 0, lp_len = 0}

Therefore iitemid points to the beginning of the index page, what
doesn't make sense.

Why the number of tuples in the xlog is greater than the number of
tuples on the index page?
Because this page was already overwritten and its LSN is HIGHER than
the current LSN!

(gdb) p *(PageHeader)ipage
$49 = {pd_lsn = {xlogid = 2739, xrecoff = 1455373160}, pd_checksum =
3042, pd_flags = 0, pd_lower = 248, pd_upper = 5936, pd_special =
8176, pd_pagesize_version = 8196, pd_prune_xid = 0, pd_linp =
0x7f06ed9e9598}

Index page LSN: AB3/56BF3B68
Current LSN: AB3/50323E78

Is there a way to recover from such a situation? Should the postgres
in such case do comparison of LSNs and if the LSN on the page is
higher than the current LSN simply return InvalidTransactionId?
Apparently, if there are no connections open postgres simply is not
running this code and it seems ok.


Regards,
--
Alexander Kukushkin


pgsql-bugs by date:

Previous
From: Amit Kapila
Date:
Subject: Re: Re: Re: Bug: ERROR: invalid cache ID: 42 CONTEXT: parallel worker
Next
From: PG Bug reporting form
Date:
Subject: BUG #15352: postgresql FDW error "ERROR: ORDER BY position 0 is notin select list"