Re:Re:Standby got fatal after the crash recovery - Mailing list pgsql-hackers

From Thunder
Subject Re:Re:Standby got fatal after the crash recovery
Date
Msg-id 48ddd7b5.27a5.170e6713788.Coremail.thunder1@126.com
Whole thread Raw
In response to Re:Standby got fatal after the crash recovery  (Thunder <thunder1@126.com>)
Responses Re: Re:Standby got fatal after the crash recovery  (Michael Paquier <michael@paquier.xyz>)
List pgsql-hackers
The SLRU error detail is the following:
DETAIL:  Could not read from file "/data/pg_xact/003C" at offset 221184: Success.

I think the read from /data/pg_xact/003C at offset 221184 returned 0 bytes, so errno was never set and the error message reports "Success".

At 2020-03-17 10:36:03, "Thunder" <thunder1@126.com> wrote:

I would appreciate any suggestions on this issue.
Is there something I misunderstand?

At 2020-03-17 00:33:36, "Thunder" <thunder1@126.com> wrote:

Hello hackers:

Our standby node hit a FATAL error after crash recovery. The error was raised in the SLRU module; I changed the log level from ERROR to PANIC and got the following stack:

(gdb) bt
#0  0x00007f0cc47a1277 in raise () from /lib64/libc.so.6
#1  0x00007f0cc47a2968 in abort () from /lib64/libc.so.6
#2  0x0000000000a48347 in errfinish (dummy=dummy@entry=0) at elog.c:616
#3  0x00000000005315dd in SlruReportIOError (ctl=ctl@entry=0xfbad00 <ClogCtlData>, pageno=1947, xid=xid@entry=63800060) at slru.c:1175
#4  0x0000000000533152 in SimpleLruReadPage (ctl=ctl@entry=0xfbad00 <ClogCtlData>, pageno=1947, write_ok=write_ok@entry=true, xid=xid@entry=63800060) at slru.c:610
#5  0x0000000000533350 in SimpleLruReadPage_ReadOnly (ctl=ctl@entry=0xfbad00 <ClogCtlData>, pageno=pageno@entry=1947, xid=xid@entry=63800060) at slru.c:680
#6  0x00000000005293fd in TransactionIdGetStatus (xid=xid@entry=63800060, lsn=lsn@entry=0x7ffd17fc5130) at clog.c:661
#7  0x000000000053574a in TransactionLogFetch (transactionId=63800060) at transam.c:79
#8  TransactionIdDidCommit (transactionId=transactionId@entry=63800060) at transam.c:129
#9 0x00000000004f1295 in HeapTupleHeaderAdvanceLatestRemovedXid (tuple=0x2aab27e936e0, latestRemovedXid=latestRemovedXid@entry=0x7ffd17fc51b0) at heapam.c:7672
#10 0x00000000005103e0 in btree_xlog_delete_get_latestRemovedXid (record=record@entry=0x4636c98) at nbtxlog.c:656
#11 0x0000000000510a19 in btree_xlog_delete (record=0x4636c98) at nbtxlog.c:707
#12 btree_redo (record=0x4636c98) at nbtxlog.c:1048
#13 0x00000000005544a1 in StartupXLOG () at xlog.c:7825
#14 0x00000000008185be in StartupProcessMain () at startup.c:226
#15 0x000000000058de15 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffd17fc9430) at bootstrap.c:448
#16 0x0000000000813fe4 in StartChildProcess (type=StartupProcess) at postmaster.c:5804
#17 0x0000000000817eb0 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x45ed6e0) at postmaster.c:1461
#18 0x00000000004991f4 in main (argc=3, argv=0x45ed6e0) at main.c:232
(gdb) p /x *record
$10 = {wal_segment_size = 0x40000000, read_page = 0x54f920, system_identifier = 0x5e6e6ea4af938064, private_data = 0x7ffd17fc5390, ReadRecPtr = 0xe41e8fb28, EndRecPtr = 0xe41e8fbb0, decoded_record = 0x4634390, main_data = 0x4634c88, main_data_len = 0x54,
  main_data_bufsz = 0x1000, record_origin = 0x0, blocks = {{in_use = 0x1, rnode = {spcNode = 0x67f, dbNode = 0x7517968, relNode = 0x7517e81}, forknum = 0x0, blkno = 0x55f, flags = 0x0, has_image = 0x0, apply_image = 0x0, bkp_image = 0x4632bc1, hole_offset = 0x40,
      hole_length = 0x1fb0, bimg_len = 0x50, bimg_info = 0x5, has_data = 0x0, data = 0x4656938, data_len = 0x0, data_bufsz = 0x2000}, {in_use = 0x0, rnode = {spcNode = 0x67f, dbNode = 0x7517968, relNode = 0x7517e2b}, forknum = 0x0, blkno = 0x3b5f, flags = 0x80,
      has_image = 0x0, apply_image = 0x0, bkp_image = 0x0, hole_offset = 0x0, hole_length = 0x0, bimg_len = 0x0, bimg_info = 0x0, has_data = 0x0, data = 0x46468e8, data_len = 0x0, data_bufsz = 0x2000}, {in_use = 0x0, rnode = {spcNode = 0x67f, dbNode = 0x7517968,
        relNode = 0x7e5c370}, forknum = 0x0, blkno = 0x2c3, flags = 0x80, has_image = 0x0, apply_image = 0x0, bkp_image = 0x0, hole_offset = 0x0, hole_length = 0x0, bimg_len = 0x0, bimg_info = 0x0, has_data = 0x0, data = 0x0, data_len = 0x0, data_bufsz = 0x0}, {
      in_use = 0x0, rnode = {spcNode = 0x67f, dbNode = 0x7517968, relNode = 0x7517e77}, forknum = 0x0, blkno = 0xa8a, flags = 0x80, has_image = 0x0, apply_image = 0x0, bkp_image = 0x0, hole_offset = 0x0, hole_length = 0x0, bimg_len = 0x0, bimg_info = 0x0,
      has_data = 0x0, data = 0x0, data_len = 0x0, data_bufsz = 0x0}, {in_use = 0x0, rnode = {spcNode = 0x0, dbNode = 0x0, relNode = 0x0}, forknum = 0x0, blkno = 0x0, flags = 0x0, has_image = 0x0, apply_image = 0x0, bkp_image = 0x0, hole_offset = 0x0, hole_length = 0x0,
      bimg_len = 0x0, bimg_info = 0x0, has_data = 0x0, data = 0x0, data_len = 0x0, data_bufsz = 0x0} <repeats 29 times>}, max_block_id = 0x0, readBuf = 0x4632868, readLen = 0x2000, readSegNo = 0x39, readOff = 0x1e8e000, readPageTLI = 0x1, latestPagePtr = 0xe41e8e000,
  latestPageTLI = 0x1, currRecPtr = 0xe41e8fb28, currTLI = 0x0, currTLIValidUntil = 0x0, nextTLI = 0x0, readRecordBuf = 0x4638888, readRecordBufSize = 0xa000, errormsg_buf = 0x4634878, noPayload = 0x0, polar_logindex_meta_size = 0x2e}
(gdb) p /x minRecoveryPoint
$11 = 0xe41bbbfd8
(gdb) p reachedConsistency
$12 = true
(gdb) p standbyState
$13 = STANDBY_SNAPSHOT_READY
(gdb) p ArchiveRecoveryRequested
$14 = true

After the crash, standby redo started from 0xDBE1241D8 and reached consistency at 0xE41BBBFD8 because of the previous minRecoveryPoint; it had not yet replayed all WAL records generated before the crash.
From the crash stack we can see that it was reading the clog to check the status of xid 63800060.
But in the WAL we can see that xid 63800060 first appears in the record at lsn 0xE42C22D68, which is beyond the point where consistency was declared.

rmgr: Heap        len (rec/tot):     79/    79, tx:   63800060, lsn: E/42C22D68, prev E/42C22D40, desc: UPDATE off 45 xmax 63800060 ; new off 56 xmax 0, blkref #0: rel 1663/122780008/122781225 blk 14313
rmgr: Btree       len (rec/tot):     64/    64, tx:   63800060, lsn: E/42C22DB8, prev E/42C22D68, desc: INSERT_LEAF off 200, blkref #0: rel 1663/122780008/122781297 blk 2803
rmgr: Btree       len (rec/tot):     64/    64, tx:   63800060, lsn: E/42C22DF8, prev E/42C22DB8, desc: INSERT_LEAF off 333, blkref #0: rel 1663/122780008/122781313 blk 1375
rmgr: Btree       len (rec/tot):     64/    64, tx:   63800060, lsn: E/42C22E38, prev E/42C22DF8, desc: INSERT_LEAF off 259, blkref #0: rel 1663/122780008/132582066 blk 1417
rmgr: Heap        len (rec/tot):    197/   197, tx:   63800060, lsn: E/42C23628, prev E/42C23600, desc: HOT_UPDATE off 35 xmax 63800060 ; new off 55 xmax 0, blkref #0: rel 1663/122780008/122781222 blk 14320
rmgr: Heap        len (rec/tot):     54/    54, tx:   63800060, lsn: E/42C23CF0, prev E/42C23CB0, desc: DELETE off 2 KEYS_UPDATED , blkref #0: rel 1663/122780008/122781230 blk 14847
rmgr: Heap        len (rec/tot):    253/   253, tx:   63800060, lsn: E/42C260E8, prev E/42C260A8, desc: INSERT off 11, blkref #0: rel 1663/122780008/122781230 blk 30300
rmgr: Btree       len (rec/tot):     64/    64, tx:   63800060, lsn: E/42C26308, prev E/42C262C8, desc: INSERT_LEAF off 362, blkref #0: rel 1663/122780008/122781290 blk 2925
rmgr: Btree       len (rec/tot):     64/    64, tx:   63800060, lsn: E/42C266F0, prev E/42C266B0, desc: INSERT_LEAF off 369, blkref #0: rel 1663/122780008/122781315 blk 1377
rmgr: Btree       len (rec/tot):     64/    64, tx:   63800060, lsn: E/42C26AE8, prev E/42C26AC0, desc: INSERT_LEAF off 308, blkref #0: rel 1663/122780008/132498288 blk 583
rmgr: Transaction len (rec/tot):     34/    34, tx:   63800060, lsn: E/42C271A8, prev E/42C27168, desc: COMMIT 2020-03-16 09:56:21.540818 CST
rmgr: Heap2       len (rec/tot):     90/    90, tx:          0, lsn: E/4351D3A0, prev E/4351D360, desc: CLEAN remxid 63800060, blkref #0: rel 1663/122780008/122781225 blk 14313
rmgr: Heap2       len (rec/tot):     96/    96, tx:          0, lsn: E/4381C898, prev E/4381C860, desc: CLEAN remxid 63800060, blkref #0: rel 1663/122780008/122781222 blk 14320

Is this caused by an inconsistency between the clog and the data pages? For example, between minRecoveryPoint 0xE41BBBFD8 and 0xE42C22D68 a dirty data page was flushed to storage but the corresponding clog page was not, and then the crash occurred?

BRS
Ray
