Re:Re:Standby got fatal after the crash recovery - Mailing list pgsql-hackers
From: Thunder
Subject: Re: Re: Standby got fatal after the crash recovery
Msg-id: 48ddd7b5.27a5.170e6713788.Coremail.thunder1@126.com
In response to: Re: Standby got fatal after the crash recovery (Thunder <thunder1@126.com>)
Responses: Re: Re: Standby got fatal after the crash recovery
List: pgsql-hackers
The SLRU error detail is the following:

DETAIL: Could not read from file "/data/pg_xact/003C" at offset 221184: Success.

I think we read /data/pg_xact/003C at offset 221184 and the return value was 0 (the read hit end of file), which is why the reported errno is "Success".
At 2020-03-17 10:36:03, "Thunder" <thunder1@126.com> wrote:
Appreciate any suggestions for this issue? Is there something I misunderstand?
At 2020-03-17 00:33:36, "Thunder" <thunder1@126.com> wrote:
Hello hackers,

Our standby node got a FATAL error after crash recovery. The error was raised in the SLRU module; I changed the log level from ERROR to PANIC and got the following stack:

(gdb) bt
#0  0x00007f0cc47a1277 in raise () from /lib64/libc.so.6
#1  0x00007f0cc47a2968 in abort () from /lib64/libc.so.6
#2  0x0000000000a48347 in errfinish (dummy=dummy@entry=0) at elog.c:616
#3  0x00000000005315dd in SlruReportIOError (ctl=ctl@entry=0xfbad00 <ClogCtlData>, pageno=1947, xid=xid@entry=63800060) at slru.c:1175
#4  0x0000000000533152 in SimpleLruReadPage (ctl=ctl@entry=0xfbad00 <ClogCtlData>, pageno=1947, write_ok=write_ok@entry=true, xid=xid@entry=63800060) at slru.c:610
#5  0x0000000000533350 in SimpleLruReadPage_ReadOnly (ctl=ctl@entry=0xfbad00 <ClogCtlData>, pageno=pageno@entry=1947, xid=xid@entry=63800060) at slru.c:680
#6  0x00000000005293fd in TransactionIdGetStatus (xid=xid@entry=63800060, lsn=lsn@entry=0x7ffd17fc5130) at clog.c:661
#7  0x000000000053574a in TransactionLogFetch (transactionId=63800060) at transam.c:79
#8  TransactionIdDidCommit (transactionId=transactionId@entry=63800060) at transam.c:129
#9  0x00000000004f1295 in HeapTupleHeaderAdvanceLatestRemovedXid (tuple=0x2aab27e936e0, latestRemovedXid=latestRemovedXid@entry=0x7ffd17fc51b0) at heapam.c:7672
#10 0x00000000005103e0 in btree_xlog_delete_get_latestRemovedXid (record=record@entry=0x4636c98) at nbtxlog.c:656
#11 0x0000000000510a19 in btree_xlog_delete (record=0x4636c98) at nbtxlog.c:707
#12 btree_redo (record=0x4636c98) at nbtxlog.c:1048
#13 0x00000000005544a1 in StartupXLOG () at xlog.c:7825
#14 0x00000000008185be in StartupProcessMain () at startup.c:226
#15 0x000000000058de15 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffd17fc9430) at bootstrap.c:448
#16 0x0000000000813fe4 in StartChildProcess (type=StartupProcess) at postmaster.c:5804
#17 0x0000000000817eb0 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x45ed6e0) at postmaster.c:1461
#18 0x00000000004991f4 in main (argc=3, argv=0x45ed6e0) at main.c:232

(gdb) p /x *record
$10 = {wal_segment_size = 0x40000000, read_page = 0x54f920, system_identifier = 0x5e6e6ea4af938064, private_data = 0x7ffd17fc5390,
  ReadRecPtr = 0xe41e8fb28, EndRecPtr = 0xe41e8fbb0, decoded_record = 0x4634390, main_data = 0x4634c88, main_data_len = 0x54,
  main_data_bufsz = 0x1000, record_origin = 0x0,
  blocks = {{in_use = 0x1, rnode = {spcNode = 0x67f, dbNode = 0x7517968, relNode = 0x7517e81}, forknum = 0x0, blkno = 0x55f, flags = 0x0,
      has_image = 0x0, apply_image = 0x0, bkp_image = 0x4632bc1, hole_offset = 0x40, hole_length = 0x1fb0, bimg_len = 0x50, bimg_info = 0x5,
      has_data = 0x0, data = 0x4656938, data_len = 0x0, data_bufsz = 0x2000},
    {in_use = 0x0, rnode = {spcNode = 0x67f, dbNode = 0x7517968, relNode = 0x7517e2b}, forknum = 0x0, blkno = 0x3b5f, flags = 0x80,
      has_image = 0x0, apply_image = 0x0, bkp_image = 0x0, hole_offset = 0x0, hole_length = 0x0, bimg_len = 0x0, bimg_info = 0x0,
      has_data = 0x0, data = 0x46468e8, data_len = 0x0, data_bufsz = 0x2000},
    {in_use = 0x0, rnode = {spcNode = 0x67f, dbNode = 0x7517968, relNode = 0x7e5c370}, forknum = 0x0, blkno = 0x2c3, flags = 0x80,
      has_image = 0x0, apply_image = 0x0, bkp_image = 0x0, hole_offset = 0x0, hole_length = 0x0, bimg_len = 0x0, bimg_info = 0x0,
      has_data = 0x0, data = 0x0, data_len = 0x0, data_bufsz = 0x0},
    {in_use = 0x0, rnode = {spcNode = 0x67f, dbNode = 0x7517968, relNode = 0x7517e77}, forknum = 0x0, blkno = 0xa8a, flags = 0x80,
      has_image = 0x0, apply_image = 0x0, bkp_image = 0x0, hole_offset = 0x0, hole_length = 0x0, bimg_len = 0x0, bimg_info = 0x0,
      has_data = 0x0, data = 0x0, data_len = 0x0, data_bufsz = 0x0},
    {in_use = 0x0, rnode = {spcNode = 0x0, dbNode = 0x0, relNode = 0x0}, forknum = 0x0, blkno = 0x0, flags = 0x0, has_image = 0x0,
      apply_image = 0x0, bkp_image = 0x0, hole_offset = 0x0, hole_length = 0x0, bimg_len = 0x0, bimg_info = 0x0, has_data = 0x0,
      data = 0x0, data_len = 0x0, data_bufsz = 0x0} <repeats 29 times>},
  max_block_id = 0x0, readBuf = 0x4632868, readLen = 0x2000, readSegNo = 0x39, readOff = 0x1e8e000, readPageTLI = 0x1,
  latestPagePtr = 0xe41e8e000, latestPageTLI = 0x1, currRecPtr = 0xe41e8fb28, currTLI = 0x0, currTLIValidUntil = 0x0, nextTLI = 0x0,
  readRecordBuf = 0x4638888, readRecordBufSize = 0xa000, errormsg_buf = 0x4634878, noPayload = 0x0, polar_logindex_meta_size = 0x2e}

(gdb) p /x minRecoveryPoint
$11 = 0xe41bbbfd8
(gdb) p reachedConsistency
$12 = true
(gdb) p standbyState
$13 = STANDBY_SNAPSHOT_READY
(gdb) p ArchiveRecoveryRequested
$14 = true

After the crash, standby redo started from 0xDBE1241D8 and reached consistency at 0xE41BBBFD8 because of the previous minRecoveryPoint. It had not yet replayed all WAL records written after the crash.

From the crash stack we see that it was reading clog to check the status of xid 63800060. But in the WAL file we see that xid 63800060 was first created by the record at lsn 0xE42C22D68:

rmgr: Heap        len (rec/tot):  79/  79, tx: 63800060, lsn: E/42C22D68, prev E/42C22D40, desc: UPDATE off 45 xmax 63800060 ; new off 56 xmax 0, blkref #0: rel 1663/122780008/122781225 blk 14313
rmgr: Btree       len (rec/tot):  64/  64, tx: 63800060, lsn: E/42C22DB8, prev E/42C22D68, desc: INSERT_LEAF off 200, blkref #0: rel 1663/122780008/122781297 blk 2803
rmgr: Btree       len (rec/tot):  64/  64, tx: 63800060, lsn: E/42C22DF8, prev E/42C22DB8, desc: INSERT_LEAF off 333, blkref #0: rel 1663/122780008/122781313 blk 1375
rmgr: Btree       len (rec/tot):  64/  64, tx: 63800060, lsn: E/42C22E38, prev E/42C22DF8, desc: INSERT_LEAF off 259, blkref #0: rel 1663/122780008/132582066 blk 1417
rmgr: Heap        len (rec/tot): 197/ 197, tx: 63800060, lsn: E/42C23628, prev E/42C23600, desc: HOT_UPDATE off 35 xmax 63800060 ; new off 55 xmax 0, blkref #0: rel 1663/122780008/122781222 blk 14320
rmgr: Heap        len (rec/tot):  54/  54, tx: 63800060, lsn: E/42C23CF0, prev E/42C23CB0, desc: DELETE off 2 KEYS_UPDATED , blkref #0: rel 1663/122780008/122781230 blk 14847
rmgr: Heap        len (rec/tot): 253/ 253, tx: 63800060, lsn: E/42C260E8, prev E/42C260A8, desc: INSERT off 11, blkref #0: rel 1663/122780008/122781230 blk 30300
rmgr: Btree       len (rec/tot):  64/  64, tx: 63800060, lsn: E/42C26308, prev E/42C262C8, desc: INSERT_LEAF off 362, blkref #0: rel 1663/122780008/122781290 blk 2925
rmgr: Btree       len (rec/tot):  64/  64, tx: 63800060, lsn: E/42C266F0, prev E/42C266B0, desc: INSERT_LEAF off 369, blkref #0: rel 1663/122780008/122781315 blk 1377
rmgr: Btree       len (rec/tot):  64/  64, tx: 63800060, lsn: E/42C26AE8, prev E/42C26AC0, desc: INSERT_LEAF off 308, blkref #0: rel 1663/122780008/132498288 blk 583
rmgr: Transaction len (rec/tot):  34/  34, tx: 63800060, lsn: E/42C271A8, prev E/42C27168, desc: COMMIT 2020-03-16 09:56:21.540818 CST
rmgr: Heap2       len (rec/tot):  90/  90, tx:        0, lsn: E/4351D3A0, prev E/4351D360, desc: CLEAN remxid 63800060, blkref #0: rel 1663/122780008/122781225 blk 14313
rmgr: Heap2       len (rec/tot):  96/  96, tx:        0, lsn: E/4381C898, prev E/4381C860, desc: CLEAN remxid 63800060, blkref #0: rel 1663/122780008/122781222 blk 14320

Is this caused by an inconsistency between the clog and data pages? That is, between minRecoveryPoint 0xE41BBBFD8 and 0xE42C22D68 a dirty data page was flushed to storage, but the clog was not flushed, and then the node crashed?

BR
Ray