Thread: Standby got FATAL after crash recovery
Hello hackers:
Our standby node hit a FATAL error after crash recovery. The error was raised in the SLRU module; I changed the log level from ERROR to PANIC and captured the following stack:
(gdb) bt
#0 0x00007f0cc47a1277 in raise () from /lib64/libc.so.6
#1 0x00007f0cc47a2968 in abort () from /lib64/libc.so.6
#2 0x0000000000a48347 in errfinish (dummy=dummy@entry=0) at elog.c:616
#3 0x00000000005315dd in SlruReportIOError (ctl=ctl@entry=0xfbad00 <ClogCtlData>, pageno=1947, xid=xid@entry=63800060) at slru.c:1175
#4 0x0000000000533152 in SimpleLruReadPage (ctl=ctl@entry=0xfbad00 <ClogCtlData>, pageno=1947, write_ok=write_ok@entry=true, xid=xid@entry=63800060) at slru.c:610
#5 0x0000000000533350 in SimpleLruReadPage_ReadOnly (ctl=ctl@entry=0xfbad00 <ClogCtlData>, pageno=pageno@entry=1947, xid=xid@entry=63800060) at slru.c:680
#6 0x00000000005293fd in TransactionIdGetStatus (xid=xid@entry=63800060, lsn=lsn@entry=0x7ffd17fc5130) at clog.c:661
#7 0x000000000053574a in TransactionLogFetch (transactionId=63800060) at transam.c:79
#8 TransactionIdDidCommit (transactionId=transactionId@entry=63800060) at transam.c:129
#9 0x00000000004f1295 in HeapTupleHeaderAdvanceLatestRemovedXid (tuple=0x2aab27e936e0, latestRemovedXid=latestRemovedXid@entry=0x7ffd17fc51b0) at heapam.c:7672
#10 0x00000000005103e0 in btree_xlog_delete_get_latestRemovedXid (record=record@entry=0x4636c98) at nbtxlog.c:656
#11 0x0000000000510a19 in btree_xlog_delete (record=0x4636c98) at nbtxlog.c:707
#12 btree_redo (record=0x4636c98) at nbtxlog.c:1048
#13 0x00000000005544a1 in StartupXLOG () at xlog.c:7825
#14 0x00000000008185be in StartupProcessMain () at startup.c:226
#15 0x000000000058de15 in AuxiliaryProcessMain (argc=argc@entry=2, argv=argv@entry=0x7ffd17fc9430) at bootstrap.c:448
#16 0x0000000000813fe4 in StartChildProcess (type=StartupProcess) at postmaster.c:5804
#17 0x0000000000817eb0 in PostmasterMain (argc=argc@entry=3, argv=argv@entry=0x45ed6e0) at postmaster.c:1461
#18 0x00000000004991f4 in main (argc=3, argv=0x45ed6e0) at main.c:232
(gdb) p /x *record
$10 = {wal_segment_size = 0x40000000, read_page = 0x54f920, system_identifier = 0x5e6e6ea4af938064, private_data = 0x7ffd17fc5390, ReadRecPtr = 0xe41e8fb28, EndRecPtr = 0xe41e8fbb0, decoded_record = 0x4634390, main_data = 0x4634c88, main_data_len = 0x54,
main_data_bufsz = 0x1000, record_origin = 0x0, blocks = {{in_use = 0x1, rnode = {spcNode = 0x67f, dbNode = 0x7517968, relNode = 0x7517e81}, forknum = 0x0, blkno = 0x55f, flags = 0x0, has_image = 0x0, apply_image = 0x0, bkp_image = 0x4632bc1, hole_offset = 0x40,
hole_length = 0x1fb0, bimg_len = 0x50, bimg_info = 0x5, has_data = 0x0, data = 0x4656938, data_len = 0x0, data_bufsz = 0x2000}, {in_use = 0x0, rnode = {spcNode = 0x67f, dbNode = 0x7517968, relNode = 0x7517e2b}, forknum = 0x0, blkno = 0x3b5f, flags = 0x80,
has_image = 0x0, apply_image = 0x0, bkp_image = 0x0, hole_offset = 0x0, hole_length = 0x0, bimg_len = 0x0, bimg_info = 0x0, has_data = 0x0, data = 0x46468e8, data_len = 0x0, data_bufsz = 0x2000}, {in_use = 0x0, rnode = {spcNode = 0x67f, dbNode = 0x7517968,
relNode = 0x7e5c370}, forknum = 0x0, blkno = 0x2c3, flags = 0x80, has_image = 0x0, apply_image = 0x0, bkp_image = 0x0, hole_offset = 0x0, hole_length = 0x0, bimg_len = 0x0, bimg_info = 0x0, has_data = 0x0, data = 0x0, data_len = 0x0, data_bufsz = 0x0}, {
in_use = 0x0, rnode = {spcNode = 0x67f, dbNode = 0x7517968, relNode = 0x7517e77}, forknum = 0x0, blkno = 0xa8a, flags = 0x80, has_image = 0x0, apply_image = 0x0, bkp_image = 0x0, hole_offset = 0x0, hole_length = 0x0, bimg_len = 0x0, bimg_info = 0x0,
has_data = 0x0, data = 0x0, data_len = 0x0, data_bufsz = 0x0}, {in_use = 0x0, rnode = {spcNode = 0x0, dbNode = 0x0, relNode = 0x0}, forknum = 0x0, blkno = 0x0, flags = 0x0, has_image = 0x0, apply_image = 0x0, bkp_image = 0x0, hole_offset = 0x0, hole_length = 0x0,
bimg_len = 0x0, bimg_info = 0x0, has_data = 0x0, data = 0x0, data_len = 0x0, data_bufsz = 0x0} <repeats 29 times>}, max_block_id = 0x0, readBuf = 0x4632868, readLen = 0x2000, readSegNo = 0x39, readOff = 0x1e8e000, readPageTLI = 0x1, latestPagePtr = 0xe41e8e000,
latestPageTLI = 0x1, currRecPtr = 0xe41e8fb28, currTLI = 0x0, currTLIValidUntil = 0x0, nextTLI = 0x0, readRecordBuf = 0x4638888, readRecordBufSize = 0xa000, errormsg_buf = 0x4634878, noPayload = 0x0, polar_logindex_meta_size = 0x2e}
(gdb) p /x minRecoveryPoint
$11 = 0xe41bbbfd8
(gdb) p reachedConsistency
$12 = true
(gdb) p standbyState
$13 = STANDBY_SNAPSHOT_READY
(gdb) p ArchiveRecoveryRequested
$14 = true
After the crash, the standby's redo started from 0xDBE1241D8 and reached consistency at 0xE41BBBFD8 because of the previous minRecoveryPoint. It had not yet replayed all the WAL records written before the crash.
From the crash stack we can see that it was reading clog to check the status of xid 63800060. But in the WAL, xid 63800060 first appears in the record at lsn 0xE42C22D68:
rmgr: Heap len (rec/tot): 79/ 79, tx: 63800060, lsn: E/42C22D68, prev E/42C22D40, desc: UPDATE off 45 xmax 63800060 ; new off 56 xmax 0, blkref #0: rel 1663/122780008/122781225 blk 14313
rmgr: Btree len (rec/tot): 64/ 64, tx: 63800060, lsn: E/42C22DB8, prev E/42C22D68, desc: INSERT_LEAF off 200, blkref #0: rel 1663/122780008/122781297 blk 2803
rmgr: Btree len (rec/tot): 64/ 64, tx: 63800060, lsn: E/42C22DF8, prev E/42C22DB8, desc: INSERT_LEAF off 333, blkref #0: rel 1663/122780008/122781313 blk 1375
rmgr: Btree len (rec/tot): 64/ 64, tx: 63800060, lsn: E/42C22E38, prev E/42C22DF8, desc: INSERT_LEAF off 259, blkref #0: rel 1663/122780008/132582066 blk 1417
rmgr: Heap len (rec/tot): 197/ 197, tx: 63800060, lsn: E/42C23628, prev E/42C23600, desc: HOT_UPDATE off 35 xmax 63800060 ; new off 55 xmax 0, blkref #0: rel 1663/122780008/122781222 blk 14320
rmgr: Heap len (rec/tot): 54/ 54, tx: 63800060, lsn: E/42C23CF0, prev E/42C23CB0, desc: DELETE off 2 KEYS_UPDATED , blkref #0: rel 1663/122780008/122781230 blk 14847
rmgr: Heap len (rec/tot): 253/ 253, tx: 63800060, lsn: E/42C260E8, prev E/42C260A8, desc: INSERT off 11, blkref #0: rel 1663/122780008/122781230 blk 30300
rmgr: Btree len (rec/tot): 64/ 64, tx: 63800060, lsn: E/42C26308, prev E/42C262C8, desc: INSERT_LEAF off 362, blkref #0: rel 1663/122780008/122781290 blk 2925
rmgr: Btree len (rec/tot): 64/ 64, tx: 63800060, lsn: E/42C266F0, prev E/42C266B0, desc: INSERT_LEAF off 369, blkref #0: rel 1663/122780008/122781315 blk 1377
rmgr: Btree len (rec/tot): 64/ 64, tx: 63800060, lsn: E/42C26AE8, prev E/42C26AC0, desc: INSERT_LEAF off 308, blkref #0: rel 1663/122780008/132498288 blk 583
rmgr: Transaction len (rec/tot): 34/ 34, tx: 63800060, lsn: E/42C271A8, prev E/42C27168, desc: COMMIT 2020-03-16 09:56:21.540818 CST
rmgr: Heap2 len (rec/tot): 90/ 90, tx: 0, lsn: E/4351D3A0, prev E/4351D360, desc: CLEAN remxid 63800060, blkref #0: rel 1663/122780008/122781225 blk 14313
rmgr: Heap2 len (rec/tot): 96/ 96, tx: 0, lsn: E/4381C898, prev E/4381C860, desc: CLEAN remxid 63800060, blkref #0: rel 1663/122780008/122781222 blk 14320
Could this be caused by an inconsistency between clog and data pages? For example: between minRecoveryPoint 0xE41BBBFD8 and 0xE42C22D68, a dirty data page was flushed to storage but the corresponding clog page was not, and then the crash happened?
BRS
Ray
I would appreciate any suggestions on this issue.
Is there something I have misunderstood?
The SLRU error detail is the following:
DETAIL: Could not read from file "/data/pg_xact/003C" at offset 221184: Success.
I believe the read() from "/data/pg_xact/003C" at offset 221184 returned 0.
On Tue, Mar 17, 2020 at 11:02:24AM +0800, Thunder wrote:
> The slru error detail is the following.
> DETAIL: Could not read from file "/data/pg_xact/003C" at offset 221184: Success.
>
> I think read /data/pg_xact/003C from offset 221184 and the return value was 0.

What is the version of PostgreSQL you are using here? You have not mentioned this information in any of the three emails you sent on this thread.
--
Michael
Sorry.
We are using pg11, and cloned from tag REL_11_BETA2.
At 2020-03-17 11:43:51, "Michael Paquier" <michael@paquier.xyz> wrote:
>On Tue, Mar 17, 2020 at 11:02:24AM +0800, Thunder wrote:
>> The slru error detail is the following.
>> DETAIL: Could not read from file "/data/pg_xact/003C" at offset 221184: Success.
>>
>> I think read /data/pg_xact/003C from offset 221184 and the return value was 0.
>
>What is the version of PostgreSQL you are using here? You have not
>mentioned this information in any of the three emails you sent on this
>thread.
>--
>Michael
On 2020/03/17 12:53, Thunder wrote:
> Sorry.
> We are using pg11, and cloned from tag REL_11_BETA2.

In that case you should upgrade to the current version in the PostgreSQL 11 series (at the time of writing 11.7).

Regards

Ian Barwick
--
Ian Barwick  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Mar 18, 2020 at 11:19:22AM +0900, Ian Barwick wrote:
> On 2020/03/17 12:53, Thunder wrote:
> > Sorry.
> > We are using pg11, and cloned from tag REL_11_BETA2.
>
> In that case you should upgrade to the current version
> in the PostgreSQL 11 series (at the time of writing 11.7).

Definitely. The closest things I can see in this area are 9a1bd8 and c34f80 which had symptoms similar to what you have mentioned here with btree_xlog_delete(), but that's past history.
--
Michael