Thread: crash in 11.5
A daily report crashed repeatedly this morning running pg11.4. I compiled 11.5 and reproduced it there, too, so I'm including a backtrace built with -O0. I'm trying to dig further into it, but it seems to crash under load, not when I try to narrow it down to a single report; the reports seem to run to completion when run individually.

postmaster[6750]: segfault at 7f4a527545bc ip 00000000004834ae sp 00007ffd547e2760 error 4 in postgres[400000+6e6000]
postmaster[29786]: segfault at 7f49a968d000 ip 00007f49b24abb0f sp 00007ffd547e2268 error 4 in libc-2.12.so (deleted)[7f49b2422000+18b000]

Core was generated by `postgres: telsasoft ts 192.168.122.11(35454) SELECT '.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000004877df in slot_deform_tuple (slot=0xe204638, natts=3) at heaptuple.c:1465
#1  0x0000000000487f5b in slot_getsomeattrs (slot=0xe204638, attnum=3) at heaptuple.c:1673
#2  0x00000000006d5aae in ExecInterpExpr (state=0xe257948, econtext=0xe204578, isnull=0x7ffff5f38077) at execExprInterp.c:443
#3  0x00000000006ed221 in ExecEvalExprSwitchContext (state=0xe257948, econtext=0xe204578, isNull=0x7ffff5f38077) at ../../../src/include/executor/executor.h:313
#4  0x00000000006ed30e in ExecQual (state=0xe257948, econtext=0xe204578) at ../../../src/include/executor/executor.h:382
#5  0x00000000006ed5f4 in ExecScan (node=0x252edc8, accessMtd=0x71c54c <SeqNext>, recheckMtd=0x71c623 <SeqRecheck>) at execScan.c:190
#6  0x000000000071c670 in ExecSeqScan (pstate=0x252edc8) at nodeSeqscan.c:129
#7  0x00000000006eb855 in ExecProcNodeInstr (node=0x252edc8) at execProcnode.c:461
#8  0x00000000006f6555 in ExecProcNode (node=0x252edc8) at ../../../src/include/executor/executor.h:247
#9  0x00000000006f6a05 in ExecAppend (pstate=0xe20e7b0) at nodeAppend.c:294
#10 0x00000000006eb855 in ExecProcNodeInstr (node=0xe20e7b0) at execProcnode.c:461
#11 0x0000000000708a96 in ExecProcNode (node=0xe20e7b0) at ../../../src/include/executor/executor.h:247
#12 0x0000000000709b0d in ExecHashJoinOuterGetTuple (outerNode=0xe20e7b0, hjstate=0xe20e4d8, hashvalue=0x7ffff5f3827c) at nodeHashjoin.c:821
#13 0x0000000000708f8d in ExecHashJoinImpl (pstate=0xe20e4d8, parallel=false) at nodeHashjoin.c:355
#14 0x00000000007094ca in ExecHashJoin (pstate=0xe20e4d8) at nodeHashjoin.c:565
#15 0x00000000006eb855 in ExecProcNodeInstr (node=0xe20e4d8) at execProcnode.c:461
#16 0x000000000071d9da in ExecProcNode (node=0xe20e4d8) at ../../../src/include/executor/executor.h:247
#17 0x000000000071db26 in ExecSort (pstate=0xe20e3c0) at nodeSort.c:107
#18 0x00000000006eb855 in ExecProcNodeInstr (node=0xe20e3c0) at execProcnode.c:461
#19 0x00000000006eb826 in ExecProcNodeFirst (node=0xe20e3c0) at execProcnode.c:445
#20 0x00000000006f74c5 in ExecProcNode (node=0xe20e3c0) at ../../../src/include/executor/executor.h:247
#21 0x00000000006f79bb in fetch_input_tuple (aggstate=0xe20df70) at nodeAgg.c:406
#22 0x00000000006f9d11 in agg_retrieve_direct (aggstate=0xe20df70) at nodeAgg.c:1755
#23 0x00000000006f98fa in ExecAgg (pstate=0xe20df70) at nodeAgg.c:1570
#24 0x00000000006eb855 in ExecProcNodeInstr (node=0xe20df70) at execProcnode.c:461
#25 0x00000000006eb826 in ExecProcNodeFirst (node=0xe20df70) at execProcnode.c:445
#26 0x0000000000708a96 in ExecProcNode (node=0xe20df70) at ../../../src/include/executor/executor.h:247
#27 0x0000000000709b0d in ExecHashJoinOuterGetTuple (outerNode=0xe20df70, hjstate=0x2778580, hashvalue=0x7ffff5f3865c) at nodeHashjoin.c:821
#28 0x0000000000708f8d in ExecHashJoinImpl (pstate=0x2778580, parallel=false) at nodeHashjoin.c:355
#29 0x00000000007094ca in ExecHashJoin (pstate=0x2778580) at nodeHashjoin.c:565
#30 0x00000000006eb855 in ExecProcNodeInstr (node=0x2778580) at execProcnode.c:461
#31 0x00000000006eb826 in ExecProcNodeFirst (node=0x2778580) at execProcnode.c:445
#32 0x0000000000702a9b in ExecProcNode (node=0x2778580) at ../../../src/include/executor/executor.h:247
#33 0x0000000000702d12 in MultiExecPrivateHash (node=0x27783a8) at nodeHash.c:164
#34 0x0000000000702c8c in MultiExecHash (node=0x27783a8) at nodeHash.c:114
#35 0x00000000006eb8f7 in MultiExecProcNode (node=0x27783a8) at execProcnode.c:501
#36 0x0000000000708e2a in ExecHashJoinImpl (pstate=0x2777210, parallel=false) at nodeHashjoin.c:290
#37 0x00000000007094ca in ExecHashJoin (pstate=0x2777210) at nodeHashjoin.c:565
#38 0x00000000006eb855 in ExecProcNodeInstr (node=0x2777210) at execProcnode.c:461
#39 0x00000000006eb826 in ExecProcNodeFirst (node=0x2777210) at execProcnode.c:445
#40 0x000000000071d9da in ExecProcNode (node=0x2777210) at ../../../src/include/executor/executor.h:247
#41 0x000000000071db26 in ExecSort (pstate=0x2776ce0) at nodeSort.c:107
#42 0x00000000006eb855 in ExecProcNodeInstr (node=0x2776ce0) at execProcnode.c:461
#43 0x00000000006eb826 in ExecProcNodeFirst (node=0x2776ce0) at execProcnode.c:445
#44 0x00000000006f74c5 in ExecProcNode (node=0x2776ce0) at ../../../src/include/executor/executor.h:247
#45 0x00000000006f79bb in fetch_input_tuple (aggstate=0x2776ec8) at nodeAgg.c:406
#46 0x00000000006f9d11 in agg_retrieve_direct (aggstate=0x2776ec8) at nodeAgg.c:1755
#47 0x00000000006f98fa in ExecAgg (pstate=0x2776ec8) at nodeAgg.c:1570
#48 0x00000000006eb855 in ExecProcNodeInstr (node=0x2776ec8) at execProcnode.c:461
#49 0x00000000006eb826 in ExecProcNodeFirst (node=0x2776ec8) at execProcnode.c:445
#50 0x00000000006e0129 in ExecProcNode (node=0x2776ec8) at ../../../src/include/executor/executor.h:247
#51 0x00000000006e2abd in ExecutePlan (estate=0x27769d8, planstate=0x2776ec8, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0xa66bef0, execute_once=true) at execMain.c:1723
#52 0x00000000006e0710 in standard_ExecutorRun (queryDesc=0xdeb5f38, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:364
#53 0x00007fd37e757633 in pgss_ExecutorRun (queryDesc=0xdeb5f38, direction=ForwardScanDirection, count=0, execute_once=true) at pg_stat_statements.c:892
#54 0x00007fd37e2d86ef in explain_ExecutorRun (queryDesc=0xdeb5f38, direction=ForwardScanDirection, count=0, execute_once=true) at auto_explain.c:281
#55 0x00000000006e0511 in ExecutorRun (queryDesc=0xdeb5f38, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:305
#56 0x00000000008cfebe in PortalRunSelect (portal=0x21d4bf8, forward=true, count=0, dest=0xa66bef0) at pquery.c:932
#57 0x00000000008cfb4c in PortalRun (portal=0x21d4bf8, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0xa66bef0, altdest=0xa66bef0, completionTag=0x7ffff5f39100 "") at pquery.c:773
#58 0x00000000008c9b22 in exec_simple_query (query_string=0xe2fbd80 "--BEGIN SQL\nSELECT site_office AS site_gran,\n\tsite_location||'.'||sect_num AS bs,\n\tgsm_site_name_to_sect_name(site_name, sect_num, sect_name) AS sitename,\n\tperiod AS start_time,\n\t(maybe_div(sum(rrc_su"...) at postgres.c:1145
#59 0x00000000008cddf3 in PostgresMain (argc=1, argv=0x218fef8, dbname=0x215da48 "ts", username=0x218fed0 "telsasoft") at postgres.c:4182
#60 0x000000000082a098 in BackendRun (port=0x21867d0) at postmaster.c:4358
#61 0x0000000000829806 in BackendStartup (port=0x21867d0) at postmaster.c:4030
#62 0x0000000000825cab in ServerLoop () at postmaster.c:1707
#63 0x00000000008255dd in PostmasterMain (argc=3, argv=0x215b9a0) at postmaster.c:1380
#64 0x000000000074ba30 in main (argc=3, argv=0x215b9a0) at main.c:228

(gdb)
#0  0x00000000004877df in slot_deform_tuple (slot=0xe204638, natts=3) at heaptuple.c:1465
        thisatt = 0x5265db0
        tuple = 0xe838d40
        tupleDesc = 0x5265d20
        values = 0xe204698
        isnull = 0xe2057e8
        tup = 0x7fd3eb1b8058
        hasnulls = false
        attnum = 1
        tp = 0x7fd3eb1b80bd <Address 0x7fd3eb1b80bd out of bounds>
        off = 583369983
        bp = 0x7fd3eb1b806f <Address 0x7fd3eb1b806f out of bounds>
        slow = true
#1  0x0000000000487f5b in slot_getsomeattrs (slot=0xe204638, attnum=3) at heaptuple.c:1673
        tuple = 0xe838d40
        attno = 3
        __func__ = "slot_getsomeattrs"
#2  0x00000000006d5aae in ExecInterpExpr (state=0xe257948, econtext=0xe204578, isnull=0x7ffff5f38077) at execExprInterp.c:443
        op = 0xe257530
        resultslot = 0x0
        innerslot = 0x0
        outerslot = 0x0
        scanslot = 0xe204638
        dispatch_table = {0x6d5a0d, 0x6d5a32, 0x6d5a61, 0x6d5a93, 0x6d5ac5, 0x6d5b74, 0x6d5c23, 0x6d5cd2, 0x6d5da3, 0x6d5e74, 0x6d5f45, 0x6d5f7c, 0x6d6056, 0x6d6130, 0x6d620a, 0x6d627e, 0x6d6347, 0x6d638f, 0x6d640e, 0x6d64ec, 0x6d6523, 0x6d655a, 0x6d6568, 0x6d65e9, 0x6d6657, 0x6d6665, 0x6d66e6, 0x6d6754, 0x6d6790, 0x6d6821, 0x6d6856, 0x6d68b4, 0x6d6915, 0x6d6986, 0x6d69ca, 0x6d6a14, 0x6d6a4b, 0x6d6a82, 0x6d6acb, 0x6d6b3b, 0x6d6bab, 0x6d6bf4, 0x6d6c2b, 0x6d6c62, 0x6d6ca1, 0x6d6dc3, 0x6d6e2e, 0x6d704c, 0x6d7161, 0x6d726f, 0x6d7376, 0x6d73a6, 0x6d73d6, 0x6d7406, 0x6d7436, 0x6d746d, 0x6d749d, 0x6d7625, 0x6d7724, 0x6d7754, 0x6d778b, 0x6d77c2, 0x6d77f9, 0x6d7862, 0x6d7892, 0x6d78c2, 0x6d6d32, 0x6d7959, 0x6d7989, 0x6d78f2, 0x6d7929, 0x6d79b9, 0x6d79e9, 0x6d7a98, 0x6d7ac8, 0x6d7b77, 0x6d7bae, 0x6d7be5, 0x6d7c40, 0x6d7cf5, 0x6d7d9a, 0x6d7ead, 0x6d7f52, 0x6d80b4, 0x6d8261, 0x6d8298, 0x6d82cf}
#3  0x00000000006ed221 in ExecEvalExprSwitchContext (state=0xe257948, econtext=0xe204578, isNull=0x7ffff5f38077) at ../../../src/include/executor/executor.h:313
        retDatum = 140737319764160
        oldContext = 0x27768c0
#4  0x00000000006ed30e in ExecQual (state=0xe257948, econtext=0xe204578) at ../../../src/include/executor/executor.h:382
        ret = 3018996923
        isnull = false
#5  0x00000000006ed5f4 in ExecScan (node=0x252edc8, accessMtd=0x71c54c <SeqNext>, recheckMtd=0x71c623 <SeqRecheck>) at execScan.c:190
        slot = 0xe204638
        econtext = 0xe204578
        qual = 0xe257948
        projInfo = 0xe257498
#6  0x000000000071c670 in ExecSeqScan (pstate=0x252edc8) at nodeSeqscan.c:129
        node = 0x252edc8
#7  0x00000000006eb855 in ExecProcNodeInstr (node=0x252edc8) at execProcnode.c:461
        result = 0x702ace
#8  0x00000000006f6555 in ExecProcNode (node=0x252edc8) at ../../../src/include/executor/executor.h:247
        No locals.
#9  0x00000000006f6a05 in ExecAppend (pstate=0xe20e7b0) at nodeAppend.c:294
        subnode = 0x252edc8
        result = 0xe258508
        node = 0xe20e7b0
#10 0x00000000006eb855 in ExecProcNodeInstr (node=0xe20e7b0) at execProcnode.c:461
        result = 0x20e257058
#11 0x0000000000708a96 in ExecProcNode (node=0xe20e7b0) at ../../../src/include/executor/executor.h:247
        No locals.
#12 0x0000000000709b0d in ExecHashJoinOuterGetTuple (outerNode=0xe20e7b0, hjstate=0xe20e4d8, hashvalue=0x7ffff5f3827c) at nodeHashjoin.c:821
        hashtable = 0xe8387c8
        curbatch = 0
        slot = 0x0
#13 0x0000000000708f8d in ExecHashJoinImpl (pstate=0xe20e4d8, parallel=false) at nodeHashjoin.c:355
        node = 0xe20e4d8
        outerNode = 0xe20e7b0
        hashNode = 0xe258720
        joinqual = 0x0
        otherqual = 0x0
        econtext = 0xe20e6f0
        hashtable = 0xe8387c8
        outerTupleSlot = 0xe257058
        hashvalue = 3018996923
        batchno = 0
        parallel_state = 0x0
        __func__ = "ExecHashJoinImpl"
#14 0x00000000007094ca in ExecHashJoin (pstate=0xe20e4d8) at nodeHashjoin.c:565
        No locals.
#15 0x00000000006eb855 in ExecProcNodeInstr (node=0xe20e4d8) at execProcnode.c:461
        result = 0x20c4
#16 0x000000000071d9da in ExecProcNode (node=0xe20e4d8) at ../../../src/include/executor/executor.h:247
        No locals.
#17 0x000000000071db26 in ExecSort (pstate=0xe20e3c0) at nodeSort.c:107
        plannode = 0xa6309c0
        outerNode = 0xe20e4d8
        tupDesc = 0xe2598a0
        node = 0xe20e3c0
        estate = 0x27769d8
        dir = ForwardScanDirection
        tuplesortstate = 0x2766998
        slot = 0xe25b8b8
#18 0x00000000006eb855 in ExecProcNodeInstr (node=0xe20e3c0) at execProcnode.c:461
        result = 0x4785d0
#19 0x00000000006eb826 in ExecProcNodeFirst (node=0xe20e3c0) at execProcnode.c:445
        No locals.
#20 0x00000000006f74c5 in ExecProcNode (node=0xe20e3c0) at ../../../src/include/executor/executor.h:247
        No locals.
#21 0x00000000006f79bb in fetch_input_tuple (aggstate=0xe20df70) at nodeAgg.c:406
        slot = 0xe20e240
#22 0x00000000006f9d11 in agg_retrieve_direct (aggstate=0xe20df70) at nodeAgg.c:1755
        node = 0xa6315d0
        econtext = 0xe20e300
        tmpcontext = 0xe20e1a8
        peragg = 0x252e198
        pergroups = 0xe7cdd40
        outerslot = 0x7ffff5f38520
        firstSlot = 0xe263e70
        result = 0x0
        hasGroupingSets = false
        numGroupingSets = 1
        currentSet = 222632768
        nextSetSize = 0
        numReset = 1
        i = 1
#23 0x00000000006f98fa in ExecAgg (pstate=0xe20df70) at nodeAgg.c:1570
        node = 0xe20df70
        result = 0x0
#24 0x00000000006eb855 in ExecProcNodeInstr (node=0xe20df70) at execProcnode.c:461
        result = 0x4785d0
#25 0x00000000006eb826 in ExecProcNodeFirst (node=0xe20df70) at execProcnode.c:445
        No locals.
#26 0x0000000000708a96 in ExecProcNode (node=0xe20df70) at ../../../src/include/executor/executor.h:247
        No locals.
#27 0x0000000000709b0d in ExecHashJoinOuterGetTuple (outerNode=0xe20df70, hjstate=0x2778580, hashvalue=0x7ffff5f3865c) at nodeHashjoin.c:821
#28 0x0000000000708f8d in ExecHashJoinImpl (pstate=0x2778580, parallel=false) at nodeHashjoin.c:355
        node = 0x2778580
        outerNode = 0xe20df70
        hashNode = 0xe7e2560
        joinqual = 0x0
        otherqual = 0x0
        econtext = 0x2778798
        hashtable = 0xe837df0
        outerTupleSlot = 0x4040
        hashvalue = 0
        batchno = 10926240
        parallel_state = 0x0
        __func__ = "ExecHashJoinImpl"
#29 0x00000000007094ca in ExecHashJoin (pstate=0x2778580) at nodeHashjoin.c:565
        No locals.
#30 0x00000000006eb855 in ExecProcNodeInstr (node=0x2778580) at execProcnode.c:461
        result = 0x4785d0
#31 0x00000000006eb826 in ExecProcNodeFirst (node=0x2778580) at execProcnode.c:445
        No locals.
#32 0x0000000000702a9b in ExecProcNode (node=0x2778580) at ../../../src/include/executor/executor.h:247
        No locals.
#33 0x0000000000702d12 in MultiExecPrivateHash (node=0x27783a8) at nodeHash.c:164
        outerNode = 0x2778580
        hashkeys = 0xe7ebf78
        hashtable = 0xe837cd8
        slot = 0x80000000001
        econtext = 0x27784c0
        hashvalue = 0
#34 0x0000000000702c8c in MultiExecHash (node=0x27783a8) at nodeHash.c:114
        No locals.
#35 0x00000000006eb8f7 in MultiExecProcNode (node=0x27783a8) at execProcnode.c:501
        result = 0x4
        __func__ = "MultiExecProcNode"
#36 0x0000000000708e2a in ExecHashJoinImpl (pstate=0x2777210, parallel=false) at nodeHashjoin.c:290
        node = 0x2777210
        outerNode = 0x27774e8
        hashNode = 0x27783a8
        joinqual = 0x0
        otherqual = 0x0
        econtext = 0x2777428
        hashtable = 0xe837cd8
        outerTupleSlot = 0x7ffff5f388ac
        hashvalue = 0
        batchno = 0
        parallel_state = 0x0
        __func__ = "ExecHashJoinImpl"
#37 0x00000000007094ca in ExecHashJoin (pstate=0x2777210) at nodeHashjoin.c:565
        No locals.
#38 0x00000000006eb855 in ExecProcNodeInstr (node=0x2777210) at execProcnode.c:461
        result = 0x4785d0
#39 0x00000000006eb826 in ExecProcNodeFirst (node=0x2777210) at execProcnode.c:445
        No locals.
#40 0x000000000071d9da in ExecProcNode (node=0x2777210) at ../../../src/include/executor/executor.h:247
        No locals.
#41 0x000000000071db26 in ExecSort (pstate=0x2776ce0) at nodeSort.c:107
        plannode = 0xa632c18
        outerNode = 0x2777210
        tupDesc = 0xe7e6980
        node = 0x2776ce0
        estate = 0x27769d8
        dir = ForwardScanDirection
        tuplesortstate = 0x24feec8
        slot = 0x219bfa0
#42 0x00000000006eb855 in ExecProcNodeInstr (node=0x2776ce0) at execProcnode.c:461
        result = 0x4785d0
#43 0x00000000006eb826 in ExecProcNodeFirst (node=0x2776ce0) at execProcnode.c:445
        No locals.
#44 0x00000000006f74c5 in ExecProcNode (node=0x2776ce0) at ../../../src/include/executor/executor.h:247
        No locals.
#45 0x00000000006f79bb in fetch_input_tuple (aggstate=0x2776ec8) at nodeAgg.c:406
        slot = 0x2776df8
#46 0x00000000006f9d11 in agg_retrieve_direct (aggstate=0x2776ec8) at nodeAgg.c:1755
        node = 0xa633eb0
        econtext = 0x2777150
        tmpcontext = 0x2776bf0
        peragg = 0xe824520
        pergroups = 0xe832e88
        outerslot = 0x7ffff5f38b40
        firstSlot = 0xe81f6b8
        result = 0x7ffff5f38b10
        hasGroupingSets = false
        numGroupingSets = 1
        currentSet = 243398904
        nextSetSize = 0
        numReset = 1
        i = 1
#47 0x00000000006f98fa in ExecAgg (pstate=0x2776ec8) at nodeAgg.c:1570
        node = 0x2776ec8
        result = 0x0
#48 0x00000000006eb855 in ExecProcNodeInstr (node=0x2776ec8) at execProcnode.c:461
        result = 0x4785d0
#49 0x00000000006eb826 in ExecProcNodeFirst (node=0x2776ec8) at execProcnode.c:445
        No locals.
#50 0x00000000006e0129 in ExecProcNode (node=0x2776ec8) at ../../../src/include/executor/executor.h:247
        No locals.
#51 0x00000000006e2abd in ExecutePlan (estate=0x27769d8, planstate=0x2776ec8, use_parallel_mode=false, operation=CMD_SELECT, sendTuples=true, numberTuples=0, direction=ForwardScanDirection, dest=0xa66bef0, execute_once=true) at execMain.c:1723
        slot = 0x150
        current_tuple_count = 0
#52 0x00000000006e0710 in standard_ExecutorRun (queryDesc=0xdeb5f38, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:364
        estate = 0x27769d8
        operation = CMD_SELECT
        dest = 0xa66bef0
        sendTuples = true
        oldcontext = 0xdeb5e20
        __func__ = "standard_ExecutorRun"
#53 0x00007fd37e757633 in pgss_ExecutorRun (queryDesc=0xdeb5f38, direction=ForwardScanDirection, count=0, execute_once=true) at pg_stat_statements.c:892
        save_exception_stack = 0x7ffff5f38dd0
        save_context_stack = 0x0
        local_sigjmp_buf = {{__jmpbuf = {233529144, 9176610699338068304, 4687312, 140737319770032, 0, 0, 9176610699377914192, 9151536278965596496}, __mask_was_saved = 0, __saved_mask = {__val = {4294967302, 243497128, 4295032832, 140737319767408, 7255973, 140737319767408, 68726688595, 41380312, 174276272, 174278840, 41381576, 0, 41381576, 16, 174278840, 140737319767696}}}}
#54 0x00007fd37e2d86ef in explain_ExecutorRun (queryDesc=0xdeb5f38, direction=ForwardScanDirection, count=0, execute_once=true) at auto_explain.c:281
        save_exception_stack = 0x7ffff5f38fa0
        save_context_stack = 0x0
        local_sigjmp_buf = {{__jmpbuf = {233529144, 9176610699451314512, 4687312, 140737319770032, 0, 0, 9176610699340165456, 9151535941362266448}, __mask_was_saved = 0, __saved_mask = {__val = {10926707, 24, 41371824, 41371824, 238009728, 41372168, 8600905475, 35241888, 336, 243497664, 41380032, 243498000, 233528864, 336, 174505712, 140737319767760}}}}
#55 0x00000000006e0511 in ExecutorRun (queryDesc=0xdeb5f38, direction=ForwardScanDirection, count=0, execute_once=true) at execMain.c:305
        No locals.
#56 0x00000000008cfebe in PortalRunSelect (portal=0x21d4bf8, forward=true, count=0, dest=0xa66bef0) at pquery.c:932
        queryDesc = 0xdeb5f38
        direction = ForwardScanDirection
        nprocessed = 233528864
        __func__ = "PortalRunSelect"
#57 0x00000000008cfb4c in PortalRun (portal=0x21d4bf8, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0xa66bef0, altdest=0xa66bef0, completionTag=0x7ffff5f39100 "") at pquery.c:773
        save_exception_stack = 0x7ffff5f39240
        save_context_stack = 0x0
        local_sigjmp_buf = {{__jmpbuf = {174505712, 9176610698715214160, 4687312, 140737319770032, 0, 0, 9176610699432440144, -9176589804060397232}, __mask_was_saved = 0, __saved_mask = {__val = {233529080, 2405784, 21306251184, 128, 13, 616, 174505656, 140737319768192, 10970883, 480, 112, 174505712, 34999312, 174505824, 233528864, 112}}}}
        result = false
        nprocessed = 4687312
        saveTopTransactionResourceOwner = 0x2197a20
        saveTopTransactionContext = 0x27748b0
        saveActivePortal = 0x0
        saveResourceOwner = 0x2197a20
        savePortalContext = 0x0
        saveMemoryContext = 0x27748b0
        __func__ = "PortalRun"
#58 0x00000000008c9b22 in exec_simple_query (query_string=0xe2fbd80 "--BEGIN SQL\nSELECT site_office AS site_gran,\n\tsite_location||'.'||sect_num AS bs,\n\tgsm_site_name_to_sect_name(site_name, sect_num, sect_name) AS sitename,\n\tperiod AS start_time,\n\t(maybe_div(sum(rrc_su"...) at postgres.c:1145
        parsetree = 0xd4f1c38
        portal = 0x21d4bf8
        snapshot_set = true
        commandTag = 0xc5ed4d "SELECT"
        completionTag = "\000\000\000\000\002\000\000\000\n\000\000\000\000\000\000\000P\221\363\365\377\177\000\000\017s\245\000\000\000\000\000\036\020\000\000\000\000\000\000\000\f\026\002\030\020\000\000\200\275/\016\000\000\000\000(\r\026\002\006\000\000"
        querytree_list = 0xe230d58
        plantree_list = 0xa66beb8
        receiver = 0xa66bef0
        format = 0
        dest = DestRemote
        oldcontext = 0x27748b0
        parsetree_list = 0xd4f1c98
        parsetree_item = 0xd4f1c70
        save_log_statement_stats = false
        was_logged = false
        use_implicit_block = false
        msec_str = "\220\221\363\365\377\177\000\000&C\245\000\000\000\000\000\006\000\000\000\030\020\000\000\200\275/\016\000\000\000"
        __func__ = "exec_simple_query"
#59 0x00000000008cddf3 in PostgresMain (argc=1, argv=0x218fef8, dbname=0x215da48 "ts", username=0x218fed0 "telsasoft") at postgres.c:4182
        query_string = 0xe2fbd80 "--BEGIN SQL\nSELECT site_office AS site_gran,\n\tsite_location||'.'||sect_num AS bs,\n\tgsm_site_name_to_sect_name(site_name, sect_num, sect_name) AS sitename,\n\tperiod AS start_time,\n\t(maybe_div(sum(rrc_su"...
        firstchar = 81
        input_message = {data = 0xe2fbd80 "--BEGIN SQL\nSELECT site_office AS site_gran,\n\tsite_location||'.'||sect_num AS bs,\n\tgsm_site_name_to_sect_name(site_name, sect_num, sect_name) AS sitename,\n\tperiod AS start_time,\n\t(maybe_div(sum(rrc_su"..., len = 4121, maxlen = 8192, cursor = 4121}
        local_sigjmp_buf = {{__jmpbuf = {35192568, 9176610698734088528, 4687312, 140737319770032, 0, 0, 9176610698788614480, -9176589805133483696}, __mask_was_saved = 1, __saved_mask = {__val = {0, 0, 4494597, 0, 0, 0, 10923965, 1024, 35138464, 140737319768848, 10928134, 35138464, 34999312, 34999312, 35138440, 8}}}}
        send_ready_for_query = false
        disable_idle_in_transaction_timeout = false
        __func__ = "PostgresMain"
#60 0x000000000082a098 in BackendRun (port=0x21867d0) at postmaster.c:4358
        av = 0x218fef8
        maxac = 2
        ac = 1
        secs = 618513625
        usecs = 637903
        i = 1
        __func__ = "BackendRun"
#61 0x0000000000829806 in BackendStartup (port=0x21867d0) at postmaster.c:4030
        bn = 0x2185ef0
        pid = 0
        __func__ = "BackendStartup"
#62 0x0000000000825cab in ServerLoop () at postmaster.c:1707
        port = 0x21867d0
        i = 0
        rmask = {fds_bits = {8, 0 <repeats 15 times>}}
        selres = 1
        now = 1565198425
        readmask = {fds_bits = {56, 0 <repeats 15 times>}}
        nSockets = 6
        last_lockfile_recheck_time = 1565198368
        last_touch_time = 1565197325
        __func__ = "ServerLoop"
#63 0x00000000008255dd in PostmasterMain (argc=3, argv=0x215b9a0) at postmaster.c:1380
        opt = -1
        status = 0
        userDoption = 0x217d500 "/var/lib/pgsql/11/data"
        listen_addr_saved = true
        i = 64
        output_config_variable = 0x0
        __func__ = "PostmasterMain"
#64 0x000000000074ba30 in main (argc=3, argv=0x215b9a0) at main.c:228
        do_check_root = true
I checked this still happens with max_parallel_workers_per_gather=0.

Now, I just reproduced using SELECT * FROM that table:

(gdb) p thisatt->attrelid
$4 = 2015128626

ts=# SELECT 2015128626::regclass;
regclass | child.huawei_umts_ucell_201908

(gdb) p thisatt->attnum
$1 = 2
(gdb) p attnum    # For the earlier crash this is 1...
$2 = 1

ts=# \dt+ child.huawei_umts_ucell_201908
 child | huawei_umts_ucell_201908 | table | telsasoft | 612 MB |

ts=# \d+ child.huawei_umts_ucell_201908
 bsc6900ucell | text | | not null | | extended | |
 ne_name      | text | | not null | | extended | |
[...]
Partition of: huawei_umts_ucell_metrics FOR VALUES FROM ('2019-08-01 00:00:00') TO ('2019-09-01 00:00:00')
Partition constraint: ((start_time IS NOT NULL) AND (start_time >= '2019-08-01 00:00:00'::timestamp without time zone) AND (start_time < '2019-09-01 00:00:00'::timestamp without time zone))
Indexes:
    "huawei_umts_ucell_201908_unique_idx" UNIQUE, btree (start_time, device_id, bsc6900ucell, ne_name, interval_seconds)
    "huawei_umts_ucell_201908_idx" brin (start_time) WITH (autosummarize='true')
    "huawei_umts_ucell_201908_site_idx" btree (site_id)
Check constraints:
    "huawei_umts_ucell_201908_start_time_check" CHECK (start_time >= '2019-08-01 00:00:00'::timestamp without time zone AND start_time < '2019-09-01 00:00:00'::timestamp without time zone)
Statistics objects:
    "child"."huawei_umts_ucell_201908" (ndistinct) ON bsc6900ucell, ne_name, start_time, interval_seconds, device_id FROM child.huawei_umts_ucell_201908
Options: autovacuum_analyze_threshold=2, autovacuum_analyze_scale_factor=0.005

ts=# SELECT COUNT(1) FROM pg_attribute WHERE attrelid='child.huawei_umts_ucell_201908'::regclass;
count | 560

ts=# SELECT * FROM pg_attribute WHERE attrelid='child.huawei_umts_ucell_201908'::regclass AND attnum>=0 ORDER BY attnum;
attrelid      | 2015128626
attname       | bsc6900ucell
atttypid      | 25
attstattarget | -1
attlen        | -1
attnum        | 1
attndims      | 0
attcacheoff   | -1
atttypmod     | -1
attbyval      | f
attstorage    | x
attalign      | i
attnotnull    | t
atthasdef     | f
atthasmissing | f
attidentity   |
attisdropped  | f
attislocal    | f
attinhcount   | 1
attcollation  | 100
attacl        |
attoptions    |
attfdwoptions |
attmissingval |

The only interesting thing about the parent/siblings is that the previous month was partitioned with "daily" granularity. I adjusted our threshold for that, so that August is partitioned monthly:
[...]
child.huawei_umts_ucell_20190730 FOR VALUES FROM ('2019-07-30 00:00:00') TO ('2019-07-31 00:00:00'),
child.huawei_umts_ucell_20190731 FOR VALUES FROM ('2019-07-31 00:00:00') TO ('2019-08-01 00:00:00'),
child.huawei_umts_ucell_201908 FOR VALUES FROM ('2019-08-01 00:00:00') TO ('2019-09-01 00:00:00')

Program terminated with signal 11, Segmentation fault.
#0  0x00000000004877df in slot_deform_tuple (slot=0x22b5860, natts=554) at heaptuple.c:1465
1465                    off = att_align_pointer(off, thisatt->attalign, -1,
(gdb) bt
#0  0x00000000004877df in slot_deform_tuple (slot=0x22b5860, natts=554) at heaptuple.c:1465
#1  0x0000000000487e4b in slot_getallattrs (slot=0x22b5860) at heaptuple.c:1626
#2  0x000000000048aab9 in printtup (slot=0x22b5860, self=0x2161b28) at printtup.c:392
#3  0x00000000008d0346 in RunFromStore (portal=0x21d5638, direction=ForwardScanDirection, count=0, dest=0x2161b28) at pquery.c:1106
#4  0x00000000008cfe88 in PortalRunSelect (portal=0x21d5638, forward=true, count=0, dest=0x2161b28) at pquery.c:928
#5  0x00000000008cfb4c in PortalRun (portal=0x21d5638, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x2161b28, altdest=0x2161b28, completionTag=0x7ffff5f39100 "") at pquery.c:773
#6  0x00000000008c9b22 in exec_simple_query (query_string=0x2160d28 "FETCH FORWARD 999 FROM _psql_cursor") at postgres.c:1145
#7  0x00000000008cddf3 in PostgresMain (argc=1, argv=0x218bfb0, dbname=0x218beb0 "ts", username=0x218be90 "pryzbyj") at postgres.c:4182
#8  0x000000000082a098 in BackendRun (port=0x2181ac0) at postmaster.c:4358
#9  0x0000000000829806 in BackendStartup (port=0x2181ac0) at postmaster.c:4030
#10 0x0000000000825cab in ServerLoop () at postmaster.c:1707
#11 0x00000000008255dd in PostmasterMain (argc=3, argv=0x215b9a0) at postmaster.c:1380
#12 0x000000000074ba30 in main (argc=3, argv=0x215b9a0) at main.c:228

#0  0x00000000004877df in slot_deform_tuple (slot=0x22b5860, natts=554) at heaptuple.c:1465
        thisatt = 0x22dec50
        tuple = 0x22b58a0
        tupleDesc = 0x22debc0
        values = 0x22b58c0
        isnull = 0x22b6a10
        tup = 0x263f9a0
        hasnulls = false
        attnum = 1
        tp = 0x263fa05 "\374\023\026\213^s#\347(\235=\326\321\067\032\245\321B\026}܋FS\375\244\003\065\336\277;\252O\006\065\320\353\211}F\237\373B\243\357J~\270\"\230ƣ\024xǍ\334\377\202\277S\031\375\351\003\220{\004"
        off = 583369983
        bp = 0x263f9b7 "\270\027$U}\232\246\235\004\255\331\033\006Qp\376E\316h\376\363\247\366Նgy7\311E\224~F\274\023ϋ%\216,\221\331@\024\363\233\070\275\004\254L\217t\262X\227\352\346\347\371\070\321ш\221\350fc\316\r\356\351h\275\213\230\360\203\374\023\026\213^s#\347(\235=\326\321\067\032\245\321B\026}܋FS\375\244\003\065\336\277;\252O\006\065\320\353\211}F\237\373B\243\357J~\270\"\230ƣ\024xǍ\334\377\202\277S\031\375\351\003\220{\004"
        slow = true
#1  0x0000000000487e4b in slot_getallattrs (slot=0x22b5860) at heaptuple.c:1626
        tdesc_natts = 554
        attnum = 554
        tuple = 0x22b58a0
        __func__ = "slot_getallattrs"
#2  0x000000000048aab9 in printtup (slot=0x22b5860, self=0x2161b28) at printtup.c:392
        typeinfo = 0x22debc0
        myState = 0x2161b28
        oldcontext = 0xa8e29a
        buf = 0x2161b78
        natts = 554
        i = 0
#3  0x00000000008d0346 in RunFromStore (portal=0x21d5638, direction=ForwardScanDirection, count=0, dest=0x2161b28) at pquery.c:1106
        oldcontext = 0x21931a0
        ok = true
        forward = true
        current_tuple_count = 879
        slot = 0x22b5860
#4  0x00000000008cfe88 in PortalRunSelect (portal=0x21d5638, forward=true, count=0, dest=0x2161b28) at pquery.c:928
        queryDesc = 0x0
        direction = ForwardScanDirection
        nprocessed = 35205816
        __func__ = "PortalRunSelect"
#5  0x00000000008cfb4c in PortalRun (portal=0x21d5638, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x2161b28, altdest=0x2161b28, completionTag=0x7ffff5f39100 "") at pquery.c:773
        save_exception_stack = 0x7ffff5f39240
        save_context_stack = 0x0
        local_sigjmp_buf = {{__jmpbuf = {35003176, 9176610698715214160, 4687312, 140737319770032, 0, 0, 9176610699432440144, -9176589804060397232}, __mask_was_saved = 0, __saved_mask = {__val = {36636288, 4352, 35002176, 1112, 1176, 0, 35003120, 140737319768192, 10970883, 35488536, 112, 35003176, 34999312, 35003288, 35205536, 112}}}}
        result = false
        nprocessed = 4687312
        saveTopTransactionResourceOwner = 0x2187b88
        saveTopTransactionContext = 0x21f5d90
        saveActivePortal = 0x0
        saveResourceOwner = 0x2187b88
        savePortalContext = 0x0
        saveMemoryContext = 0x21f5d90
        __func__ = "PortalRun"
#6  0x00000000008c9b22 in exec_simple_query (query_string=0x2160d28 "FETCH FORWARD 999 FROM _psql_cursor") at postgres.c:1145
[...]
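For reference, the natts=554 shown in both backtraces is consistent with the catalog output above: the 560 pg_attribute rows counted earlier include the six system columns (attnum < 0), leaving 554 entries in the tuple descriptor. A quick check along these lines (a sketch; it assumes pg_attribute itself is intact) confirms the arithmetic:

    -- dropped columns still count toward the descriptor's natts,
    -- so only the system columns are excluded here
    SELECT count(*)
    FROM pg_attribute
    WHERE attrelid = 'child.huawei_umts_ucell_201908'::regclass
      AND attnum > 0;
    -- expected: 554, matching natts in slot_deform_tuple()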
Just found this, although I'm not sure what to do about it. If it's corrupt table data, I can restore from backup.

ts=# VACUUM FREEZE VERBOSE child.huawei_umts_ucell_201908;
INFO:  00000: aggressively vacuuming "child.huawei_umts_ucell_201908"
LOCATION:  lazy_scan_heap, vacuumlazy.c:502
ERROR:  XX001: found xmin 73850277 from before relfrozenxid 111408920
LOCATION:  heap_prepare_freeze_tuple, heapam.c:6853

I confirmed I updated to 11.4 immediately on its release:

[pryzbyj@database ~]$ ls -ltc /usr/pgsql-11/bin/postgres
-rwxr-xr-x. 1 root root 7291736 Jun 20 07:09 /usr/pgsql-11/bin/postgres

That table would've been created shortly after midnight on Aug 1, when we loaded the first data for the month. So it was created and processed only on pg11.4, although the parent has probably been around since pg_upgrade last October.

Here's a usable-looking bt:

#0  heap_prepare_freeze_tuple (tuple=0x7fd3d8fa0058, relfrozenxid=111408920, relminmxid=1846, cutoff_xid=111658731, cutoff_multi=1846, frz=0x223e930, totally_frozen_p=0x7ffff5f3848f) at heapam.c:6832
        changed = 211
        xmax_already_frozen = 127
        xmin_frozen = false
        freeze_xmax = false
        xid = 0
        __func__ = "heap_prepare_freeze_tuple"
#1  0x00000000006c8e6e in lazy_scan_heap (onerel=0x7fd37e4f8fa0, options=13, vacrelstats=0x223e790, Irel=0x223e8d8, nindexes=3, aggressive=true) at vacuumlazy.c:1151
        tuple_totally_frozen = false
        itemid = 0x7fd3d8f9e918
        buf = 168313
        page = 0x7fd3d8f9e900 "\020:"
        offnum = 1
        maxoff = 3
        hastup = true
        nfrozen = 0
        freespace = 35907664
        all_frozen = true
        tupgone = false
        prev_dead_count = 0
        all_visible_according_to_vm = true
        all_visible = true
        has_dead_tuples = false
        visibility_cutoff_xid = 73850277
        nblocks = 75494
        blkno = 34915
        tuple = {t_len = 2212, t_self = {ip_blkid = {bi_hi = 0, bi_lo = 34915}, ip_posid = 1}, t_tableOid = 2015128626, t_data = 0x7fd3d8fa0058}
        relname = 0x7fd37e4f6ac8 "huawei_umts_ucell_201908"
        relfrozenxid = 111408920
        relminmxid = 1846
        empty_pages = 0
        vacuumed_pages = 0
        next_fsm_block_to_vacuum = 0
        num_tuples = 1
        live_tuples = 1
        tups_vacuumed = 0
        nkeep = 0
        nunused = 0
        indstats = 0x223e850
        i = 32512
        ru0 = {tv = {tv_sec = 1565204981, tv_usec = 609217}, ru = {ru_utime = {tv_sec = 13, tv_usec = 865892}, ru_stime = {tv_sec = 1, tv_usec = 960701}, ru_maxrss = 136988, ru_ixrss = 0, ru_idrss = 0, ru_isrss = 0, ru_minflt = 48841, ru_majflt = 1, ru_nswap = 0, ru_inblock = 196152, ru_oublock = 292656, ru_msgsnd = 0, ru_msgrcv = 0, ru_nsignals = 0, ru_nvcsw = 849, ru_nivcsw = 6002}}
        vmbuffer = 4267
        next_unskippable_block = 34916
        skipping_blocks = false
        frozen = 0x223e930
        buf = {data = 0x0, len = -168590068, maxlen = 32767, cursor = -168589904}
        initprog_index = {0, 1, 5}
        initprog_val = {1, 75494, 21968754}
        __func__ = "lazy_scan_heap"
#2  0x00000000006c77c7 in lazy_vacuum_rel (onerel=0x7fd37e4f8fa0, options=13, params=0x7ffff5f38a70, bstrategy=0x22b8cf8) at vacuumlazy.c:265
        vacrelstats = 0x223e790
        Irel = 0x223e8d8
        nindexes = 3
        ru0 = {tv = {tv_sec = 140733193388036, tv_usec = 281474985219646}, ru = {ru_utime = {tv_sec = 12933152, tv_usec = 35123496}, ru_stime = {tv_sec = 140737319765996, tv_usec = 4687312}, ru_maxrss = 0, ru_ixrss = -1887226286461968708, ru_idrss = 35351240, ru_isrss = 35249392, ru_minflt = 140737319765776, ru_majflt = 35249392, ru_nswap = 20, ru_inblock = 0, ru_oublock = 140737319765936, ru_msgsnd = 10990471, ru_msgrcv = 140546333970336, ru_nsignals = 35249264, ru_nvcsw = 4, ru_nivcsw = 4687312}}
        starttime = 0
        secs = 72057594037927936
        usecs = 2015128626
        read_rate = 6.9533474784178538e-310
        write_rate = 6.9439115263673584e-310
        aggressive = true
        scanned_all_unfrozen = false
        xidFullScanLimit = 111658735
        mxactFullScanLimit = 1846
        new_rel_pages = 0
        new_rel_allvisible = 36424984
        new_live_tuples = 6.9533474784194348e-310
        new_frozen_xid = 9080197
        new_min_multi = 0
        __func__ = "lazy_vacuum_rel"
#3  0x00000000006c72b3 in vacuum_rel (relid=2015128626, relation=0x2161820, options=13, params=0x7ffff5f38a70) at vacuum.c:1557
        lmode = 4
        onerel = 0x7fd37e4f8fa0
        onerelid = {relId = 2015128626, dbId = 16564}
        toast_relid = 2015128673
        save_userid = 18712
        save_sec_context = 0
        save_nestlevel = 2
        rel_lock = true
        __func__ = "vacuum_rel"
#4  0x00000000006c5939 in vacuum (options=13, relations=0x22b8e70, params=0x7ffff5f38a70, bstrategy=0x22b8cf8, isTopLevel=true) at vacuum.c:340
        vrel = 0x22b8e10
        cur = 0x22b8e48
        save_exception_stack = 0x7ffff5f38c20
        save_context_stack = 0x0
        local_sigjmp_buf = {{__jmpbuf = {35002584, 9176610699604406608, 4687312, 140737319770032, 0, 0, 9176610699520520528, -9176587885370396336}, __mask_was_saved = 0, __saved_mask = {__val = {35906192, 140737319766400, 10964897, 6853, 35905936, 208, 35890568, 21474836680, 35890776, 140737319766560, 10927612, 208, 35890288, 35890288, 35890504, 35890544}}}}
        in_vacuum = true
        stmttype = 0xc03841 "VACUUM"
        in_outer_xact = false
        use_own_xacts = true
        __func__ = "vacuum"
#5  0x00000000006c54e0 in ExecVacuum (vacstmt=0x2161910, isTopLevel=true) at vacuum.c:141
        params = {freeze_min_age = 0, freeze_table_age = 0, multixact_freeze_min_age = 0, multixact_freeze_table_age = 0, is_wraparound = false, log_min_duration = -1}
        __func__ = "ExecVacuum"

The errfinish one seems to be not useful?

(gdb) bt
#0  errfinish (dummy=0) at elog.c:415
#1  0x000000000054aefd in ShowTransactionStateRec (str=0xafed12 "StartTransaction", s=0xf9efa0) at xact.c:5157
#2  0x000000000054ad5c in ShowTransactionState (str=0xafed12 "StartTransaction") at xact.c:5130
#3  0x00000000005470c8 in StartTransaction () at xact.c:1961
#4  0x0000000000547db9 in StartTransactionCommand () at xact.c:2734
#5  0x00000000008cbe65 in start_xact_command () at postgres.c:2500
#6  0x00000000008c9772 in exec_simple_query (query_string=0x2160d28 "VACUUM FREEZE VERBOSE child.huawei_umts_ucell_201908;") at postgres.c:948
#7  0x00000000008cddf3 in PostgresMain (argc=1, argv=0x218c768, dbname=0x218c5d0 "ts", username=0x218c5b0 "pryzbyj") at postgres.c:4182
#8  0x000000000082a098 in BackendRun (port=0x21818d0) at postmaster.c:4358
#9  0x0000000000829806 in BackendStartup (port=0x21818d0) at postmaster.c:4030
#10 0x0000000000825cab in ServerLoop () at postmaster.c:1707
#11 0x00000000008255dd in PostmasterMain (argc=3, argv=0x215b9a0) at postmaster.c:1380
#12 0x000000000074ba30 in main (argc=3, argv=0x215b9a0) at main.c:228

#0  errfinish (dummy=0) at elog.c:415
        edata = 0xfe7b80
        elevel = 0
        oldcontext = 0x215df90
        econtext = 0xa4d11a
        __func__ = "errfinish"
#1  0x000000000054aefd in ShowTransactionStateRec (str=0xafed12 "StartTransaction", s=0xf9efa0) at xact.c:5157
        buf = {data = 0x22462d8 "", len = 0, maxlen = 1024, cursor = 0}
        __func__ = "ShowTransactionStateRec"
#2  0x000000000054ad5c in ShowTransactionState (str=0xafed12 "StartTransaction") at xact.c:5130
        No locals.
#3  0x00000000005470c8 in StartTransaction () at xact.c:1961
        s = 0xf9efa0
        vxid = {backendId = 5, localTransactionId = 16}
#4  0x0000000000547db9 in StartTransactionCommand () at xact.c:2734
        s = 0xf9efa0
        __func__ = "StartTransactionCommand"
#5  0x00000000008cbe65 in start_xact_command () at postgres.c:2500
        No locals.
I looked at this briefly:

ts=# SELECT * FROM page_header(get_raw_page('child.huawei_umts_ucell_201908', 0));
lsn       | 3A49/66F44310
checksum  | 0
flags     | 4
lower     | 36
upper     | 1432
special   | 8192
pagesize  | 8192
version   | 4
prune_xid | 0

In case someone wants me to look at bits and pieces, I saved a copy of:
relfilenode | 2015128626
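In case it helps anyone repeat the hunt from SQL: since the relation and the bad xmin are known, a pageinspect scan along these lines could locate the offending tuples (a sketch, not something actually run in this thread; it assumes the pageinspect extension is installed, uses the 8192-byte block size confirmed by pg_controldata later in the thread, and reads every block, so it is slow on a 75494-block table):

    -- find line pointers whose xmin is the one VACUUM complained about
    SELECT blkno, lp, t_xmin, t_ctid
    FROM generate_series(0, pg_relation_size('child.huawei_umts_ucell_201908') / 8192 - 1) AS blkno,
         LATERAL heap_page_items(get_raw_page('child.huawei_umts_ucell_201908', blkno::int))
    WHERE t_xmin = 73850277;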
Hi,

On 2019-08-07 14:29:28 -0500, Justin Pryzby wrote:
> Just found this, although I'm not sure what to do about it. If it's corrupt
> table data, I can restore from backup.
>
> ts=# VACUUM FREEZE VERBOSE child.huawei_umts_ucell_201908;
> INFO:  00000: aggressively vacuuming "child.huawei_umts_ucell_201908"
> LOCATION:  lazy_scan_heap, vacuumlazy.c:502
> ERROR:  XX001: found xmin 73850277 from before relfrozenxid 111408920
> LOCATION:  heap_prepare_freeze_tuple, heapam.c:6853

Ugh :(

We really need to add an error context to vacuumlazy that shows which
block is being processed.

> I confirmed I updated to 11.4 immediately on its release:
> [pryzbyj@database ~]$ ls -ltc /usr/pgsql-11/bin/postgres
> -rwxr-xr-x. 1 root root 7291736 Jun 20 07:09 /usr/pgsql-11/bin/postgres

Any chance you might not have restarted postgres at that time?

> That table would've been created shortly after midnight on Aug 1, when we
> loaded the first data for the month. So it was created and processed only
> on pg11.4, although the parent has probably been around since pg_upgrade
> last October.

I don't think there's a way the parent could play a role.

> Here's a usable-looking bt:
>
> #0  heap_prepare_freeze_tuple (tuple=0x7fd3d8fa0058, relfrozenxid=111408920,
>     relminmxid=1846, cutoff_xid=111658731, cutoff_multi=1846, frz=0x223e930,
>     totally_frozen_p=0x7ffff5f3848f) at heapam.c:6832
>         changed = 211
>         xmax_already_frozen = 127
>         xmin_frozen = false
>         freeze_xmax = false
>         xid = 0
>         __func__ = "heap_prepare_freeze_tuple"

Hm. That's a backtrace to what precisely? Are you sure it's the erroring
call to heap_prepare_freeze_tuple? Because I think that's at the
function's start - which is why some of the stack variables have bogus
contents.

I think you'd need to set the breakpoint to heapam.c:6850 to be sure to
catch the error (while the error message says heapam.c:6853, that's
because the line macro in some compilers expands to the end of the
statement).

> The errfinish one seems to be not useful?
>
> (gdb) bt
> #0  errfinish (dummy=0) at elog.c:415
> #1  0x000000000054aefd in ShowTransactionStateRec (
>     str=0xafed12 "StartTransaction", s=0xf9efa0) at xact.c:5157
> #2  0x000000000054ad5c in ShowTransactionState (
>     str=0xafed12 "StartTransaction") at xact.c:5130
> #3  0x00000000005470c8 in StartTransaction () at xact.c:1961
> #4  0x0000000000547db9 in StartTransactionCommand () at xact.c:2734
> #5  0x00000000008cbe65 in start_xact_command () at postgres.c:2500
> #6  0x00000000008c9772 in exec_simple_query (
>     query_string=0x2160d28 "VACUUM FREEZE VERBOSE child.huawei_umts_ucell_201908;") at postgres.c:948
> #7  0x00000000008cddf3 in PostgresMain (argc=1, argv=0x218c768,
>     dbname=0x218c5d0 "ts", username=0x218c5b0 "pryzbyj") at postgres.c:4182
> #8  0x000000000082a098 in BackendRun (port=0x21818d0) at postmaster.c:4358
> #9  0x0000000000829806 in BackendStartup (port=0x21818d0) at postmaster.c:4030
> #10 0x0000000000825cab in ServerLoop () at postmaster.c:1707
> #11 0x00000000008255dd in PostmasterMain (argc=3, argv=0x215b9a0)
>     at postmaster.c:1380
> #12 0x000000000074ba30 in main (argc=3, argv=0x215b9a0) at main.c:228

That's probably just the errfinish for a debug message. Might be easier
to set a breakpoint to pg_re_throw().

> I looked at this briefly:
>
> ts=# SELECT * FROM page_header(get_raw_page('child.huawei_umts_ucell_201908', 0));
> lsn       | 3A49/66F44310
> checksum  | 0
> flags     | 4
> lower     | 36
> upper     | 1432
> special   | 8192
> pagesize  | 8192
> version   | 4
> prune_xid | 0

That's not necessarily the target block though, right?
It'd be useful to get this for the block with the corruption, and perhaps one before/after. If the backtrace I commented on at the top is correct, the relevant tuple was at "blkno = 34915" (visible a frame or two up).

Also heap_page_items() for those pages would be useful - you might want to not include the t_data column, as it has raw tuple data.

Could you show
* pg_controldata
* SELECT oid, datname, datfrozenxid, datminmxid FROM pg_database;

That should be easy to redact if necessary. It'd be useful to know which database the corrupted table is in.

It'd also be helpful to get something roughly like
  SELECT oid, oid::regclass, relkind, reltoastrelid, relfrozenxid, age(relfrozenxid), relminmxid, mxid_age(relminmxid), relpages, (SELECT txid_current()) FROM pg_class WHERE relfrozenxid <> 0;

But I'm not sure if including the tablenames would be problematic for your case. Possibly a bit hard to infer "usage" correlations without them, but probably still worthwhile.

Greetings,

Andres Freund
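As a cross-check on the numbers those queries return (a sketch of my own, not part of Andres's request; it assumes every relation of relkind r/m/t has a valid relfrozenxid): within the affected database, datfrozenxid is supposed to be no newer than the oldest relfrozenxid, so comparing their ages shows whether the catalogs are at least self-consistent:

    SELECT age((SELECT datfrozenxid FROM pg_database
                WHERE datname = current_database())) AS database_age,
           max(age(relfrozenxid)) AS oldest_relation_age
    FROM pg_class
    WHERE relkind IN ('r', 'm', 't');
    -- database_age >= oldest_relation_age is expected, since datfrozenxid
    -- is only advanced lazily to the minimum relfrozenxid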
On Wed, Aug 07, 2019 at 04:51:54PM -0700, Andres Freund wrote:
> Hi,
>
> On 2019-08-07 14:29:28 -0500, Justin Pryzby wrote:
> > Just found this, although I'm not sure what to do about it. If it's corrupt
> > table data, I can restore from backup.

In the meantime, I've renamed+uninherited the table and restored from backup, with the broken table preserved. However, the previous day's backup did this, so I used Monday's backup.

[pryzbyj@database ~]$ time sudo -u postgres pg_restore /srv/cdrperfbackup/ts/2019-08-06/curtables/ -t huawei_umts_ucell_201908 --verbose -d ts --host /tmp
pg_restore: connecting to database for restore
pg_restore: creating TABLE "child.huawei_umts_ucell_201908"
pg_restore: INFO:  partition constraint for table "huawei_umts_ucell_201908" is implied by existing constraints
pg_restore: processing data for table "child.huawei_umts_ucell_201908"
pg_restore: [archiver (db)] Error while PROCESSING TOC:
pg_restore: [archiver (db)] Error from TOC entry 62710; 0 2015128626 TABLE DATA huawei_umts_ucell_201908 telsasoft
pg_restore: [archiver (db)] COPY failed for table "huawei_umts_ucell_201908": ERROR:  invalid byte sequence for encoding "UTF8": 0xe7 0x28 0x9d
CONTEXT:  COPY huawei_umts_ucell_201908, line 104746

Also, I found a core with this BT from my own manual invocation of ANALYZE.

(gdb) bt
#0  0x00000039674324f5 in raise () from /lib64/libc.so.6
#1  0x0000003967433cd5 in abort () from /lib64/libc.so.6
#2  0x0000000000a3c4b2 in ExceptionalCondition (conditionName=0xaad9e7 "!(j > attnum)", errorType=0xaad84c "FailedAssertion", fileName=0xaad840 "heaptuple.c", lineNumber=582) at assert.c:54
#3  0x00000000004856a7 in nocachegetattr (tuple=0x1c116b08, attnum=1, tupleDesc=0x23332b0) at heaptuple.c:582
#4  0x000000000062981d in std_fetch_func (stats=0x22700c8, rownum=54991, isNull=0x7ffff5f3844f) at analyze.c:1718
#5  0x000000000062aa99 in compute_scalar_stats (stats=0x22700c8, fetchfunc=0x62942b <std_fetch_func>, samplerows=120000, totalrows=227970) at analyze.c:2370
#6  0x00000000006270db in do_analyze_rel (onerel=0x7fd37e2afa50, options=2, params=0x7ffff5f38a70, va_cols=0x0, acquirefunc=0x627e9e <acquire_sample_rows>, relpages=75494, inh=false, in_outer_xact=false, elevel=13) at analyze.c:579
#7  0x0000000000626616 in analyze_rel (relid=2015128626, relation=0x2161820, options=2, params=0x7ffff5f38a70, va_cols=0x0, in_outer_xact=false, bstrategy=0x21f7848) at analyze.c:310
#8  0x00000000006c59b2 in vacuum (options=2, relations=0x21f79c0, params=0x7ffff5f38a70, bstrategy=0x21f7848, isTopLevel=true) at vacuum.c:357
#9  0x00000000006c54e0 in ExecVacuum (vacstmt=0x2161910, isTopLevel=true) at vacuum.c:141
#10 0x00000000008d1d7e in standard_ProcessUtility (pstmt=0x21619d0, queryString=0x2160d28 "ANALYZE child.huawei_umts_ucell_201908;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x2161cc0, completionTag=0x7ffff5f39100 "") at utility.c:670
#3  0x00000000004856a7 in nocachegetattr (tuple=0x1c116b08, attnum=1, tupleDesc=0x23332b0) at heaptuple.c:582
        natts = 554
        j = 1
        tup = 0x1c116b20
        tp = 0x1c116b85 "\374\023\026\213^s#\347(\235=\326\321\067\032\245\321B\026}܋FS\375\244\003\065\336\277;\252O\006\065\320\353\211}F\237\373B\243\357J~\270\"\230ƣ\024xǍ\334\377\202\277S\031\375\351\003\220{\004"
        bp = 0x1c116b37 "\270\027$U}\232\246\235\004\255\331\033\006Qp\376E\316h\376\363\247\366Նgy7\311E\224~F\274\023ϋ%\216,\221\331@\024\363\233\070\275\004\254L\217t\262X\227\352\346\347\371\070\321ш\221\350fc\316\r\356\351h\275\213\230\360\203\374\023\026\213^s#\347(\235=\326\321\067\032\245\321B\026}܋FS\375\244\003\065\336\277;\252O\006\065\320\353\211}F\237\373B\243\357J~\270\"\230ƣ\024xǍ\334\377\202\277S\031\375\351\003\220{\004"
        slow = false
        off = -1
#4  0x000000000062981d in std_fetch_func (stats=0x22700c8, rownum=54991, isNull=0x7ffff5f3844f) at analyze.c:1718
        attnum = 2
        tuple = 0x1c116b08
        tupDesc = 0x23332b0
#5  0x000000000062aa99 in compute_scalar_stats (stats=0x22700c8, fetchfunc=0x62942b <std_fetch_func>, samplerows=120000, totalrows=227970) at analyze.c:2370
        value = 546120480
        isnull = false
        i = 54991
        null_cnt = 0
        nonnull_cnt = 54991
        toowide_cnt = 0
        total_width = 274955
        is_varlena = true
        is_varwidth = true
        corr_xysum = 3.9525251667299724e-323
        ssup = {ssup_cxt = 0x22ae080, ssup_collation = 100, ssup_reverse = false, ssup_nulls_first = false, ssup_attno = 0, ssup_extra = 0x22ae5b0, comparator = 0xa06720 <varstrfastcmp_locale>, abbreviate = false, abbrev_converter = 0, abbrev_abort = 0, abbrev_full_comparator = 0}
        values = 0x204f9970
        values_cnt = 54991
        tupnoLink = 0x206ce5c0
        track = 0x22ae198
        track_cnt = 0
        num_mcv = 100
        num_bins = 100
        mystats = 0x2270490

> > ts=# VACUUM FREEZE VERBOSE child.huawei_umts_ucell_201908;
> > INFO:  00000: aggressively vacuuming "child.huawei_umts_ucell_201908"
> > LOCATION:  lazy_scan_heap, vacuumlazy.c:502
> > ERROR:  XX001: found xmin 73850277 from before relfrozenxid 111408920
> > LOCATION:  heap_prepare_freeze_tuple, heapam.c:6853
>
> Ugh :(
>
> We really need to add an error context to vacuumlazy that shows which
> block is being processed.

> > I confirmed I updated to 11.4 immediately on its release:
> > [pryzbyj@database ~]$ ls -ltc /usr/pgsql-11/bin/postgres
> > -rwxr-xr-x. 1 root root 7291736 Jun 20 07:09 /usr/pgsql-11/bin/postgres
>
> Any chance you might not have restarted postgres at that time?

I don't think so; nagios would've been yelling at me. Also, I found I have a log of pg_settings, which shows server_version updated 2019-06-20 07:05:01.425645.

Is there a bug in 11.3 which could explain it?

> > Here's a usable-looking bt:
> >
> > #0  heap_prepare_freeze_tuple (tuple=0x7fd3d8fa0058, relfrozenxid=111408920,
> >     relminmxid=1846, cutoff_xid=111658731, cutoff_multi=1846, frz=0x223e930,
> >     totally_frozen_p=0x7ffff5f3848f) at heapam.c:6832
> >         changed = 211
> >         xmax_already_frozen = 127
> >         xmin_frozen = false
> >         freeze_xmax = false
> >         xid = 0
> >         __func__ = "heap_prepare_freeze_tuple"
>
> Hm. That's a backtrace to what precisely? Are you sure it's the erroring
> call to heap_prepare_freeze_tuple? Because I think that's at the
> function's start - which is why some of the stack variables have bogus
> contents.

Yes, I had just done this: b heap_prepare_freeze_tuple.
> I think you'd need to set the breakpoint to heapam.c:6850 to be sure to
> catch the error (while the error message says heapam.c:6853, that's
> because the line macro in some compilers expands to the end of the
> statement).

I did 6848 and got:

#0  heap_prepare_freeze_tuple (tuple=0x7fd403a02058, relfrozenxid=111408920, relminmxid=1846, cutoff_xid=111665057, cutoff_multi=1846, frz=0x22aa6f8, totally_frozen_p=0x7ffff5f3848f) at heapam.c:6849
        changed = false
        xmax_already_frozen = false
        xmin_frozen = false
        freeze_xmax = false
        xid = 73850277
        __func__ = "heap_prepare_freeze_tuple"

> > The errfinish one seems to be not useful?
> >
> > (gdb) bt
> > #0  errfinish (dummy=0) at elog.c:415
> > #1  0x000000000054aefd in ShowTransactionStateRec (
> >     str=0xafed12 "StartTransaction", s=0xf9efa0) at xact.c:5157
> > [...]
>
> That's probably just the errfinish for a debug message. Might be easier
> to set a breakpoint to pg_re_throw().

Right, that works.

(gdb) bt
#0  pg_re_throw () at elog.c:1730
#1  0x0000000000a3cb1e in errfinish (dummy=0) at elog.c:467
#2  0x00000000004e72c4 in heap_prepare_freeze_tuple (tuple=0x7fd403a02058, relfrozenxid=111408920, relminmxid=1846, cutoff_xid=111665065, cutoff_multi=1846, frz=0x2260558, totally_frozen_p=0x7ffff5f3848f) at heapam.c:6850
#3  0x00000000006c8e6e in lazy_scan_heap (onerel=0x7fd37d5f99c8, options=13, vacrelstats=0x22603b8, Irel=0x2260500, nindexes=3, aggressive=true) at vacuumlazy.c:1151

> > I looked at this briefly:
> >
> > ts=# SELECT * FROM page_header(get_raw_page('child.huawei_umts_ucell_201908', 0));
> > lsn       | 3A49/66F44310
> > checksum  | 0
> > flags     | 4
> > lower     | 36
> > upper     | 1432
> > special   | 8192
> > pagesize  | 8192
> > version   | 4
> > prune_xid | 0
>
> That's not necessarily the target block though, right? It'd be useful
> to get this for the block with the corruption, and perhaps one
> before/after. If the backtrace I commented on at the top is correct,
> the relevant tuple was at "blkno = 34915" (visible a frame or two
> up).

Ah, right.

ts=# SELECT * FROM page_header(get_raw_page('child.huawei_umts_ucell_201908', 34914));
      lsn      | checksum | flags | lower | upper | special | pagesize | version | prune_xid
---------------+----------+-------+-------+-------+---------+----------+---------+-----------
 3A4A/4F2192B8 |        0 |     0 |    36 |  1520 |    8192 |     8192 |       4 |         0

ts=# SELECT * FROM page_header(get_raw_page('child.huawei_umts_ucell_201908', 34915));
 3A4A/4F21C268 |        0 |     0 |    36 |  1520 |    8192 |     8192 |       4 |         0

ts=# SELECT * FROM page_header(get_raw_page('child.huawei_umts_ucell_201908', 34916));
 3A4A/4F21DC90 |        0 |     0 |    36 |  1496 |    8192 |     8192 |       4 |         0

> Also heap_page_items() for those pages would be useful - you might want
> to not include the t_data column, as it has raw tuple data.

ts=# SELECT lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid, t_infomask2, t_infomask, t_hoff, t_bits, t_oid FROM heap_page_items(get_raw_page('child.huawei_umts_ucell_201908', 34914));
 lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid
 1 | 5952 | 1 | 2236 | 111659903 | 0 | 0 | (34914,1) | 554 | 2311 | 96 | 11111111111111111111111111111111111111011100111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111001111111111111111111111001111111111111111111111011111111111111111011111111101101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000000000001111111111111111111111111111111111111111111100111111000000 |
 2 | 3736 | 1 | 2212 | 111659903 | 0 | 0 | (34914,2) | 554 | 2311 | 96 | 11111111111111111111111111111111111111011100111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111100000000011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111001111111111111111111111000111111111111111111111011111111111111111010111111101101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000000000001111111111111111111111111111111111111111111100111111000000 |
 3 | 1520 | 1 | 2216 | 111659903 | 0 | 0 | (34914,3) | 554 | 2311 | 96 | 11111111111111111111111111111111111111011100111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111100000001011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111001111111111111111111111000111111111111111111111011111111111111111011111111101101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000000000001111111111111111111111111111111111111111111100111111000000 |

ts=# SELECT lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid, t_infomask2, t_infomask, t_hoff, t_bits, t_oid FROM heap_page_items(get_raw_page('child.huawei_umts_ucell_201908', 34915));
 1 | 5976 | 1 | 2212 | 111659903 | 0 | 0 | (34915,1) | 554 | 2311 | 96 | 11111111111111111111111111111111111111011100111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111100000001011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111001111111111111111111111000111111111111111111111011111111111111111010111111101101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000000000001111111111111111111111111111111111111111111100111111000000 |
 2 | 3744 | 1 | 2232 | 111659903 | 0 | 0 | (34915,2) | 554 | 2311 | 96 | 11111111111111111111111111111111111111011100111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111100001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111001111111111111111111111001111111111111111111111011111111111111111010111111101101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000000000001111111111111111111111111111111111111111111100111111000000 |
 3 | 1520 | 1 | 2220 | 111659903 | 0 | 0 | (34915,3) | 554 | 2311 | 96 | 11111111111111111111111111111111111111011100111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000001011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111001111111111111111111111000111111111111111111111011111111111111111010111111101101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000000000001111111111111111111111111111111111111111111100111111000000 |

ts=# SELECT lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid, t_infomask2, t_infomask, t_hoff, t_bits, t_oid FROM heap_page_items(get_raw_page('child.huawei_umts_ucell_201908', 34916));
 1 | 5952 | 1 | 2236 | 111659903 | 0 | 0 | (34916,1) | 554 | 2311 | 96 | 11111111111111111111111111111111111111011100111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111001111111111111111111111001111111111111111111111011111111111111111011111111101101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000000000001111111111111111111111111111111111111111111100111111000000 |
 2 | 3736 | 1 | 2212 | 111659903 | 0 | 0 | (34916,2) | 554 | 2311 | 96 | 11111111111111111111111111111111111111011100111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111100000001011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111001111111111111111111111000111111111111111111111011111111111111111010111111101101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000000000001111111111111111111111111111111111111111111100111111000000 |
 3 | 1496 | 1 | 2240 | 111659903 | 0 | 0 | (34916,3) | 554 | 2311 | 96 | 11111111111111111111111111111111111111011100111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111001111111111111111111111001111111111111111111111011111111111111111010111111101101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111001100000001111111111111111111111111111111111111111111100111111000000 |

> Could you show
> * pg_controldata

[pryzbyj@database ~]$ sudo -u postgres ./src/postgresql.bin/bin/pg_controldata -D /var/lib/pgsql/11/data
[sudo] password for pryzbyj:
pg_control version number:            1100
Catalog version number:               201809051
Database system identifier:           6616072377454496350
Database cluster state:               in production
pg_control last modified:             Wed 07 Aug 2019 06:32:05 PM MDT
Latest checkpoint location:           3A4C/D1BB5BD8
Latest checkpoint's REDO location:    3A4C/CCC1DF90
Latest checkpoint's REDO WAL file:    0000000100003A4C000000CC
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0:111664949
Latest checkpoint's NextOID:          2069503852
Latest checkpoint's NextMultiXactId:  1846
Latest checkpoint's NextMultiOffset:  3891
Latest checkpoint's oldestXID:        33682470
Latest checkpoint's oldestXID's DB:   16564
Latest checkpoint's oldestActiveXID:  111664945
Latest checkpoint's oldestMultiXid:   1484
Latest checkpoint's oldestMulti's DB: 16564
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint:            Wed 07 Aug 2019 06:31:35 PM MDT
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    replica
wal_log_hints setting:                off
max_connections setting:              200
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           128
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0
Mock authentication nonce:            ff863744b99b849579d561c5117157c78a8ced0563df824f3035f70810fec534

> * SELECT oid, datname, datfrozenxid, datminmxid FROM pg_database;

ts=# SELECT oid, datname, datfrozenxid, datminmxid FROM pg_database;
  oid  |  datname  | datfrozenxid | datminmxid
-------+-----------+--------------+------------
 13529 | template0 |    103564559 |       1846
 16400 | template1 |     52362404 |       1486
 16443 | postgres  |     46060229 |       1485
 16564 | ts        |     33682470 |       1484
(4 rows)

> That should be easy to redact if necessary. It'd be useful to know which
> database the corrupted table is in.
>
> It'd also be helpful to get something roughly like
>   SELECT oid, oid::regclass, relkind, reltoastrelid, relfrozenxid, age(relfrozenxid), relminmxid, mxid_age(relminmxid), relpages, (SELECT txid_current()) FROM pg_class WHERE relfrozenxid <> 0;
>
> But I'm not sure if including the tablenames would be problematic for
> your case. Possibly a bit hard to infer "usage" correlations without
> them, but probably still worthwhile.

I'll send it to you individually; I don't think it's secret, but it's 60k lines long...

Its OID verifies to me that the table was created on 8/1, and not 7/1, which was possible. The week-old backup also verifies that.

Thanks,
Justin