Thread: pg14b2: FailedAssertion("_bt_posting_valid(nposting)", File: "nbtdedup.c", ...
pg14b2: FailedAssertion("_bt_posting_valid(nposting)", File: "nbtdedup.c", ...
From
Justin Pryzby
Date:
This is crashing repeatedly during insert/update immediately after upgrading an instance to v14, from v13.3. In case it matters, the cluster was originally initdb at 13.2. TRAP: FailedAssertion("_bt_posting_valid(nposting)", File: "nbtdedup.c", Line: 1062, PID: 28580) postgres: telsasoft ts 127.0.0.1(52250) INSERT(ExceptionalCondition+0x8d)[0x967d1d] postgres: telsasoft ts 127.0.0.1(52250) INSERT(_bt_swap_posting+0x2cd)[0x507cdd] postgres: telsasoft ts 127.0.0.1(52250) INSERT[0x509a14] postgres: telsasoft ts 127.0.0.1(52250) INSERT(_bt_doinsert+0xcb7)[0x50d0b7] postgres: telsasoft ts 127.0.0.1(52250) INSERT(btinsert+0x52)[0x5130f2] postgres: telsasoft ts 127.0.0.1(52250) INSERT(ExecInsertIndexTuples+0x231)[0x687b81] postgres: telsasoft ts 127.0.0.1(52250) INSERT[0x6b8718] postgres: telsasoft ts 127.0.0.1(52250) INSERT[0x6b9297] postgres: telsasoft ts 127.0.0.1(52250) INSERT(standard_ExecutorRun+0x142)[0x688b32] postgres: telsasoft ts 127.0.0.1(52250) INSERT[0x82da8a] postgres: telsasoft ts 127.0.0.1(52250) INSERT[0x82e673] postgres: telsasoft ts 127.0.0.1(52250) INSERT[0x82e936] postgres: telsasoft ts 127.0.0.1(52250) INSERT(PortalRun+0x2eb)[0x82ec8b] postgres: telsasoft ts 127.0.0.1(52250) INSERT(PostgresMain+0x1f97)[0x82c777] postgres: telsasoft ts 127.0.0.1(52250) INSERT[0x48f71a] postgres: telsasoft ts 127.0.0.1(52250) INSERT(PostmasterMain+0x1138)[0x794c98] postgres: telsasoft ts 127.0.0.1(52250) INSERT(main+0x6f2)[0x491292] < 2021-06-27 23:46:43.257 CAT >DETAIL: Failed process was running: INSERT INTO alarms(... #3 0x0000000000507cdd in _bt_swap_posting (newitem=newitem@entry=0x2011c00, oposting=oposting@entry=0x7f6de58e2a78, postingoff=postingoff@entry=62)at nbtdedup.c:1062 nhtids = <optimized out> replacepos = 0x2011dac "" nposting = 0x2011c28 __func__ = "_bt_swap_posting" #4 0x0000000000509a14 in _bt_insertonpg (rel=rel@entry=0x7f6dfd3cd628, itup_key=itup_key@entry=0x2011b40, buf=15, cbuf=cbuf@entry=0,stack=stack@entry=0x2011bd8, itup=0x2011c00, itup@entry=0x200d608, itemsz=16, newitemoff=2, postingoff=62, split_only_page=split_only_page@entry=false) at nbtinsert.c:1174 itemid = 0x7f6de58e0e1c page = 0x7f6de58e0e00 "\200" opaque = 0x7f6de58e2df0 isleaf = true isroot = false isrightmost = false isonly = false oposting = 0x7f6de58e2a78 origitup = <optimized out> nposting = 0x0 __func__ = "_bt_insertonpg" #5 0x000000000050d0b7 in _bt_doinsert (rel=rel@entry=0x7f6dfd3cd628, itup=itup@entry=0x200d608, checkUnique=checkUnique@entry=UNIQUE_CHECK_NO,indexUnchanged=indexUnchanged@entry=false, heapRel=heapRel@entry=0x7f6dfd48ba80) at nbtinsert.c:257 newitemoff = 0 is_unique = false insertstate = {itup = 0x200d608, itemsz = 16, itup_key = 0x2011b40, buf = 15, bounds_valid = true, low = 2, stricthigh= 3, postingoff = 62} itup_key = <optimized out> checkingunique = <optimized out> #6 0x00000000005130f2 in btinsert (rel=0x7f6dfd3cd628, values=<optimized out>, isnull=<optimized out>, ht_ctid=0x212e250,heapRel=0x7f6dfd48ba80, checkUnique=UNIQUE_CHECK_NO, indexUnchanged=false, indexInfo=0x200d2e8) at nbtree.c:199 result = <optimized out> itup = 0x200d608 #7 0x0000000000687b81 in ExecInsertIndexTuples (resultRelInfo=resultRelInfo@entry=0x200cd90, slot=slot@entry=0x212e220,estate=estate@entry=0x212c6b0, update=update@entry=false, noDupErr=noDupErr@entry=false, specConflict=specConflict@entry=0x0, arbiterIndexes=arbiterIndexes@entry=0x0) at execIndexing.c:415 (gdb) p *newitem $2 = {t_tid = {ip_blkid = {bi_hi = 0, bi_lo = 22}, ip_posid = 4}, t_info = 32784} (gdb) p *oposting $3 = {t_tid = {ip_blkid = {bi_hi = 0, bi_lo = 16}, ip_posid = 8333}, t_info = 41824} I will save a copy of the data dir and see if reindexing helps. Let me know if there's anything else I can provide. -- Justin
Re: pg14b2: FailedAssertion("_bt_posting_valid(nposting)", File: "nbtdedup.c", ...
From
Peter Geoghegan
Date:
Can you please amcheck all of the indexes?
Peter Geoghegan
(Sent from my phone)
(Sent from my phone)
Re: pg14b2: FailedAssertion("_bt_posting_valid(nposting)", File: "nbtdedup.c", ...
From
Peter Geoghegan
Date:
Would also be good to get a raw binary copy of the page image in question. Hopefully the data isn't confidential. Same gdb procedure as before.
Thanks
Peter Geoghegan
(Sent from my phone)
(Sent from my phone)
Re: pg14b2: FailedAssertion("_bt_posting_valid(nposting)", File: "nbtdedup.c", ...
From
Peter Geoghegan
Date:
Did you also change the kernel on upgrade? I recall that that was a factor on the other recent bug thread.
Peter Geoghegan
(Sent from my phone)
(Sent from my phone)
Re: pg14b2: FailedAssertion("_bt_posting_valid(nposting)", File: "nbtdedup.c", ...
From
Justin Pryzby
Date:
On Sun, Jun 27, 2021 at 03:08:13PM -0700, Peter Geoghegan wrote: > Can you please amcheck all of the indexes? ts=# SELECT bt_index_check('child.alarms_null_alarm_clear_time_idx'::regclass); ERROR: item order invariant violated for index "alarms_null_alarm_clear_time_idx" DETAIL: Lower index tid=(1,77) (points to heap tid=(29,9)) higher index tid=(1,78) (points to heap tid=(29,9)) page lsn=80/4B9C69D0. ts=# SELECT itemoffset, ctid ,itemlen, nulls, vars, dead, htid FROM bt_page_items('child.alarms_null_alarm_clear_time_idx',1); itemoffset | ctid | itemlen | nulls | vars | dead | htid ------------+-----------+---------+-------+------+------+--------- ... 77 | (29,9) | 16 | t | f | f | (29,9) 78 | (29,9) | 16 | t | f | f | (29,9) ts=# SELECT lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid, t_infomask2, t_infomask, t_hoff, t_bits, t_oidFROM heap_page_items(get_raw_page('child.alarms_null', 29)); lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits | t_oid ----+--------+----------+--------+--------+----------+----------+---------+-------------+------------+--------+------------------------------------------+------- 1 | 6680 | 1 | 1512 | 88669 | 27455486 | 44 | (29,1) | 8225 | 10691 | 32 | 1100001111111111111111101111111110000000| 2 | 6 | 2 | 0 | | | | | | | | | 3 | 5168 | 1 | 1512 | 87374 | 27455479 | 37 | (29,3) | 8225 | 10691 | 32 | 1100001111111111111111101111111110000000| 4 | 4192 | 1 | 976 | 148104 | 27574887 | 0 | (29,4) | 8225 | 10695 | 32 | 1100001111111111111111101111111110000000| 5 | 10 | 2 | 0 | | | | | | | | | 6 | 3216 | 1 | 976 | 148137 | 27574888 | 0 | (29,6) | 40993 | 10695 | 32 | 1100001111111111111111101111111110000000| 7 | 8 | 2 | 0 | | | | | | | | | 8 | 2240 | 1 | 976 | 47388 | 27574858 | 7 | (29,8) | 40993 | 10695 | 32 | 1100001111111111111111101111111110000000| 9 | 0 | 3 | 0 | | | | | | | | | 10 | 1264 | 1 | 976 | 148935 | 27574889 | 0 | (29,10) | 40993 | 10695 | 32 | 1100001111111111111111101111111110000000| 11 | 0 | 3 | 0 | | | | | | | | | 12 | 0 | 3 | 0 | | | | | | | | | (12 rows) (gdb) fr 4 #4 0x0000000000509a14 in _bt_insertonpg (rel=rel@entry=0x7f6dfd3cd628, itup_key=itup_key@entry=0x2011b40, buf=15, cbuf=cbuf@entry=0,stack=stack@entry=0x2011bd8, itup=0x2011c00, itup@entry=0x200d608, itemsz=16, newitemoff=2, postingoff=62, split_only_page=split_only_page@entry=false) at nbtinsert.c:1174 1174 in nbtinsert.c (gdb) p page $5 = 0x7f6de58e0e00 "\200" (gdb) dump binary memory /tmp/dump_block.page page (page + 8192) ts=# SELECT lp, lp_off, lp_flags, lp_len, t_xmin, t_xmax, t_field3, t_ctid, t_infomask2, t_infomask, t_hoff, t_bits, t_oidFROM heap_page_items(pg_read_binary_file('/tmp/dump_block.page')) WHERE t_xmin IS NOT NULL; lp | lp_off | lp_flags | lp_len | t_xmin | t_xmax | t_field3 | t_ctid | t_infomask2 | t_infomask | t_hoff | t_bits| t_oid -----+--------+----------+--------+---------+------------+----------+--------+-------------+------------+--------+--------+------- 1 | 8152 | 1 | 24 | 1048576 | 2685931521 | 0 | (0,0) | 0 | 120 | 1 | | 2 | 7288 | 1 | 864 | 1048576 | 2740985997 | 0 | (0,0) | 0 | 1 | 0 | | 67 | 6368 | 1 | 920 | 1048576 | 2744656022 | 0 | (0,0) | 33 | 4 | 0 | | 137 | 5056 | 1 | 1312 | 1048576 | 2770346200 | 0 | (0,0) | 69 | 6 | 0 | | 142 | 4608 | 1 | 448 | 1048576 | 2713722952 | 0 | (0,0) | 107 | 4 | 0 | | (5 rows) I didn't change the kernel here, nor on the previous bug report - it was going to be my "next step", until I found the stuck autovacuum, and I mentioned it for context, but probably just confused things. -- Justin
Re: pg14b2: FailedAssertion("_bt_posting_valid(nposting)", File: "nbtdedup.c", ...
From
Justin Pryzby
Date:
I've just realized that this VM has a strange storage configuration. It's using LVM thin pools, which I don't use anywhere else. Someone else set this up, and I think I've literally never used pools before. Some time ago, the pool ran out of space, and I ran LVM repair on it. It seems very possible that's the issue. A latent problem might've been tickled by pg_upgrade --link. That said, the relevant table is the active "alarms" table, and it would've gotten plenty of DML with no issue for months running v13. Feel free to dismiss this report if it seems dubious. -- Justin
Re: pg14b2: FailedAssertion("_bt_posting_valid(nposting)", File: "nbtdedup.c", ...
From
Peter Geoghegan
Date:
On Sun, Jun 27, 2021 at 6:47 PM Justin Pryzby <pryzby@telsasoft.com> wrote: > I've just realized that this VM has a strange storage configuration. > > It's using LVM thin pools, which I don't use anywhere else. > Someone else set this up, and I think I've literally never used pools before. > Some time ago, the pool ran out of space, and I ran LVM repair on it. > It seems very possible that's the issue. > A latent problem might've been tickled by pg_upgrade --link. Anything is possible, of course, but even if this is a bug in Postgres it isn't particularly likely to be a bug in the nbtree code. We see clear signs of general corruption here, which is apparently not limited to the one page that you supplied to me privately -- since AFAICT that's not the page that amcheck throws the error on. > That said, the relevant table is the active "alarms" table, and it would've > gotten plenty of DML with no issue for months running v13. It might not have been visibly broken without assertions enabled, though. I sprinkled nbtdedup.c with these _bt_posting_valid() assertions just because it was easy. The assertions were bound to catch some problem sooner or later, and had acceptable overhead. -- Peter Geoghegan
Re: pg14b2: FailedAssertion("_bt_posting_valid(nposting)", File: "nbtdedup.c", ...
From
Peter Geoghegan
Date:
On Sun, Jun 27, 2021 at 11:08 PM Peter Geoghegan <pg@bowt.ie> wrote: > > That said, the relevant table is the active "alarms" table, and it would've > > gotten plenty of DML with no issue for months running v13. > > It might not have been visibly broken without assertions enabled, > though. I sprinkled nbtdedup.c with these _bt_posting_valid() > assertions just because it was easy. The assertions were bound to > catch some problem sooner or later, and had acceptable overhead. Obviously nothing stops you from running amcheck on the original database that you're running in production. You won't need to have enabled assertions to catch the same problem that way. This seems like the best way to isolate the problem. I strongly suspect that it's the LVM issue for my own reasons: nothing changed during the Postgres 14 cycle that seems truly related. The index deletion stuff (commit d168b666823) might seem like an obvious possible culprit, but I consider it unlikely. I added many defensive assertions to that code too: _bt_bottomupdel_pass() also uses exactly the same kind of _bt_posting_valid() assertions directly. Plus _bt_delitems_delete_check() is highly defensive with assertions when it processes a posting list tuple. If there was a problem with any of that code it seems very likely that those assertions would have failed first. -- Peter Geoghegan
Re: pg14b2: FailedAssertion("_bt_posting_valid(nposting)", File: "nbtdedup.c", ...
From
Justin Pryzby
Date:
On Mon, Jun 28, 2021 at 01:42:25PM -0700, Peter Geoghegan wrote: > On Sun, Jun 27, 2021 at 11:08 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > That said, the relevant table is the active "alarms" table, and it would've > > > gotten plenty of DML with no issue for months running v13. > > > > It might not have been visibly broken without assertions enabled, > > though. I sprinkled nbtdedup.c with these _bt_posting_valid() > > assertions just because it was easy. The assertions were bound to > > catch some problem sooner or later, and had acceptable overhead. > > Obviously nothing stops you from running amcheck on the original > database that you're running in production. You won't need to have > enabled assertions to catch the same problem that way. This seems like > the best way to isolate the problem. I strongly suspect that it's the > LVM issue for my own reasons: nothing changed during the Postgres 14 > cycle that seems truly related. Sorry, but I didn't save the pre-upgrade cluster (just pg_dump). For now, I moved the table out of the way and re-created it. I could send you the whole relnode if you wanted to look more.. It seemed like almost any insert on the table caused it to crash. BTW, on a copy of the v14 cluster, both vacuum and reindex also resolved the issue (at least enough to avoid the crash). -- Justin