Thread: ERROR: posting list tuple with 20 items cannot be split at offset 168
Hi,
I upgraded postgres 13 to 14+231.pgdg18.04+1 during the weekend, virtual ubuntu 20.04.3.
this morning a user reported an error when trying to change a flag in our web app (Phoenix, ecto).
I traced it back to an index issue on a boolean column.
This is the table:
tranman_production=# \d freights
Table "public.freights"
Column | Type | Collation | Nullable | Default
--------------+-----------------------------+-----------+----------+--------------------------------------
id | integer | | not null | nextval('freights_id_seq'::regclass)
order_id | integer | | |
product_id | integer | | |
amount | integer | | |
reference_1 | character varying | | |
reference_2 | character varying | | |
reference_3 | character varying | | |
reference_4 | character varying | | |
status | integer | | | 0
created_at | timestamp without time zone | | not null |
updated_at | timestamp without time zone | | not null |
cmr | bigint | | |
amount_out | integer | | not null | 0
paused | boolean | | | false
cmr_received | boolean | | | false
Indexes:
"freights_pkey" PRIMARY KEY, btree (id)
"index_freights_on_cmr" btree (cmr)
"index_freights_on_cmr_received" btree (cmr_received)
"index_freights_on_order_id" btree (order_id)
"index_freights_on_product_id" btree (product_id)
"index_freights_on_reference_1" btree (reference_1)
"index_freights_on_reference_2" btree (reference_2)
"index_freights_on_reference_3" btree (reference_3)
"index_freights_on_reference_4" btree (reference_4)
"index_freights_on_status" btree (status)
Foreign-key constraints:
"fk_rails_21fdf332ee" FOREIGN KEY (product_id) REFERENCES products(id)
"fk_rails_af9f7b0831" FOREIGN KEY (order_id) REFERENCES orders(id)
Referenced by:
TABLE "freight_docs" CONSTRAINT "fk_rails_10448cfa4f" FOREIGN KEY (freight_id) REFERENCES freights(id)
TABLE "freight_events" CONSTRAINT "fk_rails_20c83ae774" FOREIGN KEY (freight_id) REFERENCES freights(id)
TABLE "rides" CONSTRAINT "fk_rails_a50483d9c3" FOREIGN KEY (freight_id) REFERENCES freights(id)
And this is what happens:
tranman_production=# update freights set cmr_received=false where id=49632;
ERROR: XX000: posting list tuple with 20 items cannot be split at offset 168
LOCATION: _bt_swap_posting, nbtdedup.c:1037
If I drop the index index_freights_on_cmr_received, then the update succeeds.
If I recreate the index, the error resurfaces.
Note that it does not happen for every row in the database.
Sincerely,
Herman verschooten
> On Mon, Oct 25, 2021 at 11:58:48AM +0200, Herman verschooten wrote: > Hi, > > I upgraded postgres 13 to 14+231.pgdg18.04+1 during the weekend, virtual ubuntu 20.04.3. > > this morning a user reported an error when trying to change a flag in our web app (Phoenix, ecto). > I traced it back to an index issue on a boolean column. Hi, thanks for reporting the issue. > tranman_production=# update freights set cmr_received=false where id=49632; > ERROR: XX000: posting list tuple with 20 items cannot be split at offset 168 > LOCATION: _bt_swap_posting, nbtdedup.c:1037 > > If I drop the index index_freights_on_cmr_received, then the update succeeds. > > If I recreate the index, the error resurfaces. > > Note that it does not happen for every row in the database. Interesting, during posting list split the postingoff is bigger than the number of elements in the posting list. Have you tried to run amcheck [1] on this index? As in the neighbour thread probably heapallindexed is important, as the index recreation doesn't help. Otherwise, is there a chance to use pageinspect [2] to take a look at the structure if the index? [1]: https://www.postgresql.org/docs/13/amcheck.html [2]: https://www.postgresql.org/docs/13/pageinspect.html
> On Mon, Oct 25, 2021 at 02:41:43PM +0200, Herman verschooten wrote: > I made a mistake apparently on checking the wrong thing. > someone on slack helped me. > > tranman_production=# SELECT bt_index_check(c.oid), c.relname, c.relpages > tranman_production-# FROM pg_index i > tranman_production-# JOIN pg_opclass op ON i.indclass[0] = op.oid > tranman_production-# JOIN pg_am am ON op.opcmethod = am.oid > tranman_production-# JOIN pg_class c ON i.indexrelid = c.oid > tranman_production-# JOIN pg_namespace n ON c.relnamespace = n.oid > tranman_production-# WHERE am.amname = 'btree' AND n.nspname = 'public' > tranman_production-# -- Don't check temp tables, which may be from another session: > tranman_production-# AND c.relpersistence != 't' > tranman_production-# -- Function may throw an error when this is omitted: > tranman_production-# AND i.indisready AND i.indisvalid > tranman_production-# ORDER BY c.relpages; > ERROR: item order invariant violated for index "index_products_on_group" > DETAIL: Lower index tid=(1,353) (points to heap tid=(19,8)) higher index tid=(1,354) (points to heap tid=(14,14)) pagelsn=5/CEC25DA8. > tranman_production=# SELECT bt_index_check(c.oid), c.relname, c.relpages > FROM pg_index i > JOIN pg_opclass op ON i.indclass[0] = op.oid > JOIN pg_am am ON op.opcmethod = am.oid > JOIN pg_class c ON i.indexrelid = c.oid > JOIN pg_namespace n ON c.relnamespace = n.oid > WHERE am.amname = 'btree' AND n.nspname = 'public' > -- Don't check temp tables, which may be from another session: > AND c.relpersistence != 't' > -- Function may throw an error when this is omitted: > AND i.indisready AND i.indisvalid > ORDER BY c.relpages; > ERROR: item order invariant violated for index "index_products_on_group" > DETAIL: Lower index tid=(1,353) (points to heap tid=(19,8)) higher index tid=(1,354) (points to heap tid=(14,14)) pagelsn=5/CEC25DA8. > > I dropped that index and ran it again, and kept on dropping indexes > > ERROR: item order invariant violated for index "index_cleanings_on_receipt" > DETAIL: Lower index tid=(39,21) (points to heap tid=(238,98)) higher index tid=(39,22) (points to heap tid=(243,7)) pagelsn=1/B853F98. > > ERROR: item order invariant violated for index "index_freights_on_reference_4" > DETAIL: Lower index tid=(34,51) (points to heap tid=(67,7)) higher index tid=(34,52) (points to heap tid=(84,2)) pagelsn=6/7EFB68. > > ERROR: item order invariant violated for index "index_freights_on_reference_3" > DETAIL: Lower index tid=(38,82) (points to heap tid=(362,75)) higher index tid=(38,83) (points to heap tid=(194,21)) pagelsn=4/FEA79090. > > ERROR: item order invariant violated for index "index_freights_on_reference_2" > DETAIL: Lower index tid=(10,44) (points to heap tid=(176,26)) higher index tid=(10,45) (points to heap tid=(215,60)) pagelsn=5/46A4D90. > > ERROR: item order invariant violated for index "index_freights_on_reference_1" > DETAIL: Lower index tid=(43,214) (points to heap tid=(112,46)) higher index tid=(43,215) (points to heap tid=(112,43))page lsn=5/C3999990. > > > Now no more errors are displayed, and I can run my initial update again, with the index created. > > I recreated the 6 indexes and everyhting still works. > > Herman Does it mean that everything is fine after recreating index_freights_on_reference_<d> indexes? Well, that would mean they were somehow corrupted, and reindexing is the way to go. Although it's not clear for me, why the problem was temporarily avoided by dropping another index on cmr_received. PS: One small tip, please reply to all in the mailing list, otherwise it would be harder for others to understand what happened with the discussion.
Re: ERROR: posting list tuple with 20 items cannot be split at offset 168
From
Peter Geoghegan
Date:
On Mon, Oct 25, 2021 at 2:59 AM Herman verschooten <Herman@verschooten.net> wrote: > tranman_production=# update freights set cmr_received=false where id=49632; > ERROR: XX000: posting list tuple with 20 items cannot be split at offset 168 > LOCATION: _bt_swap_posting, nbtdedup.c:1037 > > If I drop the index index_freights_on_cmr_received, then the update succeeds. What you see here is a defensive "can't happen" error that I added in commit 8f72bbac, and backpatched to Postgres 13.4, which came out on 2021-08-12. The goal of that error is to make a possible hard crash due to corruption into a slightly friendlier kind of failure (the error that you see here). Were you running 13.4 before the upgrade? If you were on 13.3 or earlier before the upgrade to 14, then it's possible that the problem was there all along, but is only now visible for the first time. -- Peter Geoghegan
Re: ERROR: posting list tuple with 20 items cannot be split at offset 168
From
Herman verschooten
Date:
Hi Peter, Thanks for the info. I don’t recall the exact 13 version we were on. I can confirm that after dropping and recreating the 6 indexes everything is working fine. Thanks for all the help, both here and on slack, Herman > Op 25 okt. 2021, om 17:29 heeft Peter Geoghegan <pg@bowt.ie> het volgende geschreven: > > On Mon, Oct 25, 2021 at 2:59 AM Herman verschooten > <Herman@verschooten.net> wrote: >> tranman_production=# update freights set cmr_received=false where id=49632; >> ERROR: XX000: posting list tuple with 20 items cannot be split at offset 168 >> LOCATION: _bt_swap_posting, nbtdedup.c:1037 >> >> If I drop the index index_freights_on_cmr_received, then the update succeeds. > > What you see here is a defensive "can't happen" error that I added in > commit 8f72bbac, and backpatched to Postgres 13.4, which came out on > 2021-08-12. The goal of that error is to make a possible hard crash > due to corruption into a slightly friendlier kind of failure (the > error that you see here). Were you running 13.4 before the upgrade? > > If you were on 13.3 or earlier before the upgrade to 14, then it's > possible that the problem was there all along, but is only now visible > for the first time. > > -- > Peter Geoghegan
Re: ERROR: posting list tuple with 20 items cannot be split at offset 168
From
Peter Geoghegan
Date:
On Mon, Oct 25, 2021 at 10:35 PM Herman verschooten <Herman@verschooten.net> wrote: > I can confirm that after dropping and recreating the 6 indexes everything is working fine. Did you also run VACUUM? You said that not all rows seem to be affected. Is there any pattern that you notice about the rows that are affected? I understand (from your remarks onSlack) that you found that the problem seemed to go away once you dropped the index named index_freights_on_cmr_received. But once you recreated the index from scratch, the problem came back. In other words, rebuilding the index didn't seem to make the "posting list tuple with 20 items cannot be split at offset 168" error go away, but dropping the index appeared to *temporarily* resolve the issue? Suggesting that the problem was either in the table, or in the code that builds a new index structure by scanning the table? > Thanks for all the help, both here and on slack, Thanks for the report! -- Peter Geoghegan
Re: ERROR: posting list tuple with 20 items cannot be split at offset 168
From
Herman verschooten
Date:
Peter, No we just noticed that we were unable to update that 1 record, and only that Boolean field. We did not search for otherrecords that might have been affected. I did as yet not run VACUUM, I’ll look into that later today. Herman Verstuurd vanaf mijn iPad > Op 27 okt. 2021 om 03:45 heeft Peter Geoghegan <pg@bowt.ie> het volgende geschreven: > > On Mon, Oct 25, 2021 at 10:35 PM Herman verschooten > <Herman@verschooten.net> wrote: >> I can confirm that after dropping and recreating the 6 indexes everything is working fine. > > Did you also run VACUUM? > > You said that not all rows seem to be affected. Is there any pattern > that you notice about the rows that are affected? > > I understand (from your remarks onSlack) that you found that the > problem seemed to go away once you dropped the index named > index_freights_on_cmr_received. But once you recreated the index from > scratch, the problem came back. > > In other words, rebuilding the index didn't seem to make the "posting > list tuple with 20 items cannot be split at offset 168" error go away, > but dropping the index appeared to *temporarily* resolve the issue? > Suggesting that the problem was either in the table, or in the code > that builds a new index structure by scanning the table? > >> Thanks for all the help, both here and on slack, > > Thanks for the report! > > -- > Peter Geoghegan
Hi, On 2021-10-25 11:58:48 +0200, Herman verschooten wrote: > I upgraded postgres 13 to 14+231.pgdg18.04+1 during the weekend, virtual ubuntu 20.04.3. What filesystem is the postgres data stored on? > this morning a user reported an error when trying to change a flag in our web app (Phoenix, ecto). > I traced it back to an index issue on a boolean column. > > If I drop the index index_freights_on_cmr_received, then the update succeeds. > > If I recreate the index, the error resurfaces. Any chance you still have the WAL around from since the update? If we could scan the WAL to see what type of activity lead to the corruption, we might be able to narrow the cause down more easily. Greetings, Andres Freund
Re: ERROR: posting list tuple with 20 items cannot be split at offset 168
From
Herman verschooten
Date:
Hi, The filesystem is EXT4 on a Proxmox container using LVM-thin. I have to check, but I believe there may still be a complete backup of the instance. Herman > Op 28 okt. 2021, om 23:47 heeft Andres Freund <andres@anarazel.de> het volgende geschreven: > > Hi, > > On 2021-10-25 11:58:48 +0200, Herman verschooten wrote: >> I upgraded postgres 13 to 14+231.pgdg18.04+1 during the weekend, virtual ubuntu 20.04.3. > > What filesystem is the postgres data stored on? > > >> this morning a user reported an error when trying to change a flag in our web app (Phoenix, ecto). >> I traced it back to an index issue on a boolean column. >> >> If I drop the index index_freights_on_cmr_received, then the update succeeds. >> >> If I recreate the index, the error resurfaces. > > Any chance you still have the WAL around from since the update? If we could > scan the WAL to see what type of activity lead to the corruption, we might be > able to narrow the cause down more easily. > > Greetings, > > Andres Freund
Re: ERROR: posting list tuple with 20 items cannot be split at offset 168
From
Peter Geoghegan
Date:
On Fri, Oct 29, 2021 at 1:47 AM Herman verschooten <Herman@verschooten.net> wrote: > The filesystem is EXT4 on a Proxmox container using LVM-thin. > > I have to check, but I believe there may still be a complete backup of the instance. Thanks for getting back about this. A fix for this bug will appear in Postgres 14.1. In the meantime I suggest that you avoid using parallel VACUUM. For example, any scripts that run VACUUM could be temporarily changed to make sure that parallelism won't be used. If the script says "vacuum my_table;", then that could be changed to "vacuum (parallel 0) my_table;", for example. Just until you can upgrade to the next point release. -- Peter Geoghegan