Re: Bug in amcheck? - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Bug in amcheck?
Msg-id 8ce7937f-fd4d-44f6-8393-2658f0f11022@iki.fi
In response to Re: Bug in amcheck?  (Mihail Nikalayeu <mihailnikalayeu@gmail.com>)
Responses Re: Bug in amcheck?
List pgsql-hackers
On 19/11/2025 00:19, Mihail Nikalayeu wrote:
> Hello!
> 
>> Originally I investigated the customer's problem with PG16. And have
>> reproduced it for pg16. I checked that relevant amcheck code was not
>> changed since pg16, so I thought that the problem takes place for all
>> Postgres versions. But looks like it is not true.
> 
> I think it is still broken, but with less probability.
> Have you tried injection points on v16? Such a test case will make
> things much more clear.

Konstantin's original repro involved autovacuum and concurrent sessions. 
I was confused by that, because bt_index_parent_check() holds a 
ShareLock, which prevents it from running concurrently with vacuum. But 
this isn't a race condition as such; the issue arises whenever there is
a half-dead page in the index. To end up with a half-dead page, you need 
to gracefully cancel/interrupt autovacuum while it's deleting a page. 
The repro's way of canceling autovacuum was very complicated. I didn't 
fully understand it, but I think the concurrent dropping/creating of 
tables would sometimes cause autovacuum to be canceled.

Here's a much more straightforward repro. Apply this little patch:

diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 30b43a4dd18..c132fc90277 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2353,6 +2353,12 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
       * Check here, as calling loops will have locks held, preventing
       * interrupts from being processed.
       */
+    if (random() < INT32_MAX / 2)
+    {
+        elog(ERROR, "aborting page deletion");
+    }
+    else
+        elog(NOTICE, "unlinking halfdead page %u %u", BufferGetBlockNumber(leafbuf), scanblkno);
      CHECK_FOR_INTERRUPTS();

      /* Unlink the current top parent of the subtree */
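
A note on why the VACUUM in the repro may need to be retried: the injected check compares random() against INT32_MAX / 2, and random() returns a value in [0, 2^31 - 1], so each call to _bt_unlink_halfdead_page() should abort roughly half of the time. The standalone sketch below is not PostgreSQL code; it only reuses the same condition to estimate that fraction. The names would_abort() and abort_fraction() are hypothetical helpers made up for this illustration.

```c
#define _DEFAULT_SOURCE         /* expose random()/srandom() on glibc */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Same condition as the injected check: nonzero means "abort page deletion". */
static int
would_abort(void)
{
    return random() < INT32_MAX / 2;
}

/*
 * Hypothetical helper: seed the generator, evaluate the condition
 * `trials` times, and return the fraction of calls that would abort.
 */
static double
abort_fraction(unsigned int seed, int trials)
{
    int         aborted = 0;

    assert(trials > 0);
    srandom(seed);
    for (int i = 0; i < trials; i++)
        aborted += would_abort();
    return (double) aborted / trials;
}
```

Calling abort_fraction(1, 100000) should give a value close to 0.5. So if VACUUM happens to print the NOTICE instead of the ERROR, the page was unlinked normally and the delete-and-vacuum step would presumably have to be repeated to leave a half-dead page behind.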

Then run this:

postgres=# CREATE TABLE amchecktest (id int4);
CREATE TABLE
postgres=# INSERT INTO amchecktest SELECT g from generate_series(1, 1000000) g;
INSERT 0 1000000
postgres=# CREATE INDEX on amchecktest (id);
CREATE INDEX
postgres=# DELETE FROM amchecktest WHERE id > 100000 AND id < 120000;
DELETE 19999
postgres=# -- this will hit the error added by the patch
VACUUM amchecktest;
ERROR:  aborting page deletion
CONTEXT:  while vacuuming index "amchecktest_id_idx" of relation "public.amchecktest"
postgres=# select bt_index_parent_check('amchecktest_id_idx');
ERROR:  mismatch between parent key and child high key in index "amchecktest_id_idx"
DETAIL:  Target block=3 child block=276 target page lsn=0/6ED0DB68.


To fix this, I guess we need to teach bt_index_parent_check() about 
half-dead pages. Any volunteers to write that patch?

- Heikki



