Re: Bug in amcheck? - Mailing list pgsql-hackers
| From | Heikki Linnakangas |
|---|---|
| Subject | Re: Bug in amcheck? |
| Date | |
| Msg-id | 8ce7937f-fd4d-44f6-8393-2658f0f11022@iki.fi |
| In response to | Re: Bug in amcheck? (Mihail Nikalayeu <mihailnikalayeu@gmail.com>) |
| Responses | Re: Bug in amcheck? |
| List | pgsql-hackers |
On 19/11/2025 00:19, Mihail Nikalayeu wrote:
> Hello!
>
>> Originally I investigated the customer's problem with PG16. And have
>> reproduced it for pg16,. I checked that relevant amcheck code was not
>> changed since pg16, so I thought that the problem takes place for all
>> Postgres versions. But looks like it is not true.
>
> I think it is still broken, but with less probability.
> Have you tried injection points on v16? Such a test case will make
> things much more clear.
Konstantin's original repro involved autovacuum and concurrent sessions.
I was confused by that, because bt_index_parent_check() holds a
ShareLock, which prevents it from running concurrently with vacuum. But
this isn't a race condition as such; the issue arises whenever there is
a half-dead page in the index. To end up with a half-dead page, you need
to gracefully cancel/interrupt autovacuum while it's deleting a page.
The repro's way of canceling autovacuum was very complicated. I didn't
fully understand it, but I think the concurrent dropping/creating of
tables would sometimes cause autovacuum to be canceled.
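To illustrate how an interrupted deletion leaves a half-dead page behind, here is a toy model of two-stage page deletion as I understand it from the nbtree README: stage 1 removes the parent's downlink and marks the leaf half-dead, stage 2 unlinks it from its siblings. This is only a sketch; ToyPage, PageState and the toy_* functions are hypothetical names for illustration and are not PostgreSQL code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy page states; illustrative only, not PostgreSQL's page flags. */
typedef enum { PAGE_LIVE, PAGE_HALF_DEAD, PAGE_DELETED } PageState;

/* Hypothetical in-memory stand-in for a leaf page. */
typedef struct ToyPage
{
    PageState       state;
    bool            has_downlink;   /* does the parent still point here? */
    struct ToyPage *left;           /* left sibling, or NULL */
    struct ToyPage *right;          /* right sibling, or NULL */
} ToyPage;

/* Stage 1: drop the parent's downlink and mark the leaf half-dead. */
void
toy_mark_half_dead(ToyPage *page)
{
    assert(page != NULL);
    page->has_downlink = false;
    page->state = PAGE_HALF_DEAD;
}

/* Stage 2: unlink the half-dead leaf from its siblings. */
void
toy_unlink(ToyPage *page)
{
    assert(page != NULL);
    if (page->left != NULL)
        page->left->right = page->right;
    if (page->right != NULL)
        page->right->left = page->left;
    page->state = PAGE_DELETED;
}

/*
 * Run both stages. If interrupt_between_stages is true, stop after stage 1,
 * modelling a cancel or ERROR raised before the unlink step. Returns true
 * only if the deletion ran to completion.
 */
bool
toy_delete_page(ToyPage *page, bool interrupt_between_stages)
{
    toy_mark_half_dead(page);
    if (interrupt_between_stages)
        return false;   /* page stays half-dead, still in the sibling chain */
    toy_unlink(page);
    return true;
}
```

Calling toy_delete_page(&page, true) leaves the page with no downlink but still reachable through its siblings' links, which is the state the checker has to cope with. A later toy_unlink() finishes the job.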
Here's a much more straightforward repro. Apply this little patch:
diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index 30b43a4dd18..c132fc90277 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -2353,6 +2353,12 @@ _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, BlockNumber scanblkno,
 	 * Check here, as calling loops will have locks held, preventing
 	 * interrupts from being processed.
 	 */
+	if (random() < INT32_MAX / 2)
+	{
+		elog(ERROR, "aborting page deletion");
+	}
+	else
+		elog(NOTICE, "unlinking halfdead page %u %u", BufferGetBlockNumber(leafbuf), scanblkno);
 	CHECK_FOR_INTERRUPTS();
 
 	/* Unlink the current top parent of the subtree */
Then run this:
postgres=# CREATE TABLE amchecktest (id int4);
CREATE TABLE
postgres=# INSERT INTO amchecktest SELECT g from generate_series(1, 1000000) g;
INSERT 0 1000000
postgres=# CREATE INDEX on amchecktest (id);
CREATE INDEX
postgres=# DELETE FROM amchecktest WHERE id > 100000 AND id < 120000;
DELETE 19999
postgres=# -- this will hit the error added by the patch
VACUUM amchecktest;
ERROR: aborting page deletion
CONTEXT: while vacuuming index "amchecktest_id_idx" of relation "public.amchecktest"
postgres=# select bt_index_parent_check('amchecktest_id_idx');
ERROR: mismatch between parent key and child high key in index "amchecktest_id_idx"
DETAIL: Target block=3 child block=276 target page lsn=0/6ED0DB68.
To fix this, I guess we need to teach bt_index_parent_check() about
half-dead pages. Anyone volunteer to write that patch?
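
As a rough idea of what "teaching the check about half-dead pages" could mean, here is a toy parent/child check that optionally skips half-dead leaves. It is a deliberately simplified sketch and not how amcheck is implemented, nor a proposed patch: ToyLeaf and toy_parent_check() are hypothetical names, and the parent is reduced to a plain array of separator keys.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy leaf page; not PostgreSQL's on-disk page format. */
typedef struct ToyLeaf
{
    int             high_key;   /* upper bound of the keys on this leaf */
    bool            half_dead;  /* stage 1 of deletion done, stage 2 pending */
    struct ToyLeaf *right;      /* right sibling, or NULL */
} ToyLeaf;

/*
 * Walk the leaf level left to right and compare each leaf's high key with
 * the next separator key in the parent.
 *
 * With skip_half_dead = false, a half-dead leaf (which no longer has an
 * entry in the parent) gets compared against a key that belongs to its
 * right sibling, and the check reports a mismatch. With skip_half_dead =
 * true, such leaves are stepped over without consuming a parent key.
 */
bool
toy_parent_check(const int *parent_keys, size_t nkeys,
                 const ToyLeaf *leftmost, bool skip_half_dead)
{
    size_t i = 0;

    assert(parent_keys != NULL || nkeys == 0);

    for (const ToyLeaf *leaf = leftmost; leaf != NULL; leaf = leaf->right)
    {
        if (skip_half_dead && leaf->half_dead)
            continue;           /* no parent entry is expected for it */
        if (i >= nkeys || parent_keys[i] != leaf->high_key)
            return false;       /* parent key / child high key mismatch */
        i++;
    }
    return i == nkeys;          /* every parent key was matched */
}
```

In this model, three leaves with high keys 10, 20 and 30, where the middle one is half-dead and the parent holds only {10, 30}, fail the strict check and pass the tolerant one.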
- Heikki