Re: Orphan page in _bt_split - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | Re: Orphan page in _bt_split |
Date | |
Msg-id | CAH2-Wz=RVy+uFn4SAmu3bNzkJfqOhhnAgamBbQZyMMRz09GeSw@mail.gmail.com Whole thread Raw |
In response to | Orphan page in _bt_split (Konstantin Knizhnik <knizhnik@garret.ru>) |
Responses |
Re: Orphan page in _bt_split
|
List | pgsql-hackers |
On Mon, Sep 1, 2025 at 1:03 AM Konstantin Knizhnik <knizhnik@garret.ru> wrote: > So we should zero this page in case of reporting error to "not confuse > vacuum. > It is done for all places where this function reports error before > entering critical section and wal-logging this page. Right. > _bp_getbuf just reads page in a shared buffer. But ReadBuffer actually > can invoke half of Postgres code (locating buffer, evicting victim, > writing to SMGR,..., So there are a lot of places where error can be > thrown (including even such error as statement timeout). > And in this case current transaction will be aborted without zeroing the > right page. > So vacuum will be confused. ... > 2025-09-01 07:46:37.459 EEST [89290] ERROR: @@@ Terminated > 2025-09-01 07:46:37.459 EEST [89290] STATEMENT: insert into t3 values > (generate_series(1,1000000),random()*1000000,0); > 2025-09-01 07:46:42.391 EEST [89406] LOG: failed to re-find parent key > in index "t3_sk_idx" for deletion target page 4 > 2025-09-01 07:46:42.391 EEST [89406] CONTEXT: while vacuuming index > "t3_sk_idx" of relation "public.t3" > TRAP: failed Assert("false"), File: "nbtpage.c", Line: 2849, PID: 89406 > This code is not changed for quite long time so I wonder why nobody > noticed this error before? The assertion failure + log output that you've shown is from _bt_lock_subtree_parent. I'm aware that that error appears from time to time in the field. When we hit this log message in a non-assert build, VACUUM should still be able to finish successfully. My assumption up until now has been that the corruption underlying these "failed to re-find parent" LOG messages is generally due to breaking changes to collations (page deletion uses an insertion scan key to "re-find" a downlink to the page undergoing deletion), and generic corruption. But it now seems possible that some of them were due to the problem that you've highlighted. The problem that you've highlighted happens late within _bt_split, when the contents of the new right sibling page (everything except its LSN) has already been written to the page. So I'm fairly confident that any real world problems would show themselves as "failed to re-find parent" LOG message from VACUUM (there are no downlinks or sibling pointers that point to the new right page, so only VACUUM will ever be able to read the page). > Proposed patch is attached. > It creates copy of right page in the same way as for left (original) page. > And updates the page only in critical section. I remember that when I worked on what became commit 9b42e71376 back in 2019 (which fixed a similar problem caused by the INCLUDE index patch), Tom suggested that we do things this way defensively (without being specifically aware of the _bt_getbuf hazard). That does seem like the best approach. I'm a little bit worried about the added overhead of allocating an extra BLCKSZ buffer for this. I wonder if it would make sense to use BLCKSZ arrays of char for both of the temp pages. -- Peter Geoghegan
pgsql-hackers by date: