Orphan page in _bt_split - Mailing list pgsql-hackers

From Konstantin Knizhnik
Subject Orphan page in _bt_split
Date
Msg-id 566dacaf-5751-47e4-abc6-73de17a5d42a@garret.ru
Whole thread Raw
Responses Re: Orphan page in _bt_split
Re: Orphan page in _bt_split
List pgsql-hackers
Hi hacker,

The function _bt_split creates new (right) page to split into.
The comment says:

     /*
      * Acquire a new right page to split into, now that left page has a new
      * high key.  From here on, it's not okay to throw an error without
      * zeroing rightpage first.  This coding rule ensures that we won't
      * confuse future VACUUM operations, which might otherwise try to 
re-find
      * a downlink to a leftover junk page as the page undergoes deletion.
      *
      * It would be reasonable to start the critical section just after 
the new
      * rightpage buffer is acquired instead; that would allow us to avoid
      * leftover junk pages without bothering to zero rightpage. We do 
it this
      * way because it avoids an unnecessary PANIC when either origpage 
or its
      * existing sibling page are corrupt.
      */

So we should zero this page in case of reporting error to "not confuse 
vacuum.
It is done for all places where this function reports error before 
entering critical section and wal-logging this page.

But before it there is call to _bt_getbuf:

     /*
      * We have to grab the original right sibling (if any) and update 
its prev
      * link.  We are guaranteed that this is deadlock-free, since we couple
      * the locks in the standard order: left to right.
      */
     if (!isrightmost)
     {
         sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);

_bp_getbuf just reads page in a shared buffer. But ReadBuffer actually 
can invoke half of Postgres code (locating buffer, evicting victim, 
writing to SMGR,..., So there are a lot of places where error can be 
thrown (including even such error as statement timeout).
And in this case current transaction will be aborted without zeroing the 
right page.
So vacuum will be confused.

It can be quite easily demonstrated by inserting error here:

     /*
      * We have to grab the original right sibling (if any) and update 
its prev
      * link.  We are guaranteed that this is deadlock-free, since we couple
      * the locks in the standard order: left to right.
      */
     if (!isrightmost)
     {
         sbuf = _bt_getbuf(rel, oopaque->btpo_next, BT_WRITE);
         elog(ERROR, "@@@ Terminated");


And doing something like this:

postgres=# create table t3(pk integer primary key, sk integer, payload 
integer);
CREATE TABLE
postgres=# create index on t3(sk);
CREATE INDEX
postgres=# insert into t3 values 
(generate_series(1,1000000),random()*1000000,0);
ERROR:  @@@ Terminated
postgres=# vacuum t3;
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back 
the current transaction and exit, because another server process exited 
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and 
repeat your command.
server closed the connection unexpectedly
     This probably means the server terminated abnormally
     before or while processing the request.


log:

2025-09-01 07:46:37.459 EEST [89290] ERROR:  @@@ Terminated
2025-09-01 07:46:37.459 EEST [89290] STATEMENT:  insert into t3 values 
(generate_series(1,1000000),random()*1000000,0);
2025-09-01 07:46:42.391 EEST [89406] LOG:  failed to re-find parent key 
in index "t3_sk_idx" for deletion target page 4
2025-09-01 07:46:42.391 EEST [89406] CONTEXT:  while vacuuming index 
"t3_sk_idx" of relation "public.t3"
TRAP: failed Assert("false"), File: "nbtpage.c", Line: 2849, PID: 89406
0   postgres                            0x0000000104bf2f64 
ExceptionalCondition + 216
1   postgres                            0x00000001043ace94 
_bt_lock_subtree_parent + 248
2   postgres                            0x00000001043aaf98 
_bt_mark_page_halfdead + 508
3   postgres                            0x00000001043aaaec _bt_pagedel + 
1068
4   postgres                            0x00000001043b56cc btvacuumpage 
+ 2116
5   postgres                            0x00000001043b4dd4 btvacuumscan 
+ 564


This code is not changed for quite long time so I wonder why nobody 
noticed this error before?

Proposed patch is attached.
It creates copy of right page in the same way as for left (original) page.
And updates the page only in critical section.



Attachment

pgsql-hackers by date:

Previous
From: "Zhijie Hou (Fujitsu)"
Date:
Subject: RE: Conflict detection for update_deleted in logical replication
Next
From: Michael Paquier
Date:
Subject: Re: Update outdated references to SLRU ControlLock