_bt_split(), and the risk of OOM before its critical section - Mailing list pgsql-hackers

From Peter Geoghegan
Subject _bt_split(), and the risk of OOM before its critical section
Date
Msg-id CAH2-WzkcWT_-NH7EeL=Az4efg0KCV+wArygW8zKB=+HoP=VWMw@mail.gmail.com
Whole thread Raw
Responses Re: _bt_split(), and the risk of OOM before its critical section  (Peter Geoghegan <pg@bowt.ie>)
Re: _bt_split(), and the risk of OOM before its critical section  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Commit 8fa30f906be reduced the elevel of a number of "can't happen"
errors from PANIC to ERROR. These were all critical-section-adjacent
errors involved in nbtree page splits, and nbtree page deletion. It
also established the following convention within _bt_split(), which
allowed Tom to keep the length of the critical section just as short
as it had always been:

/*
 * origpage is the original page to be split.  leftpage is a temporary
 * buffer that receives the left-sibling data, which will be copied back
 * into origpage on success.  rightpage is the new page that receives the
 * right-sibling data.  If we fail before reaching the critical section,
 * origpage hasn't been modified and leftpage is only workspace. In
 * principle we shouldn't need to worry about rightpage either, because it
 * hasn't been linked into the btree page structure; but to avoid leaving
 * possibly-confusing junk behind, we are careful to rewrite rightpage as
 * zeroes before throwing any error.
 */

The INCLUDE indexes work looks like it subtly broke this, because it
allocated memory after the initialization of the right page --
allocating memory can always fail. On the other hand, even when
8fa30f906be went in back in 2010 this "rule" was arguably broken,
because we were already calling PageGetTempPage() after the right
sibling page is initialized, which palloc()s a full BLCKSZ, which is
far more than truncation is every likely to allocate.

On the other other hand, it seems to me that the PageGetTempPage()
thing might have been okay, because it happens before the high key is
inserted on the new right buffer page. The same cannot be said for the
way we generate a new high key for the left/old page via suffix
truncation, which happens to occur after the right buffer page is
first modified by inserted its high key (the original/left page's
original high key). I think that there may be a risk that VACUUM's
page deletion code will get confused by finding an errant right
sibling page from a failed page split when there is a high key. If so,
that would be a risk that was introduced in Postgres 11, and made much
more likely in practice in Postgres 12. (I haven't got as far as doing
an analysis of the risks to page deletion, though. The "fastpath"
rightmost page insertion optimization that was also added to Postgres
11 seems like it also might need to be considered here.)

-- 
Peter Geoghegan



pgsql-hackers by date:

Previous
From: Alvaro Herrera
Date:
Subject: Re: Attempt to consolidate reading of XLOG page
Next
From: Robert Haas
Date:
Subject: Re: Attempt to consolidate reading of XLOG page