Re: _bt_split(), and the risk of OOM before its critical section - Mailing list pgsql-hackers

From Tom Lane
Subject Re: _bt_split(), and the risk of OOM before its critical section
Date
Msg-id 17768.1557181751@sss.pgh.pa.us
Whole thread Raw
In response to _bt_split(), and the risk of OOM before its critical section  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: _bt_split(), and the risk of OOM before its critical section  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
Peter Geoghegan <pg@bowt.ie> writes:
> Commit 8fa30f906be reduced the elevel of a number of "can't happen"
> errors from PANIC to ERROR. These were all critical-section-adjacent
> errors involved in nbtree page splits, and nbtree page deletion. It
> also established the following convention within _bt_split(), which
> allowed Tom to keep the length of the critical section just as short
> as it had always been:

> /*
>  * origpage is the original page to be split.  leftpage is a temporary
>  * buffer that receives the left-sibling data, which will be copied back
>  * into origpage on success.  rightpage is the new page that receives the
>  * right-sibling data.  If we fail before reaching the critical section,
>  * origpage hasn't been modified and leftpage is only workspace. In
>  * principle we shouldn't need to worry about rightpage either, because it
>  * hasn't been linked into the btree page structure; but to avoid leaving
>  * possibly-confusing junk behind, we are careful to rewrite rightpage as
>  * zeroes before throwing any error.
>  */

> The INCLUDE indexes work looks like it subtly broke this, because it
> allocated memory after the initialization of the right page --
> allocating memory can always fail.

Yeah, as _bt_split is currently coded, _bt_truncate has to be a "no
errors" function, which it isn't.  The pfree for its result is being
done in an ill-chosen place, too.

Another problem now that I look at it is that the _bt_getbuf for the right
sibling is probably not too safe.  And the _bt_vacuum_cycleid() call seems
a bit dangerous from this standpoint as well.

> On the other hand, even when
> 8fa30f906be went in back in 2010 this "rule" was arguably broken,
> because we were already calling PageGetTempPage() after the right
> sibling page is initialized, which palloc()s a full BLCKSZ, which is
> far more than truncation is every likely to allocate.

I'm not really concerned about that one because at that point the
right page is still in a freshly-pageinit'd state.  It's perhaps
not quite as nice as having it be zeroes, but it won't look like
it has any interesting data.  (But, having said that, we could
certainly reorder the code to construct the temp page first.)

In any case, once we've started to fill the ropaque area, it would really
be better if we don't call anything that could throw errors.

Maybe we should bite the bullet and use two temp pages, so that none
of the data ends up in the shared buffer arena until we reach the
critical section?  The extra copying is slightly annoying, but
it certainly seems like enforcing this invariant over such a
long stretch of code is not very maintainable.

            regards, tom lane



pgsql-hackers by date:

Previous
From: David Fetter
Date:
Subject: Re: range_agg
Next
From: Peter Geoghegan
Date:
Subject: Re: _bt_split(), and the risk of OOM before its critical section