Thread: VACUUM can finish an interrupted nbtree page split -- is that okay?

VACUUM can finish an interrupted nbtree page split -- is that okay?

From
Peter Geoghegan
Date:
_bt_lock_branch_parent() is used by VACUUM during page deletion, and
calls _bt_getstackbuf(), which always finishes incomplete page splits
for the parent page that it exclusive locks and returns. ISTM that
this may be problematic, since it contradicts the general rule that
VACUUM isn't supposed to finish incomplete page splits. According to
the nbtree README:

"It would seem natural to add the missing downlinks in VACUUM, but since
inserting a downlink might require splitting a page, it might fail if you
run out of disk space.  That would be bad during VACUUM - the reason for
running VACUUM in the first place might be that you run out of disk space,
and now VACUUM won't finish because you're out of disk space.  In contrast,
an insertion can require enlarging the physical file anyway."

I'm inclined to note this as an exception in the nbtree README, and
leave it at that. Interrupted internal page splits are probably very
rare in practice, so the operational risk of running out of disk space
like this is minimal.

FWIW, I notice that the logic that appears after the
_bt_lock_branch_parent() call to _bt_getstackbuf() anticipates that it
must defend against interrupted splits in at least the
grandparent-of-leaf page, and maybe even the parent, so it's probably
not unworkable to not finish the split:

    /*
     * If the target is the rightmost child of its parent, then we can't
     * delete, unless it's also the only child.
     */
    if (poffset >= maxoff)
    {
        /* It's rightmost child... */
        if (poffset == P_FIRSTDATAKEY(opaque))
        {
            /*
             * It's only child, so safe if parent would itself be removable.
             * We have to check the parent itself, and then recurse to test
             * the conditions at the parent's parent.
             */
            if (P_RIGHTMOST(opaque) || P_ISROOT(opaque) ||
                P_INCOMPLETE_SPLIT(opaque))
            {
                _bt_relbuf(rel, pbuf);
                return false;
            }

Separately, I noticed another minor issue that appears a few lines
further down still:

            /*
             * Perform the same check on this internal level that
             * _bt_mark_page_halfdead performed on the leaf level.
             */
            if (_bt_is_page_halfdead(rel, *rightsib))
            {
                elog(DEBUG1, "could not delete page %u because its
right sibling %u is half-dead",
                     parent, *rightsib);
                return false;
            }

I thought that internal pages were never half-dead after Postgres 9.4.
If that happens, then the check within _bt_pagedel() will throw an
ERRCODE_INDEX_CORRUPTED error, and tell the DBA to REINDEX. Shouldn't
this internal level _bt_is_page_halfdead() check contain a "can't
happen" error, or even a simple assertion?

-- 
Peter Geoghegan


Peter Geoghegan <pg@bowt.ie> writes:
> _bt_lock_branch_parent() is used by VACUUM during page deletion, and
> calls _bt_getstackbuf(), which always finishes incomplete page splits
> for the parent page that it exclusive locks and returns. ISTM that
> this may be problematic, since it contradicts the general rule that
> VACUUM isn't supposed to finish incomplete page splits. According to
> the nbtree README:

> "It would seem natural to add the missing downlinks in VACUUM, but since
> inserting a downlink might require splitting a page, it might fail if you
> run out of disk space.  That would be bad during VACUUM - the reason for
> running VACUUM in the first place might be that you run out of disk space,
> and now VACUUM won't finish because you're out of disk space.  In contrast,
> an insertion can require enlarging the physical file anyway."

> I'm inclined to note this as an exception in the nbtree README, and
> leave it at that. Interrupted internal page splits are probably very
> rare in practice, so the operational risk of running out of disk space
> like this is minimal.

Also, if your WAL is on the same filesystem as the data, the whole
thing is pretty much moot anyway since VACUUM is surely going to add
WAL output.  I concur with not sweating over this.

> FWIW, I notice that the logic that appears after the
> _bt_lock_branch_parent() call to _bt_getstackbuf() anticipates that it
> must defend against interrupted splits in at least the
> grandparent-of-leaf page, and maybe even the parent, so it's probably
> not unworkable to not finish the split:

-ETOOMANYNEGATIVES ... can't quite parse your point here?

> I thought that internal pages were never half-dead after Postgres 9.4.
> If that happens, then the check within _bt_pagedel() will throw an
> ERRCODE_INDEX_CORRUPTED error, and tell the DBA to REINDEX. Shouldn't
> this internal level _bt_is_page_halfdead() check contain a "can't
> happen" error, or even a simple assertion?

I think that code is there to deal with the possibility of finding
an old half-dead page.  Don't know that it's safe to remove it yet.

            regards, tom lane


Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

From
Peter Geoghegan
Date:
On Fri, Mar 1, 2019 at 4:41 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > FWIW, I notice that the logic that appears after the
> > _bt_lock_branch_parent() call to _bt_getstackbuf() anticipates that it
> > must defend against interrupted splits in at least the
> > grandparent-of-leaf page, and maybe even the parent, so it's probably
> > not unworkable to not finish the split:
>
> -ETOOMANYNEGATIVES ... can't quite parse your point here?

Sorry.   :-)

My point was that it actually seems feasible to not do the split,
making the quoted paragraph from nbtree README correct as-is. But,
since we're happy to continue to finish the occasional interrupted
internal page split within VACUUM anyway, that isn't an important
point.

> I think that code is there to deal with the possibility of finding
> an old half-dead page.  Don't know that it's safe to remove it yet.

I don't know that it is either. My first instinct is to assume that
it's fine to remove the code, since, as I said, we're treating
internal pages that are half-dead as being ipso facto corrupt -- we'll
throw an error before too long anyway. However, "internal + half dead"
was a valid state for an nbtree  page prior to 9.4, and we make no
distinction about that (versioning nbtree indexes to deal with
cross-version incompatibilities came only in Postgres v11). Trying to
analyze whether or not it's truly safe to just not do this seems very
difficult, and I don't think that it's performance critical. This is a
problem only because it's distracting and confusing.

I favor keeping the test, but having it throw a
ERRCODE_INDEX_CORRUPTED error, just like _bt_pagedel() does already. A
comment could point out that the test is historical/defensive, and
probably isn't actually necessary. What do you think of that idea?

-- 
Peter Geoghegan


Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

From
Peter Geoghegan
Date:
On Fri, Mar 1, 2019 at 5:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> I favor keeping the test, but having it throw a
> ERRCODE_INDEX_CORRUPTED error, just like _bt_pagedel() does already. A
> comment could point out that the test is historical/defensive, and
> probably isn't actually necessary. What do you think of that idea?

Actually, while 9.4 did indeed start treating "internal + half dead"
pages as corrupt, it didn't exactly remove the *concept* of a half
dead internal page. I think that the cross check (the one referenced
by comments above the corresponding leaf/_bt_mark_page_halfdead() call
to _bt_is_page_halfdead()) might have problems in the event of an
interrupted *multi-level* page deletion. I wonder, is there a subtle
bug here that bugfix commit 8da31837803 didn't quite manage to
prevent? (This commit added both of the _bt_is_page_halfdead()
checks.)

(Thinks some more...)

Actually, I think that bugfix commit 8da31837803 works despite
possible "logically half dead internal pages", because in the event of
such an internal page the sibling would actually have to be the
*cousin* of the original parent (half dead/leaf page parent), not the
"true sibling" (otherwise, cousin's multi-level page deletion should
never have taken place). I think that we'll end up doing the right
thing with the downlinks in the grandparent page, despite there being
an interrupted multi-level deletion in the cousin's subtree. Since
cousin *atomically* removed its downlink in our shared *grandparent*
(not its parent) at the same time some leaf page was initially marked
half-dead, everything works out.

Page deletion is painfully complicated. Seems wise to keep the
internal page test, out of sheer paranoia, while making it an error as
suggested earlier. I will definitely want to think about it some more,
though.

-- 
Peter Geoghegan


Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

From
Peter Geoghegan
Date:
On Fri, Mar 1, 2019 at 3:59 PM Peter Geoghegan <pg@bowt.ie> wrote:
>             /*
>              * Perform the same check on this internal level that
>              * _bt_mark_page_halfdead performed on the leaf level.
>              */
>             if (_bt_is_page_halfdead(rel, *rightsib))

> I thought that internal pages were never half-dead after Postgres 9.4.
> If that happens, then the check within _bt_pagedel() will throw an
> ERRCODE_INDEX_CORRUPTED error, and tell the DBA to REINDEX. Shouldn't
> this internal level _bt_is_page_halfdead() check contain a "can't
> happen" error, or even a simple assertion?

I think that we should get rid of this code on HEAD shortly, because
it's effectively dead code. You don't have to know anything about
B-Trees to see why this must be true: VACUUM is specifically checking
if an internal page is half-dead here, even though it's already
treating half-dead internal pages as ipso facto corrupt in higher
level code (it's the first thing we check in _bt_pagedel()). This is
clearly contradictory. If there is a half-dead internal page, then
there is no danger of VACUUM not complaining loudly that you need to
REINDEX. This has been true since the page deletion overhaul that went
into 9.4.

Attached patch removes the internal page check, and adds a comment
that explains why it's sufficient to check on the leaf level alone.
Admittedly, it's much easier to understand why the
_bt_is_page_halfdead() internal page check is useless than it is to
understand why replacing it with this comment is helpful. My
observation that a half-dead leaf page is representative of the
subtree whose root the leaf page has stored as its "top parent" is
hardly controversial, though -- that's the whole basis of multi-level
page deletion. If you *visualize* how multi-level deletion works, and
consider its rightmost-in-subtree restriction, then it isn't hard to
see why everything works out with just the leaf level
right-sibling-is-half-dead check:

We can only have two adjacent "skinny" pending-deletion subtrees in
cases where the removed check might seem to be helpful -- each page is
both the leftmost and the rightmost on its level in its subtree. It's
okay to just check if the leaf is half-dead because it "owns" exactly
the same range in the keyspace as the internal pages up to and
including its top parent, if any, and because it is marked half-dead
by the same atomic operation that does initial removal of downlinks in
an ancestor page.

I'm fine with waiting until we branch-off v12 before pushing the
patch, even though it seems low risk.

--
Peter Geoghegan

Attachment

Re: VACUUM can finish an interrupted nbtree page split -- is thatokay?

From
Heikki Linnakangas
Date:
On 05/05/2019 01:38, Peter Geoghegan wrote:
> On Fri, Mar 1, 2019 at 3:59 PM Peter Geoghegan <pg@bowt.ie> wrote:
>>              /*
>>               * Perform the same check on this internal level that
>>               * _bt_mark_page_halfdead performed on the leaf level.
>>               */
>>              if (_bt_is_page_halfdead(rel, *rightsib))
> 
>> I thought that internal pages were never half-dead after Postgres 9.4.
>> If that happens, then the check within _bt_pagedel() will throw an
>> ERRCODE_INDEX_CORRUPTED error, and tell the DBA to REINDEX. Shouldn't
>> this internal level _bt_is_page_halfdead() check contain a "can't
>> happen" error, or even a simple assertion?
> 
> I think that we should get rid of this code on HEAD shortly, because
> it's effectively dead code. You don't have to know anything about
> B-Trees to see why this must be true: VACUUM is specifically checking
> if an internal page is half-dead here, even though it's already
> treating half-dead internal pages as ipso facto corrupt in higher
> level code (it's the first thing we check in _bt_pagedel()). This is
> clearly contradictory. If there is a half-dead internal page, then
> there is no danger of VACUUM not complaining loudly that you need to
> REINDEX. This has been true since the page deletion overhaul that went
> into 9.4.

I don't understand that reasoning. Yes, _bt_pagedel() will complain if 
it finds a half-dead internal page. But how does that mean that 
_bt_lock_branch_parent() can't encounter one?

- Heikki



Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

From
Peter Geoghegan
Date:
On Tue, May 7, 2019 at 12:27 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I don't understand that reasoning. Yes, _bt_pagedel() will complain if
> it finds a half-dead internal page. But how does that mean that
> _bt_lock_branch_parent() can't encounter one?

I suppose that in theory it could, but only if you allow that any
possible state could be found -- it doesn't seem any more likely than
any other random illegal state.

Even when it happens, we'll get a "failed to re-find parent key" error
message when we go a bit further. Isn't that a logical outcome?

Actually, maybe we won't get that error, because we're talking about a
corrupt index, and all bets are off -- no reason to think that the
half-dead internal page would be consistent with other pages in any
way. But even then, you'll go on to report it in the usual way, since
VACUUM scans nbtree indexes in physical order.
-- 
Peter Geoghegan



Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

From
Peter Geoghegan
Date:
On Tue, May 7, 2019 at 9:59 AM Peter Geoghegan <pg@bowt.ie> wrote:
> On Tue, May 7, 2019 at 12:27 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> > I don't understand that reasoning. Yes, _bt_pagedel() will complain if
> > it finds a half-dead internal page. But how does that mean that
> > _bt_lock_branch_parent() can't encounter one?
>
> I suppose that in theory it could, but only if you allow that any
> possible state could be found -- it doesn't seem any more likely than
> any other random illegal state.

To be fair, I suppose that the code made more sense when it first went
in, because at the time there was a chance that there could be
leftover half-dead internal pages. But, that was a long time ago now.

I wonder why the code wasn't complaining about corruption loudly, like
the top level code, instead of treating half-dead pages as a
legitimate reason to not proceed with multi-level page deletion. That
would have been overkill, but it would have made much more sense IMV.

I would like to proceed with pushing this patch to HEAD in the next
few days, since it's clearly removing code that can't be useful. Are
there any objections?

-- 
Peter Geoghegan



Peter Geoghegan <pg@bowt.ie> writes:
> To be fair, I suppose that the code made more sense when it first went
> in, because at the time there was a chance that there could be
> leftover half-dead internal pages. But, that was a long time ago now.

Is there a good reason to assume there are none left anywhere?

            regards, tom lane



Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

From
Peter Geoghegan
Date:
On Mon, May 13, 2019 at 8:30 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Peter Geoghegan <pg@bowt.ie> writes:
> > To be fair, I suppose that the code made more sense when it first went
> > in, because at the time there was a chance that there could be
> > leftover half-dead internal pages. But, that was a long time ago now.
>
> Is there a good reason to assume there are none left anywhere?

That is not an assumption that the proposed patch rests upon, though
it is true that there are probably going to be virtually no half-dead
internal pages that make there way on to a Postgres 12 installation.
You'd have to do a CREATE INDEX on Postgres 9.3, and then not VACUUM
or REINDEX the index once it was on a 9.4+ installation. I suppose
that a 9.3 -> 12 upgrade is the most plausible scenario in which you
could actual get a half-dead internal page on a Postgres 12
installation.

Even when that happens, the index is already considered corrupt by
VACUUM, so the same VACUUM process that could in theory be adversely
affected by removing the half-dead internal page check will complain
about the page when it gets to it later on -- the user will be told to
REINDEX. And even then, we will never actually get to apply the check
that I propose to remove, since we're already checking the leaf page
sibling of the leaf-level target -- the leaf-level test that was added
by efada2b8e92 was clearly necessary. But it was also sufficient (no
equivalent internal half-dead right sibling test is needed): a 9.3-era
half-dead internal page cannot have more than one child, which must be
undergoing deletion as well.

If somebody doubted my rationale for why we don't need to do anything
more on internal page levels in installations where the user didn't
pg_upgrade from a version that's < 9.4, then they'd still have to
explain why we haven't heard of any problems in 5 years, and probably
offer some alternative fix that considers "logically half-dead
internal pages" (i.e. pages that are or will be the top parent in a
deletion chain). Because the code that I propose to remove obviously
cannot be doing much of anything for indexes built on 9.4+.

-- 
Peter Geoghegan



Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

From
Peter Geoghegan
Date:
On Mon, May 13, 2019 at 9:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Even when that happens, the index is already considered corrupt by
> VACUUM, so the same VACUUM process that could in theory be adversely
> affected by removing the half-dead internal page check will complain
> about the page when it gets to it later on -- the user will be told to
> REINDEX. And even then, we will never actually get to apply the check
> that I propose to remove, since we're already checking the leaf page
> sibling of the leaf-level target -- the leaf-level test that was added
> by efada2b8e92 was clearly necessary. But it was also sufficient (no
> equivalent internal half-dead right sibling test is needed): a 9.3-era
> half-dead internal page cannot have more than one child, which must be
> undergoing deletion as well.

Actually, now that I look back at how page deletion worked 5+ years
ago, I realize that I have this slightly wrong: the leaf level check
is not sufficient to figure out if the parent's right sibling is
pending deletion (which is represented explicitly as a half-dead
internal page prior to 9.4). All the same, I'm going to push ahead
with this patch. Bugfix commit efada2b8e92 was always about a bug in
9.4 -- it had nothing to do with 9.3. And, in the unlikely event that
there is a problem on a pg_upgrade'd 9.3 -> 12 database that happens
to have half-dead internal pages, we'll still get a useful, correct
error from VACUUM one way or another. It might be slightly less
friendly as error messages about corruption go, but that seems
acceptable to me.

-- 
Peter Geoghegan



Re: VACUUM can finish an interrupted nbtree page split -- is that okay?

From
Peter Geoghegan
Date:
On Thu, May 16, 2019 at 1:05 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Actually, now that I look back at how page deletion worked 5+ years
> ago, I realize that I have this slightly wrong: the leaf level check
> is not sufficient to figure out if the parent's right sibling is
> pending deletion (which is represented explicitly as a half-dead
> internal page prior to 9.4). All the same, I'm going to push ahead
> with this patch. Bugfix commit efada2b8e92 was always about a bug in
> 9.4 -- it had nothing to do with 9.3.

I meant bugfix commit 8da31837803 (commit efada2b8e92 was the commit
that had the bug in question).


-- 
Peter Geoghegan