Re: Possible duplicate release of buffer lock. - Mailing list pgsql-hackers

From Kyotaro HORIGUCHI
Subject Re: Possible duplicate release of buffer lock.
Date
Msg-id 20160805.145733.175979268.horiguchi.kyotaro@lab.ntt.co.jp
Whole thread Raw
In response to Re: Possible duplicate release of buffer lock.  (Peter Geoghegan <pg@heroku.com>)
Responses Re: Possible duplicate release of buffer lock.  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
Hello,

At Thu, 4 Aug 2016 16:09:17 -0700, Peter Geoghegan <pg@heroku.com> wrote in
<CAM3SWZS2+ALNdDjWyeRLpjvss_MYU+n_9DQAL9fg4i-KPc9Q0A@mail.gmail.com>
> On Wed, Aug 3, 2016 at 1:31 AM, Kyotaro HORIGUCHI
> <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> > The first line is emitted for simultaneous deletion of a index
> > page, which is impossible by design in a consistent state so the
> > complained situation should be the result of an index corruption
> > before upgading, specifically, inconsistent sibling pointers
> > around a deleted page.
> 
> > My point here is that if concurrent deletion can't be perfomed by
> > the current implement, this while loop could be removed and
> > immediately error out or log a message,
> 
> Perhaps I have misunderstood, but I must ask: Are you aware that 9.4
> has commit efada2b8, which fixes a bug with page deletion, that never
> made it into earlier branches?

_bt_unlink_halfdead_page in nbtpage.c of 9.4 that I visited this
time is after the commit. And it doesn't change the condition for
the "no left sibling" message. If I misunderstood you.

> It is worth noting that anything that can make VACUUM in particular
> raise an error is considered problematic. For example, we do not allow
> VACUUM to finish incomplete B-Tree page splits, even though we could,
> because there might be no disk space for even one extra page at the
> worst possible time (we VACUUM in part to free disk space, of course).
> We really want to let VACUUM finish, if at all possible, even if it
> sees something that indicates data corruption.

Yes, It seems to me that the "no left sibling" is ignorable since
skipping for the case doesn't more harm.

For this inquiry case, VACUUM stopped by a seemingly
double-releasing of a lock just after "no left sibling". I
suspect that it is caused by skipping deleted page just left of
the target page and it can happen for certain type of
corruption. (writing out pages in certain order then crash and
incomplete crash recovery after that would cause this..)

If concurrent deletion (marking half-dead, specifically) can't
happen, P_ISDELETED() won't be seen in the loop but we may
consider it as a skippable condition to continue VACUUM. But
still I think we shouldn't release the lock on the target page
there. But it doesn't harm except that VACUUM stops there so I
don't mind it will be as it is. So I'd like to have opinions of
others about this.

> I remember that during the review of the similar B-Tree page split
> patch (not efada2b8, but rather commit 40dae7ec, which was also in
> 9.4), I found bugs due to functions not being very clear about whether
> a buffer lock and/or pin were held upon returning from a function in
> all cases, including very rare cases or "can't happen" cases. So, this
> seems kind of familiar.

I think the point is that is a danger leads to wrong result or
crash of server, or not. On that criteria, this doesn't harm,
maybe.

> Code comments in the relevant function say this:
> 
>  * Returns 'false' if the page could not be unlinked (shouldn't happen).
>  * If the (new) right sibling of the page is empty, *rightsib_empty is set
>  * to true.
>  */
> static bool
> _bt_unlink_halfdead_page(Relation rel, Buffer leafbuf, bool *rightsib_empty)
> {
> 
> What we see here is that this "shouldn't happen" condition does, in
> fact, happen very rarely, including in cases where the target is not a
> leaf page.
> 
> I think there might well be a real bug here, because we are not
> careful enough in ensuring that the _bt_unlink_halfdead_page()
> "contract" holds when something that "can't happen" does, in fact,
> happen:
> 
>     /*
>      * Release the target, if it was not the leaf block.  The leaf is always
>      * kept locked.
>      */
>     if (target != leafblkno)
>         _bt_relbuf(rel, buf);
> 
>     return true;
> }
> 
> (Assuming you can call this code at the end of
> _bt_unlink_halfdead_page() its "contract".)
> 
> It's not good at all if the code is stuck in a way that prevents
> VACUUM from finishing, as I said. That seems to be the case here,
> judging from Horiguchi-san's log output, which shows an ERROR from
> "lock main 13879 is not held". The problem is not so much that a
> "can't happen" thing happens (data corruption will sometimes make "the
> impossible" possible, after all -- we know this, and write code
> defensively because of this). The bug is that there is any ERROR for
> VACUUM that isn't absolutely unavoidable (the damage has been done
> anyway -- the index is already corrupt).

Agreed. My thought is almost the as this, it seems to be better
to save such 'not absolutely ununvoidable' case unless it's too
complicated.

> I'll need to think about this some more, when I have more time.
> Perhaps tomorrow.

regards,

-- 
Kyotaro Horiguchi
NTT Open Source Software Center





pgsql-hackers by date:

Previous
From: Kyotaro HORIGUCHI
Date:
Subject: Re: Possible duplicate release of buffer lock.
Next
From: Andres Freund
Date:
Subject: Re: Detecting skipped data from logical slots (data silently skipped)