Thread: BTP_DELETED leaf still in tree

BTP_DELETED leaf still in tree

From
Daniel Wood
Date:

Update query stuck in a loop.  Looping in _bt_moveright().

ExecInsertIndexTuples->btinsert->_bt_doinsert->_bt_search->_bt_moveright

Mid Tree Node downlink path taken by _bt_search points to a BTP_DELETED Leaf.

btpo_next is also DELETED but not in the tree.

btpo_next->btpo_next is NOT deleted but in the mid tree as a lesser key value.

Thus creating an endless loop in moveright.


The deleted page is in the tree.  The left leaf page still points to it.  The right leaf page points back to the deleted page.

The deleted page itself has arbitrary prev and next pointer.  But the next pointer does lead to a loop.


Is there any way, crash recovery or otherwise, that could result in a BTP_DELETED leaf which is still in the tree both in terms of the mid tree pointing down to it but also linked to by the 2 leaf siblings.  It is as if the mid tree and two siblings were updated but never made it to disk but the DELETED page itself got written.


Even after a restart the hang reoccurred.  Rebuild fixed the problem.  Unfortunately I'm not sure if I have enough log history left to examine.  But I do have the index file before the rebuild and it clearly has the loop on disk.



Re: BTP_DELETED leaf still in tree

From
Peter Geoghegan
Date:
On Thu, Oct 10, 2019 at 12:48 PM Daniel Wood <hexexpert@comcast.net> wrote:
> Update query stuck in a loop.  Looping in _bt_moveright().

You didn't say which PostgreSQL versions were involved, and if the
database was ever upgraded using pg_upgrade. Those details could
matter.

> ExecInsertIndexTuples->btinsert->_bt_doinsert->_bt_search->_bt_moveright
>
> Mid Tree Node downlink path taken by _bt_search points to a BTP_DELETED Leaf.

This should hardly ever happen -- it is barely possible for an index
scan to land on a BTP_DELETED leaf page (or a half-dead page) when
following a downlink in its parent. Recall that nbtree uses Lehman &
Yao's design, so _bt_search() does not "couple" buffer locks on the
way down. It would probably be impossible to observe this happening
without carefully setting breakpoints in multiple sessions.

If this happens reliably for you, which it sounds like, then you can
already assume that the index is corrupt.

> btpo_next is also DELETED but not in the tree.
>
> btpo_next->btpo_next is NOT deleted but in the mid tree as a lesser key value.
>
> Thus creating an endless loop in moveright.

Offhand, these other details sound normal. The side links are still
needed in fully deleted (BTP_DELETED) pages. And, moving right and
finding lesser key values (not greater key values) is normal with
deleted pages, since page deletion makes the keyspace move right, not
left (moving the keyspace left is how the source Lanin & Shasha paper
does it, though).

Actually, I take it back -- the looping part is not normal. The
btpo_next->btpo_next page has no business linking back to the
original/first deleted page you mentioned. That's just odd.

Can you provide me with a dump of the page images? The easiest way of
getting a page dump is described here:


https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump

If I had to guess, I'd guess that this was due to a generic storage problem.

-- 
Peter Geoghegan



Re: BTP_DELETED leaf still in tree

From
Peter Geoghegan
Date:
On Thu, Oct 10, 2019 at 1:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
> You didn't say which PostgreSQL versions were involved, and if the
> database was ever upgraded using pg_upgrade. Those details could
> matter.

In case you weren't aware, contrib/amcheck should make detected and
diagnosing these kinds of problems a lot easier. You should prefer to
use the bt_index_parent_check() variant (which will block writes and
VACUUM, but not reads). It will verify that sibling pointers are in
agreement with each other, and that leaf pages contain keys that are
covered by the relevant separator keys from the parent level.

If you happen to be using v11, then you might also want to use the
heapallindexed option -- that will verify that the heap and index are
in agreement. If the issue is on v12, the new "rootdescend" option can
detect very subtle cross-level transitive consistency options. (This
is only available in v12 because that was the version that made all
entries in the index unique by using heap TID as a unique-ifier.)

-- 
Peter Geoghegan



Re: BTP_DELETED leaf still in tree

From
Daniel Wood
Date:
> On October 10, 2019 at 1:18 PM Peter Geoghegan <pg@bowt.ie> wrote:
> 
> 
> On Thu, Oct 10, 2019 at 12:48 PM Daniel Wood <hexexpert@comcast.net> wrote:
> > Update query stuck in a loop.  Looping in _bt_moveright().
> 
> You didn't say which PostgreSQL versions were involved, and if the
> database was ever upgraded using pg_upgrade. Those details could
> matter.

PG_VERSION says 10.  I suspect we are running 10.9.  I have no idea if pg_upgrade was ever done.

> > ExecInsertIndexTuples->btinsert->_bt_doinsert->_bt_search->_bt_moveright
> >
> > Mid Tree Node downlink path taken by _bt_search points to a BTP_DELETED Leaf.
> 
> This should hardly ever happen -- it is barely possible for an index
> scan to land on a BTP_DELETED leaf page (or a half-dead page) when
> following a downlink in its parent. Recall that nbtree uses Lehman &
> Yao's design, so _bt_search() does not "couple" buffer locks on the
> way down. It would probably be impossible to observe this happening
> without carefully setting breakpoints in multiple sessions.
> 
> If this happens reliably for you, which it sounds like, then you can
> already assume that the index is corrupt.
> 
> > btpo_next is also DELETED but not in the tree.
> >
> > btpo_next->btpo_next is NOT deleted but in the mid tree as a lesser key value.
> >
> > Thus creating an endless loop in moveright.
> 
> Offhand, these other details sound normal. The side links are still
> needed in fully deleted (BTP_DELETED) pages. And, moving right and
> finding lesser key values (not greater key values) is normal with
> deleted pages, since page deletion makes the keyspace move right, not
> left (moving the keyspace left is how the source Lanin & Shasha paper
> does it, though).
> 
> Actually, I take it back -- the looping part is not normal. The
> btpo_next->btpo_next page has no business linking back to the
> original/first deleted page you mentioned. That's just odd.

btpo_next->btpo_next does NOT link directly back to the 1st deleted page.  It simply links to some in-use page which is
50or so leaf pages back in the tree.  Eventually we do reach the two deleted pages again.  Only the first one is in the
'tree'.

> Can you provide me with a dump of the page images? The easiest way of
> getting a page dump is described here:

Customer data.  Looks like meaningless customer data (5 digit key values).  But too much paperwork. :-)

The hard part for me to understand isn't just why the DELETED leaf node is still referenced in the mid tree node.
It is that the step which sets BTP_DELETED should have also linked its leaf and right siblings together.  But this
hasn'tbeen done.
 

Could the page have already have been dirty, but because of "target != leafblkno", we didn't stamp a new LSN on it.
Couldthis allow us to write the DELETED dirty page without the XLOG_BTREE_MARK_PAGE_HALFDEAD and XLOG_BTREE_UNLINK_PAGE
beingflushed?  Of course, I don't understand the "target != leafblkno".
 

In any case, thanks.

>
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump
> 
> If I had to guess, I'd guess that this was due to a generic storage problem.
> 
> -- 
> Peter Geoghegan



Re: BTP_DELETED leaf still in tree

From
Peter Geoghegan
Date:
On Fri, Oct 11, 2019 at 12:44 AM Daniel Wood <hexexpert@comcast.net> wrote:
> > Actually, I take it back -- the looping part is not normal. The
> > btpo_next->btpo_next page has no business linking back to the
> > original/first deleted page you mentioned. That's just odd.
>
> btpo_next->btpo_next does NOT link directly back to the 1st deleted page.  It simply links to some in-use page which
is50 or so leaf pages back in the tree. 

That sounds more normal.

> > Can you provide me with a dump of the page images? The easiest way of
> > getting a page dump is described here:
>
> Customer data.  Looks like meaningless customer data (5 digit key values).  But too much paperwork. :-)

Well, it was worth a try.  ;-)

> The hard part for me to understand isn't just why the DELETED leaf node is still referenced in the mid tree node.
> It is that the step which sets BTP_DELETED should have also linked its leaf and right siblings together.  But this
hasn'tbeen done. 

Before the page becomes BTP_DELETED, it must first be BTP_HALF_DEAD.
And that is also the point where it should be impossible for scans to
reach the page, more or less (there is still that narrow window where
the downlink can followed just before its deleted, making the scan
land on the BTP_HALF_DEAD page -- I mentioned this in my first mail).

> Could the page have already have been dirty, but because of "target != leafblkno", we didn't stamp a new LSN on it.
Couldthis allow us to write the DELETED dirty page without the XLOG_BTREE_MARK_PAGE_HALFDEAD and XLOG_BTREE_UNLINK_PAGE
beingflushed?  Of course, I don't understand the "target != leafblkno". 

The "target != leafblkno" thing concerns whether or not this is a
multi-level deletion (actually, that's not quite right, since even a
multi-level deletion has "target == leafblkno" at the point where it
finally gets to mark a half dead page fully deleted).

Yes, it's odd that this deleted page exists, even though its siblings
still link to it -- the distinction between a fully deleted page and a
half dead page is really just the fact that a fully deleted page is
supposed to not be linked to from anywhere, including still-live
siblings. But you don't have to get that far to see evidence of
corruption -- having a downlink pointing to a half-dead page is
evidence enough of corruption.

(Actually, it's more complicated than that -- see the comments in
amcheck's bt_downlink_check() function from Postgres 11 or 12.
Multi-level deletion is a case where a half-dead page has a downlink,
but the subtree undergoing deletion is still isolated in about the
same way as it is in the simple single level case, since the
"topparent" downlink is zapped at the same point that the leaf page is
marked half-dead. The important thing is that even half-dead pages are
not reachable by descending the tree, except for the tiny window where
the topparent downlink is observed the instant before it is zapped.)

If page deletion didn't exist, it would be so much easier to
understand the B-Tree code.

My guess is that there wasn't sufficient WAL to replay the page
deletion, but some of the buffers were written out. You might have
"gotten away with it" if the internal page also happened to be written
out along with everything else, but it just didn't work out that way.
Remember, there are two weird things about this, that overlap with two
distinct types of atomic operations: the fact that the downlink still
exists at all, and the fact that the sidelinks still exist at all.
This smells like a problem with slightly inconsistent page images, as
opposed to a problem with how one particular atomic operation did
something. It's not actually surprising that this would be the first
place that you'd notice a generic issue, since many other things are
"more forgiving" in various ways.

--
Peter Geoghegan