Thread: BTP_DELETED leaf still in tree
Update query stuck in a loop. Looping in _bt_moveright().
ExecInsertIndexTuples->btinsert->_bt_doinsert->_bt_search->_bt_moveright
Mid Tree Node downlink path taken by _bt_search points to a BTP_DELETED Leaf.
btpo_next is also DELETED but not in the tree.
btpo_next->btpo_next is NOT deleted but in the mid tree as a lesser key value.
Thus creating an endless loop in moveright.
The deleted page is in the tree. The left leaf page still points to it. The right leaf page points back to the deleted page.
The deleted page itself has arbitrary prev and next pointer. But the next pointer does lead to a loop.
Is there any way, crash recovery or otherwise, that could result in a BTP_DELETED leaf which is still in the tree both in terms of the mid tree pointing down to it but also linked to by the 2 leaf siblings. It is as if the mid tree and two siblings were updated but never made it to disk but the DELETED page itself got written.
Even after a restart the hang reoccurred. Rebuild fixed the problem. Unfortunately I'm not sure if I have enough log history left to examine. But I do have the index file before the rebuild and it clearly has the loop on disk.
On Thu, Oct 10, 2019 at 12:48 PM Daniel Wood <hexexpert@comcast.net> wrote: > Update query stuck in a loop. Looping in _bt_moveright(). You didn't say which PostgreSQL versions were involved, and if the database was ever upgraded using pg_upgrade. Those details could matter. > ExecInsertIndexTuples->btinsert->_bt_doinsert->_bt_search->_bt_moveright > > Mid Tree Node downlink path taken by _bt_search points to a BTP_DELETED Leaf. This should hardly ever happen -- it is barely possible for an index scan to land on a BTP_DELETED leaf page (or a half-dead page) when following a downlink in its parent. Recall that nbtree uses Lehman & Yao's design, so _bt_search() does not "couple" buffer locks on the way down. It would probably be impossible to observe this happening without carefully setting breakpoints in multiple sessions. If this happens reliably for you, which it sounds like, then you can already assume that the index is corrupt. > btpo_next is also DELETED but not in the tree. > > btpo_next->btpo_next is NOT deleted but in the mid tree as a lesser key value. > > Thus creating an endless loop in moveright. Offhand, these other details sound normal. The side links are still needed in fully deleted (BTP_DELETED) pages. And, moving right and finding lesser key values (not greater key values) is normal with deleted pages, since page deletion makes the keyspace move right, not left (moving the keyspace left is how the source Lanin & Shasha paper does it, though). Actually, I take it back -- the looping part is not normal. The btpo_next->btpo_next page has no business linking back to the original/first deleted page you mentioned. That's just odd. Can you provide me with a dump of the page images? The easiest way of getting a page dump is described here: https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump If I had to guess, I'd guess that this was due to a generic storage problem. -- Peter Geoghegan
On Thu, Oct 10, 2019 at 1:18 PM Peter Geoghegan <pg@bowt.ie> wrote: > You didn't say which PostgreSQL versions were involved, and if the > database was ever upgraded using pg_upgrade. Those details could > matter. In case you weren't aware, contrib/amcheck should make detected and diagnosing these kinds of problems a lot easier. You should prefer to use the bt_index_parent_check() variant (which will block writes and VACUUM, but not reads). It will verify that sibling pointers are in agreement with each other, and that leaf pages contain keys that are covered by the relevant separator keys from the parent level. If you happen to be using v11, then you might also want to use the heapallindexed option -- that will verify that the heap and index are in agreement. If the issue is on v12, the new "rootdescend" option can detect very subtle cross-level transitive consistency options. (This is only available in v12 because that was the version that made all entries in the index unique by using heap TID as a unique-ifier.) -- Peter Geoghegan
> On October 10, 2019 at 1:18 PM Peter Geoghegan <pg@bowt.ie> wrote: > > > On Thu, Oct 10, 2019 at 12:48 PM Daniel Wood <hexexpert@comcast.net> wrote: > > Update query stuck in a loop. Looping in _bt_moveright(). > > You didn't say which PostgreSQL versions were involved, and if the > database was ever upgraded using pg_upgrade. Those details could > matter. PG_VERSION says 10. I suspect we are running 10.9. I have no idea if pg_upgrade was ever done. > > ExecInsertIndexTuples->btinsert->_bt_doinsert->_bt_search->_bt_moveright > > > > Mid Tree Node downlink path taken by _bt_search points to a BTP_DELETED Leaf. > > This should hardly ever happen -- it is barely possible for an index > scan to land on a BTP_DELETED leaf page (or a half-dead page) when > following a downlink in its parent. Recall that nbtree uses Lehman & > Yao's design, so _bt_search() does not "couple" buffer locks on the > way down. It would probably be impossible to observe this happening > without carefully setting breakpoints in multiple sessions. > > If this happens reliably for you, which it sounds like, then you can > already assume that the index is corrupt. > > > btpo_next is also DELETED but not in the tree. > > > > btpo_next->btpo_next is NOT deleted but in the mid tree as a lesser key value. > > > > Thus creating an endless loop in moveright. > > Offhand, these other details sound normal. The side links are still > needed in fully deleted (BTP_DELETED) pages. And, moving right and > finding lesser key values (not greater key values) is normal with > deleted pages, since page deletion makes the keyspace move right, not > left (moving the keyspace left is how the source Lanin & Shasha paper > does it, though). > > Actually, I take it back -- the looping part is not normal. The > btpo_next->btpo_next page has no business linking back to the > original/first deleted page you mentioned. That's just odd. btpo_next->btpo_next does NOT link directly back to the 1st deleted page. It simply links to some in-use page which is 50or so leaf pages back in the tree. Eventually we do reach the two deleted pages again. Only the first one is in the 'tree'. > Can you provide me with a dump of the page images? The easiest way of > getting a page dump is described here: Customer data. Looks like meaningless customer data (5 digit key values). But too much paperwork. :-) The hard part for me to understand isn't just why the DELETED leaf node is still referenced in the mid tree node. It is that the step which sets BTP_DELETED should have also linked its leaf and right siblings together. But this hasn'tbeen done. Could the page have already have been dirty, but because of "target != leafblkno", we didn't stamp a new LSN on it. Couldthis allow us to write the DELETED dirty page without the XLOG_BTREE_MARK_PAGE_HALFDEAD and XLOG_BTREE_UNLINK_PAGE beingflushed? Of course, I don't understand the "target != leafblkno". In any case, thanks. > https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#contrib.2Fpageinspect_page_dump > > If I had to guess, I'd guess that this was due to a generic storage problem. > > -- > Peter Geoghegan
On Fri, Oct 11, 2019 at 12:44 AM Daniel Wood <hexexpert@comcast.net> wrote: > > Actually, I take it back -- the looping part is not normal. The > > btpo_next->btpo_next page has no business linking back to the > > original/first deleted page you mentioned. That's just odd. > > btpo_next->btpo_next does NOT link directly back to the 1st deleted page. It simply links to some in-use page which is50 or so leaf pages back in the tree. That sounds more normal. > > Can you provide me with a dump of the page images? The easiest way of > > getting a page dump is described here: > > Customer data. Looks like meaningless customer data (5 digit key values). But too much paperwork. :-) Well, it was worth a try. ;-) > The hard part for me to understand isn't just why the DELETED leaf node is still referenced in the mid tree node. > It is that the step which sets BTP_DELETED should have also linked its leaf and right siblings together. But this hasn'tbeen done. Before the page becomes BTP_DELETED, it must first be BTP_HALF_DEAD. And that is also the point where it should be impossible for scans to reach the page, more or less (there is still that narrow window where the downlink can followed just before its deleted, making the scan land on the BTP_HALF_DEAD page -- I mentioned this in my first mail). > Could the page have already have been dirty, but because of "target != leafblkno", we didn't stamp a new LSN on it. Couldthis allow us to write the DELETED dirty page without the XLOG_BTREE_MARK_PAGE_HALFDEAD and XLOG_BTREE_UNLINK_PAGE beingflushed? Of course, I don't understand the "target != leafblkno". The "target != leafblkno" thing concerns whether or not this is a multi-level deletion (actually, that's not quite right, since even a multi-level deletion has "target == leafblkno" at the point where it finally gets to mark a half dead page fully deleted). Yes, it's odd that this deleted page exists, even though its siblings still link to it -- the distinction between a fully deleted page and a half dead page is really just the fact that a fully deleted page is supposed to not be linked to from anywhere, including still-live siblings. But you don't have to get that far to see evidence of corruption -- having a downlink pointing to a half-dead page is evidence enough of corruption. (Actually, it's more complicated than that -- see the comments in amcheck's bt_downlink_check() function from Postgres 11 or 12. Multi-level deletion is a case where a half-dead page has a downlink, but the subtree undergoing deletion is still isolated in about the same way as it is in the simple single level case, since the "topparent" downlink is zapped at the same point that the leaf page is marked half-dead. The important thing is that even half-dead pages are not reachable by descending the tree, except for the tiny window where the topparent downlink is observed the instant before it is zapped.) If page deletion didn't exist, it would be so much easier to understand the B-Tree code. My guess is that there wasn't sufficient WAL to replay the page deletion, but some of the buffers were written out. You might have "gotten away with it" if the internal page also happened to be written out along with everything else, but it just didn't work out that way. Remember, there are two weird things about this, that overlap with two distinct types of atomic operations: the fact that the downlink still exists at all, and the fact that the sidelinks still exist at all. This smells like a problem with slightly inconsistent page images, as opposed to a problem with how one particular atomic operation did something. It's not actually surprising that this would be the first place that you'd notice a generic issue, since many other things are "more forgiving" in various ways. -- Peter Geoghegan