amcheck (B-Tree integrity checking tool) - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | amcheck (B-Tree integrity checking tool) |
Date | |
Msg-id | CAM3SWZQzLMhMwmBqjzK+pRKXrNUZ4w90wYMUWfkeV8mZ3Debvw@mail.gmail.com Whole thread Raw |
Responses |
Re: amcheck (B-Tree integrity checking tool)
Re: amcheck (B-Tree integrity checking tool) Re: amcheck (B-Tree integrity checking tool) |
List | pgsql-hackers |
I was assigned an "action point" during the FOSDEM developer meeting: "Post new version of btree consistency checker patch". I attach a new WIP version of my consistency checker tool, amcheck. This patch is proposed for 9.6, as an extension in contrib -- maybe we can still get it in. This is the first time I've added any version of this to a commitfest, although I've posted a couple of rough versions of this in the past couple of years. The attached version has received a major overhaul, and is primarily aimed at finding corruption in production systems, although I think it will still have significant value for debugging too. Maybe it can help with some of the B-Tree patches in the final commitfest, for example. I also have some hope that it will become a learning tool for people interested in how B-Tree indexes work. To recap, the extension adds some SQL-callable functions that verify certain invariant conditions hold within some particular B-Tree index. These are the conditions that index scans rely on always being true. The tool's scope may eventually cover other AMs, including heapam, but nbtree seems like the best place to start. Note that no function currently checks that the index is consistent with the heap, which would be very useful (that's probably how I'd eventually target the heapam, actually). Invariants ======== nbtree invariants that the tool verifies with just an AccessShareLock on the relation are: * That all items are in the correct, opclass order on each page. * That the page "high key", if any, actually bounds the items on the page. * That the last item on a page is less than or equal to the first item on the next page (the page to its right). The idea here is that the key space spans multiple pages, not just one page, so it make sense to check the last item where we can. With an ExclusiveLock + ShareLock, some addition invariants are verified: * That child pages actually have their parent's downlink as a lower bound. * Sane right links and left links at each level. Obviously, this tool is all about distrusting the structure of a B-Tree index. That being the case, it's not always totally clear where to draw the line. I think I have the balance about right, though. Interface ======= There are only 2 SQL callable functions in the extension, which are very similar: bt_index_check(index regclass) bt_index_parent_check(index regclass) The latter is more thorough than the former -- it performs all checks, including those checks that I mentioned required an ExclusiveLock. So, bt_index_check() only requires an AccessShareLock. bt_index_parent_check() requires an ExclusiveLock on the index relation, and a ShareLock on its heap relation, almost like REINDEX. bt_index_parent_check() performs verification that is a superset of the verification performed by bt_index_check() -- mostly, the extra verification/work is that it verifies downlinks against child pages. Both functions raise an error in the event of observing that an invariant in a B-Tree was violated, such as items being out of order on a page. I've written extensive documentation, which goes into practical aspects of using amcheck effectively. It doesn't go into significant detail about every invariant that is checked, but gives a good idea of roughly what checks are performed. I could almost justify only having one function with an argument about the downlink/child verification, but that would be a significant footgun given the varying locking requirements that such a function would have. Locking ====== We never rely on something like holding on to a buffer pin as an interlock for correctness (the vacuum interlock thing isn't generally necessary, because we don't look at the heap at all). We simply pin + BT_READ lock a buffer, copy it into local memory allocated by palloc(), and then immediately release the buffer lock and drop the pin. This is the same in all instances. There is never more than one buffer lock or pin held at a time. We do, on the other hand, have a detailed rationale for why it's okay that we don't have an ExclusiveLock on the index relation for checks that span the key space of more than one page by following right links to compare items across sibling pages. This isn't the same thing as having an explicit interlock like a pin -- our interlock is one against *recycling* by vacuum, which is based on recentGlobalXmin. This rationale requires expert review. Performance ========== Trying to keep the tool as simple as possible, while still making it do verification that is as useful as possible was my priority here, not performance. Still, verification completes fairly quickly. Certainly, it takes far less time than having to REINDEX the index, and doesn't need too much memory. I think that in practice most problems that can be detected by the B-Tree checker functions will be detected with the lighter variant. -- Peter Geoghegan
Attachment
pgsql-hackers by date: