btreecheck extension - Mailing list pgsql-hackers
From | Peter Geoghegan |
---|---|
Subject | btreecheck extension |
Date | |
Msg-id | CAM3SWZRtV+xmRWLWq6c-x7czvwavFdwFi4St1zz4dDgFH4yN4g@mail.gmail.com Whole thread Raw |
Responses |
Re: btreecheck extension
Re: btreecheck extension |
List | pgsql-hackers |
As discussed at the developer meeting at pgCon, I think that there is a lot to be said for a tool that checks nbtree index invariants on live systems. Attached prototype patch adds contrib extension, btreecheck. This extension provides SQL-callable functions for checking these conditions on nbtree indexes on live systems. This code generally works by using the same insertion ScanKey infrastructure that the B-Tree code itself uses in servicing all kinds of index scans. There are two major goals for this patch. In order of their importance in my mind: 1. Improve testing around the B-Tree code. Give 9.4 beta testers the tools needed to detect subtle errors in a semi-automated way. I'm quite encouraged by recent efforts along these lines by Jeff Janes, Heikki, and others. Beyond 9.4, I think that stress testing is likely to become indispensable as more complex performance optimizations are pursued. Personally, I see some good opportunities to improve the performance of the B-Tree code, and if those ideas are ever to be put into action, there is clearly a big need for automated stress-testing. I see this tool as a start. I'm not just talking about my recent work on sort support for text, and the idea of generalizing that; I think that there are many techniques that we could pursue for decreasing the I/O and CPU overhead of using B-Tree indexes that are just too risky to pursue today because of the subtle interactions of many moving parts. I'd like to lower the risk. 2. Allow users to sanity check indexes in production. This is not my immediate goal here, though; I will not add this patch to the next CommitFest. That's something to think about in more detail a bit later. Clearly there is a big demand for this kind of thing, though. Since I believe the B-Tree code changes of 9.4 are generally thought to be the most plausible source of serious bugs in 9.4 purely because of their subtlety and fundamental importance, it seemed valuable to get this prototype into the hands of beta testers as soon as possible. For that reason, I have not given some aspects of the design as much thought as I would have preferred. I haven't really come up with a plausible plan of attack for finding bugs in the B-Tree code. Rather, I've more or less just considered what invariant conditions exist that can be correctly verified without too much effort, and then implemented checks for each. I've also stuck to a design that only has the tool share lock a single B-Tree buffer at a time, and copy the contents into local memory before operating on that (relation level heavyweight locks are used in various different ways too), in a similar manner to contrib/pageinspect. Copying the entire page into local memory is probably the right thing to do even from a performance perspective, since in each case the entire page is inspected, which takes much longer than, say, a binary search. Even though it's supposed to be possible to run this on a live system without causing too much disruption, subtle buffer locking protocols were avoided (although there is a rationale for why things are okay in the module, that considers current buffer locking protocols particular to the B-Tree code). This is a principle that I think I can stick to, and would certainly prefer to stick to for obvious reasons. The checks performed (the entire set of invariant conditions checked by all SQL-callable functions in the patch) are currently: * That data items on each B-Tree page are in logical order. * That non-rightmost pages (which are the only pages that have a high key) have only data items that respect that high key as an upper bound. * That a down-link data item in a parent page points to a child page where the parent down-link data item is respected as a lower bound. * That a down-link data item in a parent page points to a child pages whose high key (if any) is equal to the next data item on the parent page (if any), while taking into account incomplete page splits and right-most-ness. * That a down-link data item in a parent page points to a child page whose right link (if any) points to the same block/page as pointed to by the next data item on the parent page (if any). In other words, the children and their parent are in agreement about the children being siblings, while taking into account incomplete page splits and right-most-ness. In other other words, the second phase of a page split where a downlink is inserted into the parent was sane (so this is similar to the previous invariant check listed). The two prior checks may not be sufficient, though, if we want to operate with weak, non-disruptive locks -- what about cousins and grandparents, and even great-grandparents and second cousins? It's a good start, though. * That on each level, starting at the true root level and working down the the leaf level (level 0), there is mutual agreement between siblings that they are siblings (that is, their right and left links comport). I won't go into locking protocols and their correctness here. That's something that I would like others to look over, though - there are fairly detailed comments. I would also like to coordinate with anyone who thinks they can come up with a good stress testing workload that introduces a lot of variability for the B-Tree code to deal with. I think it's appropriate that that general effort should mostly drive the development of this tool, and not the other way around. When you consider the difficulty of isolating the kind of bugs we're looking for here, a lower lock strength is not just a nice-to-have. I think it might actually be an important contribution to the effectiveness of the tool generally. I envisage tests running with a high amount of concurrency on big databases over days or even longer. Thoughts? -- Peter Geoghegan
Attachment
pgsql-hackers by date: