btreecheck extension - Mailing list pgsql-hackers

From Peter Geoghegan
Subject btreecheck extension
Date
Msg-id CAM3SWZRtV+xmRWLWq6c-x7czvwavFdwFi4St1zz4dDgFH4yN4g@mail.gmail.com
Whole thread Raw
Responses Re: btreecheck extension
Re: btreecheck extension
List pgsql-hackers
As discussed at the developer meeting at pgCon, I think that there is
a lot to be said for a tool that checks nbtree index invariants on
live systems.

Attached prototype patch adds contrib extension, btreecheck. This
extension provides SQL-callable functions for checking these
conditions on nbtree indexes on live systems. This code generally
works by using the same insertion ScanKey infrastructure that the
B-Tree code itself uses in servicing all kinds of index scans.

There are two major goals for this patch. In order of their importance
in my mind:

1. Improve testing around the B-Tree code. Give 9.4 beta testers the
tools needed to detect subtle errors in a semi-automated way. I'm
quite encouraged by recent efforts along these lines by Jeff Janes,
Heikki, and others. Beyond 9.4, I think that stress testing is likely
to become indispensable as more complex performance optimizations are
pursued. Personally, I see some good opportunities to improve the
performance of the B-Tree code, and if those ideas are ever to be put
into action, there is clearly a big need for automated stress-testing.
I see this tool as a start. I'm not just talking about my recent work
on sort support for text, and the idea of generalizing that; I think
that there are many techniques that we could pursue for decreasing the
I/O and CPU overhead of using B-Tree indexes that are just too risky
to pursue today because of the subtle interactions of many moving
parts. I'd like to lower the risk.

2. Allow users to sanity check indexes in production. This is not my
immediate goal here, though; I will not add this patch to the next
CommitFest. That's something to think about in more detail a bit
later. Clearly there is a big demand for this kind of thing, though.

Since I believe the B-Tree code changes of 9.4 are generally thought
to be the most plausible source of serious bugs in 9.4 purely because
of their subtlety and fundamental importance, it seemed valuable to
get this prototype into the hands of beta testers as soon as possible.
For that reason, I have not given some aspects of the design as much
thought as I would have preferred. I haven't really come up with a
plausible plan of attack for finding bugs in the B-Tree code. Rather,
I've more or less just considered what invariant conditions exist that
can be correctly verified without too much effort, and then
implemented checks for each. I've also stuck to a design that only has
the tool share lock a single B-Tree buffer at a time, and copy the
contents into local memory before operating on that (relation level
heavyweight locks are used in various different ways too), in a
similar manner to contrib/pageinspect. Copying the entire page into
local memory is probably the right thing to do even from a performance
perspective, since in each case the entire page is inspected, which
takes much longer than, say, a binary search. Even though it's
supposed to be possible to run this on a live system without causing
too much disruption, subtle buffer locking protocols were avoided
(although there is a rationale for why things are okay in the module,
that considers current buffer locking protocols particular to the
B-Tree code). This is a principle that I think I can stick to, and
would certainly prefer to stick to for obvious reasons.

The checks performed (the entire set of invariant conditions checked
by all SQL-callable functions in the patch) are currently:

* That data items on each B-Tree page are in logical order.

* That non-rightmost pages (which are the only pages that have a high
key) have only data items that respect that high key as an upper
bound.

* That a down-link data item in a parent page points to a child page
where the parent down-link data item is respected as a lower bound.

* That a down-link data item in a parent page points to a child pages
whose high key (if any) is equal to the next data item on the parent
page (if any), while taking into account incomplete page splits and
right-most-ness.

* That a down-link data item in a parent page points to a child page
whose right link (if any) points to the same block/page as pointed to
by the next data item on the parent page (if any). In other words, the
children and their parent are in agreement about the children being
siblings, while taking into account incomplete page splits and
right-most-ness. In other other words, the second phase of a page
split where a downlink is inserted into the parent was sane (so this
is similar to the previous invariant check listed).

The two prior checks may not be sufficient, though, if we want to
operate with weak, non-disruptive locks -- what about cousins and
grandparents, and even great-grandparents and second cousins? It's a
good start, though.

* That on each level, starting at the true root level and working down
the the leaf level (level 0), there is mutual agreement between
siblings that they are siblings (that is, their right and left links
comport).

I won't go into locking protocols and their correctness here. That's
something that I would like others to look over, though - there are
fairly detailed comments. I would also like to coordinate with anyone
who thinks they can come up with a good stress testing workload that
introduces a lot of variability for the B-Tree code to deal with. I
think it's appropriate that that general effort should mostly drive
the development of this tool, and not the other way around.

When you consider the difficulty of isolating the kind of bugs we're
looking for here, a lower lock strength is not just a nice-to-have. I
think it might actually be an important contribution to the
effectiveness of the tool generally. I envisage tests running with a
high amount of concurrency on big databases over days or even longer.

Thoughts?
--
Peter Geoghegan

Attachment

pgsql-hackers by date:

Previous
From: Jeff Janes
Date:
Subject: Re: rm_desc signature
Next
From: Craig Ringer
Date:
Subject: Re: How to change the pgsql source code and build it??