amcheck prototype - Mailing list pgsql-hackers

From Peter Geoghegan
Subject amcheck prototype
Date
Msg-id CAM3SWZSRS2kS1Npdwn1V-BvSHBcNrzXUZD=pxXhL3Gqf7k-z5Q@mail.gmail.com
Whole thread Raw
Responses Re: amcheck prototype  (Peter Geoghegan <pg@heroku.com>)
Re: amcheck prototype  (Peter Geoghegan <pg@heroku.com>)
List pgsql-hackers
Attached is a revision of what I previously called btreecheck, which
is now renamed to amcheck.

This is not 9.5 material - I already have 3 bigger patches in the
queue, 2 of which are large and complex and have major controversies,
and one of which has details that need to be worked out, which is
currently consuming a lot of reviewer time. There seems to be little
point in trying to get amcheck into shape for 9.5. The goals for this
as a real patch need to be worked out in greater detail. At some point
we'll need to have a discussion around both stress-testing (as a way
of finding bugs) and allowing users to verify indexes on production
systems when corruption is suspected. Since, as far as I know, no one
else has so much as applied and compiled my ON CONFLICT UPDATE patch,
it would be pretty senseless of me to add another patch to the queue.
Reviewers are clearly more overburdened than ever.

Anyway, this revision adds the ability to check invariants across
pages (that a page's right-link comports with the target page's last
item, since when targeting a particular page there is no locally
available "next" item to check the last item against, other than the
page highkey).  This even occurs for the index check user callable SQL
function that only acquire an AccessShareLock (bt_index_verify() and
bt_page_verify()). As before, it also exhaustively tests certain other
related invariants previously described [1], without really
considering their plausibility as either bugs in the B-Tree code, or
things likely to be violated in the event of organic data corruption.
In other words, I could probably stand to be considerably more
selective in what I'm testing, but in order to do that I'd need to
make up my mind about my exact goals for this tool.

amcheck is something that I thought might find bugs in approach #1 to
value locking [2] (for the ON CONFLICT UPDATE patch). However,
extensive stress testing while constantly using the tool has not
revealed any bugs. That doesn't mean that they're not there, of
course, and it doesn't really alter our understanding of approach #1,
but it's worth mentioning.

Anyway, this is presented here in the hope that it will be useful for
testing other patches, and perhaps even in testing corruption on
production systems (with appropriate precautions taken - this is still
a prototype patch - but it's also still the only thing of its kind). I
post this with the expectation that it won't make it into contrib
until PostgreSQL 9.6, or whatever we end up calling it. It might be
that someone has some feedback that allows me to build a better
temporary prototype (certainly, some testing tools were maintained out
of git for a while in the past, such as the precursor to isolation
tester), but I don't expect even that. If no one wants to do anything
with this patch in the foreseeable future (probably the current
cycle), there may still be some value in dumping my progress here. As
I said, I tend to think that its biggest problem right now is that
it's just too scatter gun, but that's probably appropriate for an
early iteration.

In general, I think we could prevent a lot of bugs by performing
targeted stress-testing with custom tools. Ideally, this tool would go
on to provide a way of doing for several different areas of the code.

[1] http://www.postgresql.org/message-id/CAM3SWZRtV+xmRWLWq6c-x7czvwavFdwFi4St1zz4dDgFH4yN4g@mail.gmail.com
[2] https://wiki.postgresql.org/wiki/Value_locking#.231._Heavyweight_page_locking_.28Peter_Geoghegan.29
--
Peter Geoghegan

Attachment

pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Increasing test coverage of WAL redo functions
Next
From: Pavel Stehule
Date:
Subject: Re: proposal: plpgsql - Assert statement