On Tue, Aug 13, 2019 at 5:17 AM Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> We have a bunch of internal testing HA clusters that suffered from corruption conditions.
> We fixed everything that can be detected with parent-check on primaries or usual check on standbys.
> (page updates were lost both on primary and during WAL replay)
> But from time to time when clusters switch primary from one availability zone to another we observe
> "right sibling's left-link doesn't match: block 32709 links to 37022 instead of expected 40953 in index"
That sounds like an issue caused by a failure to replay all available
WAL, where only one page happened to get written out by a checkpoint
before a crash. It's something like that. That wouldn't be caught by
the cross-page bt_index_check() check that we do already.
> We are going to search for these clusters with this [0] tolerating possible fraction of false positives, we have them
anyway.
> But I think I could put some effort into making corruption-detection tooling better.
> I think if we observe links discrepancy, we can acquire lock of left and right pages and recheck.
That's one possibility. When I first designed amcheck it was important
to be conservative, so I invented a general rule about never acquiring
multiple buffer locks at once. I still think that that was the correct
decision for the bt_downlink_check() check (the main extra
bt_index_parent_check() check), but I think that you're right about
retrying to verify the sibling links when bt_index_check() is called
from SQL.
nbtree will often "couple" buffer locks on the leaf level; it will
acquire a lock on a leaf page, and not release that lock until it has
also acquired a lock on the right sibling page (I'm mostly thinking of
_bt_stepright()). I am in favor of a patch that makes amcheck perform
sibling link verification within bt_index_check(), by retrying while
pessimistically coupling buffer locks. (Though I think that that
should just happen on the leaf level. We should not try to be too
clever about ignorable/half-dead/deleted pages, to be conservative.)
--
Peter Geoghegan