Re: Amcheck: do rightlink verification with lock coupling - Mailing list pgsql-hackers

From Andrey Borodin
Subject Re: Amcheck: do rightlink verification with lock coupling
Date
Msg-id F7527087-6E95-4077-B964-D2CAFEF6224B@yandex-team.ru
In response to Re: Amcheck: do rightlink verification with lock coupling  (Peter Geoghegan <pg@bowt.ie>)
Responses Re: Amcheck: do rightlink verification with lock coupling  (Andrey Borodin <x4mmm@yandex-team.ru>)
Re: Amcheck: do rightlink verification with lock coupling  (Peter Geoghegan <pg@bowt.ie>)
List pgsql-hackers
Hi Peter! Sorry for taking so long to reply.

> On 11 Jan 2020, at 7:49, Peter Geoghegan <pg@bowt.ie> wrote:
>
> I'm curious why Andrey's corruption problems were not detected by the
> cross-page amcheck test, though. We compare the first non-pivot tuple
> on the right sibling leaf page with the last one on the target page,
> towards the end of bt_target_page_check() -- isn't that almost as good
> as what you have here in practice? I probably would have added
> something like this myself earlier, if I had reason to think that
> verification would be a lot more effective that way.

We were dealing with corruption caused by a lost page update. Consider two pages:
A->B
If A is split into A` and C, we get:
A`->C->B
But if the update of A is lost, we still have
A->B, while B's backward pointer points to C.
B's smallest key is bigger than the hikey of A`, but this does not violate
the cross-page invariant.
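
To make the scenario concrete, here is a minimal sketch of it as a toy model in Python. It is not amcheck code: the Page class, the key values, and both check functions are hypothetical simplifications, and the model ignores concurrency entirely. It only shows that a key-order check between neighbours can pass while a sibling-link check fails.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Page:
    """Toy leaf page (hypothetical model, not the real on-disk layout)."""
    keys: List[int]          # sorted keys stored on the page
    left: Optional[str]      # name of the left sibling, if any
    right: Optional[str]     # name of the right sibling, if any


def build_disk(lost_update: bool) -> Dict[str, Page]:
    """State after A is split into A` and C; optionally A's write is lost."""
    disk = {
        "C": Page(keys=[3, 4], left="A", right="B"),
        "B": Page(keys=[10, 11], left="C", right=None),
    }
    if lost_update:
        # Stale pre-split image of A: still holds all keys, still points to B.
        disk["A"] = Page(keys=[1, 2, 3, 4], left=None, right="B")
    else:
        # A` as it should look after the split.
        disk["A"] = Page(keys=[1, 2], left=None, right="C")
    return disk


def cross_page_ok(disk: Dict[str, Page], name: str) -> bool:
    """Key order only: last key on the page <= first key on its right sibling."""
    page = disk[name]
    if page.right is None:
        return True
    return page.keys[-1] <= disk[page.right].keys[0]


def sibling_links_ok(disk: Dict[str, Page], name: str) -> bool:
    """Link agreement: the right sibling's left link must point back at us."""
    page = disk[name]
    if page.right is None:
        return True
    return disk[page.right].left == name


corrupted = build_disk(lost_update=True)
print(cross_page_ok(corrupted, "A"))     # key order still looks fine
print(sibling_links_ok(corrupted, "A"))  # B points back to C, not A
```

In this model the stale A still satisfies the key-order check against B, and only the comparison of B's left link with A exposes the lost update.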

Page updates may be lost due to a bug in backup software with incremental
backups, a bug in the storage layer of an Aurora-style system, a bug in the page cache, incorrect
fsync error handling, a bug in SSD firmware, etc. And our data checksums do not
detect this kind of corruption. BTW, I think it would be better if our
checksums were not stored on the page itself; then they could detect this kind of fault.
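
A small sketch of why the checksum's location matters, under stated assumptions: the 4-byte CRC32 prefix, the page payloads, and the external_store dictionary are all hypothetical and do not reflect PostgreSQL's page format or any existing feature. A stale page image carries its own, internally valid checksum, so only a checksum kept outside the page can notice that the newest write never arrived.

```python
import zlib


def make_page(payload: bytes) -> bytes:
    """Build a toy page: 4-byte CRC32 of the payload, then the payload."""
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def in_page_checksum_ok(page: bytes) -> bool:
    """Verify the checksum stored inside the page against its payload."""
    return int.from_bytes(page[:4], "big") == zlib.crc32(page[4:])


old_image = make_page(b"A: keys 1..4, right=B")    # before the split
new_image = make_page(b"A`: keys 1..2, right=C")   # after the split

# Hypothetical out-of-page store, updated whenever a page write is issued.
external_store = {"A": zlib.crc32(new_image)}

# The write of the new image is lost, so the old image stays on disk.
on_disk = old_image

print(in_page_checksum_ok(on_disk))                 # stale page is self-consistent
print(zlib.crc32(on_disk) == external_store["A"])   # external checksum disagrees
```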

We were able to detect all those problems on primaries in our testing
environment in a timely manner. But much later we found out that some standbys were corrupted;
the problem appeared only when they were promoted.
Also, in a nearby thread Grygory Rylov (g0djan) is trying to enable one more
invariant in standby checks.


Thanks!

Best regards, Andrey Borodin.

