Thread: Amcheck: do rightlink verification with lock coupling
Hi! This is a thread to discuss amcheck feature started in other thread[0]. Currently amcheck is scanning every B-tree level. If verification is done with ShareLock - amcheck will test that each pageleftlink is pointing to page with rightlink backwards. This is important invariant, in our experience it proved to be good at detecting various corruptions. But this invariant can be detected only if both pages are not modified (e.g. split concurrently). PFA patch, that in case of suspicion locks two pages and retests that invariant. This allows detection of several corruptionson standby. This patch violates one of amcheck design principles: current code does not ever take more than one page lock. I do not know:should we hold this rule or should we use more deep check? Thanks! Best regards, Andrey Borodin. [0] https://www.postgresql.org/message-id/flat/DA9B33AC-53CB-4643-96D4-7A0BBC037FA1@yandex-team.ru
Attachment
On 2019-Sep-12, Andrey Borodin wrote: > This patch violates one of amcheck design principles: current code > does not ever take more than one page lock. I do not know: should we > hold this rule or should we use more deep check? The check does seem worthwhile to me. As far as I know, in btree you can lock the right sibling of a page while keeping lock on the page itself, without risk of deadlock. So I'm not sure what's the issue with that. This is in the README: In most cases we release our lock and pin on a page before attempting to acquire pin and lock on the page we are moving to. In a few places it is necessary to lock the next page before releasing the current one. This is safe when moving right or up, but not when moving left or down (else we'd create the possibility of deadlocks). I suppose Peter was concerned about being able to run amcheck without causing any trouble at all for concurrent operation; maybe we can retain that property by making this check optional. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Sep 12, 2019 at 10:16:12AM -0300, Alvaro Herrera wrote: >On 2019-Sep-12, Andrey Borodin wrote: > >> This patch violates one of amcheck design principles: current code >> does not ever take more than one page lock. I do not know: should we >> hold this rule or should we use more deep check? > >The check does seem worthwhile to me. > >As far as I know, in btree you can lock the right sibling of a page >while keeping lock on the page itself, without risk of deadlock. So I'm >not sure what's the issue with that. This is in the README: > > In most cases we release our lock and pin on a page before attempting > to acquire pin and lock on the page we are moving to. In a few places > it is necessary to lock the next page before releasing the current one. > This is safe when moving right or up, but not when moving left or down > (else we'd create the possibility of deadlocks). > >I suppose Peter was concerned about being able to run amcheck without >causing any trouble at all for concurrent operation; maybe we can retain >that property by making this check optional. > Peter, any opinion on this proposed amcheck patch? In the other thread [1] you seemed to agree this is worth checking, and Alvaro's proposal to make this check optional seems like a reasonable compromise with respect to the locking. [1] https://www.postgresql.org/message-id/flat/DA9B33AC-53CB-4643-96D4-7A0BBC037FA1@yandex-team.ru -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Jan 10, 2020 at 5:45 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Peter, any opinion on this proposed amcheck patch? In the other thread > [1] you seemed to agree this is worth checking, and Alvaro's proposal to > make this check optional seems like a reasonable compromise with respect > to the locking. It's a good idea, and it probably doesn't even need to be made optional -- lock coupling to the right is safe on a primary, and should also be safe on standbys (though I should triple check the REDO routines to be sure). The patch only does lock coupling when it proves necessary, which ought to only happen when there is a concurrent page split, which ought to be infrequent. Maybe there is no need to compromise. I'm curious why Andrey's corruption problems were not detected by the cross-page amcheck test, though. We compare the first non-pivot tuple on the right sibling leaf page with the last one on the target page, towards the end of bt_target_page_check() -- isn't that almost as good as what you have here in practice? I probably would have added something like this myself earlier, if I had reason to think that verification would be a lot more effective that way. To be clear, I believe that Andrey wrote this patch for a reason -- I assume that it makes a noticeable and consistent difference. I would like to gain a better understanding of why that was for my own benefit, though. For example, it might be that page deletion was a factor that made the test I mentioned less effective. I care about the specifics. -- Peter Geoghegan
On Fri, Jan 10, 2020 at 06:49:33PM -0800, Peter Geoghegan wrote: >On Fri, Jan 10, 2020 at 5:45 PM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> Peter, any opinion on this proposed amcheck patch? In the other thread >> [1] you seemed to agree this is worth checking, and Alvaro's proposal to >> make this check optional seems like a reasonable compromise with respect >> to the locking. > >It's a good idea, and it probably doesn't even need to be made >optional -- lock coupling to the right is safe on a primary, and >should also be safe on standbys (though I should triple check the REDO >routines to be sure). The patch only does lock coupling when it proves >necessary, which ought to only happen when there is a concurrent page >split, which ought to be infrequent. Maybe there is no need to >compromise. > OK, that makes sense. >I'm curious why Andrey's corruption problems were not detected by the >cross-page amcheck test, though. We compare the first non-pivot tuple >on the right sibling leaf page with the last one on the target page, >towards the end of bt_target_page_check() -- isn't that almost as good >as what you have here in practice? I probably would have added >something like this myself earlier, if I had reason to think that >verification would be a lot more effective that way. > >To be clear, I believe that Andrey wrote this patch for a reason -- I >assume that it makes a noticeable and consistent difference. I would >like to gain a better understanding of why that was for my own >benefit, though. For example, it might be that page deletion was a >factor that made the test I mentioned less effective. I care about the >specifics. > Understood. Is that a reason to not commit of this patch now, though? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Jan 11, 2020 at 4:25 AM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > Understood. Is that a reason to not commit of this patch now, though? It could use some polishing. Are you interested in committing it? -- Peter Geoghegan
On Mon, Jan 13, 2020 at 03:49:40PM -0800, Peter Geoghegan wrote: >On Sat, Jan 11, 2020 at 4:25 AM Tomas Vondra ><tomas.vondra@2ndquadrant.com> wrote: >> Understood. Is that a reason to not commit of this patch now, though? > >It could use some polishing. Are you interested in committing it? > Not really - as a CFM I was trying to revive patches that seem in good shape but not moving. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Peter! Sorry for answering so long. > 11 янв. 2020 г., в 7:49, Peter Geoghegan <pg@bowt.ie> написал(а): > > I'm curious why Andrey's corruption problems were not detected by the > cross-page amcheck test, though. We compare the first non-pivot tuple > on the right sibling leaf page with the last one on the target page, > towards the end of bt_target_page_check() -- isn't that almost as good > as what you have here in practice? I probably would have added > something like this myself earlier, if I had reason to think that > verification would be a lot more effective that way. We were dealing with corruption caused by lost page update. Consider two pages: A->B If A is split into A` and C we get: A`->C->B But if update of A is lost we still have A->B, but B backward pointers points to C. B's smallest key is bigger than hikey of A`, but this do not violate cross-pages invariant. Page updates may be lost due to bug in backup software with incremental backups, bug in storage layer of Aurora-style system, bug in page cache, incorrect fsync error handling, bug in ssd firmware etc. And our data checksums do not detect this kind of corruption. BTW I think that it would be better if our checksums were not stored on a page itseft, they could detect this kind of faults. We were able to timely detect all those problems on primaries in our testing environment. But much later found out that some standbys were corrupted, the problem appeared only when they were promoted. Also, in nearby thread Grygory Rylov (g0djan) is trying to enable one more invariant in standby checks. Thanks! Best regards, Andrey Borodin.
> 14 янв. 2020 г., в 9:47, Andrey Borodin <x4mmm@yandex-team.ru> написал(а): > > Page updates may be lost due to bug in backup software with incremental > backups, bug in storage layer of Aurora-style system, bug in page cache, incorrect > fsync error handling, bug in ssd firmware etc. And our data checksums do not > detect this kind of corruption. BTW I think that it would be better if our > checksums were not stored on a page itseft, they could detect this kind of faults. Observed it just now. There is one HA cluster where a node was marked dead. This node was disconnected from cluster, but due to human error therewas postgres running. Node managed to install block-level incremental backup to the chain. And backup software did not detect that backup stepwas taken from part of timeline that was not in actual timeline's history. Result of restoration is: man-w%/%db R # select bt_index_check('%.pk_%'); bt_index_check ---------------- (1 row) Time: 1411.065 ms (00:01.411) man-w%/%db R # select patched_index_check('%.pk_%'); ERROR: XX002: left link/right link pair in index "pk_labels" not in agreement DETAIL: Block=42705 left block=42707 left link from block=45495. LOCATION: bt_recheck_block_rightlink, verify_nbtree.c:621 Time: 671.336 ms ('%' is replacing removed chars) I understand that this corruption was not introduced by postgres itself, but by combination of bug in two 3rd party toolsand human error. But I can imagine similar corruptions with different root causes. Best regards, Andrey Borodin.
On Mon, Jan 13, 2020 at 8:47 PM Andrey Borodin <x4mmm@yandex-team.ru> wrote: > > 11 янв. 2020 г., в 7:49, Peter Geoghegan <pg@bowt.ie> написал(а): > > I'm curious why Andrey's corruption problems were not detected by the > > cross-page amcheck test, though. We compare the first non-pivot tuple > > on the right sibling leaf page with the last one on the target page, > > towards the end of bt_target_page_check() -- isn't that almost as good > > as what you have here in practice? I probably would have added > > something like this myself earlier, if I had reason to think that > > verification would be a lot more effective that way. > > We were dealing with corruption caused by lost page update. Consider two pages: > A->B > If A is split into A` and C we get: > A`->C->B > But if update of A is lost we still have > A->B, but B backward pointers points to C. > B's smallest key is bigger than hikey of A`, but this do not violate > cross-pages invariant. > > Page updates may be lost due to bug in backup software with incremental > backups, bug in storage layer of Aurora-style system, bug in page cache, incorrect > fsync error handling, bug in ssd firmware etc. And our data checksums do not > detect this kind of corruption. BTW I think that it would be better if our > checksums were not stored on a page itseft, they could detect this kind of faults. I find this argument convincing. I'll try to get this committed soon. While you could have used bt_index_parent_check() or heapallindexed to detect the issue, those two options are a lot more expensive (plus the former option won't work on a standby). Relaxing the principle that says that we shouldn't hold multiple buffer locks concurrently doesn't seem like that much to ask for to get such a clear benefit. I think that this is safe, but page deletion/half-dead pages need more thought. In general, the target page could have become "ignorable" when rechecked. > We were able to timely detect all those problems on primaries in our testing > environment. But much later found out that some standbys were corrupted, > the problem appeared only when they were promoted. > Also, in nearby thread Grygory Rylov (g0djan) is trying to enable one more > invariant in standby checks. I looked at that thread just now, but Grygory didn't say why this true root check was particularly important, so I can't see much upside. Plus that seems riskier than what you have in mind here. Does it have something to do with the true root looking like a deleted page? The details matter. -- Peter Geoghegan
On Thu, Jan 16, 2020 at 5:11 PM Peter Geoghegan <pg@bowt.ie> wrote: > I find this argument convincing. I'll try to get this committed soon. > > While you could have used bt_index_parent_check() or heapallindexed to > detect the issue, those two options are a lot more expensive (plus the > former option won't work on a standby). Relaxing the principle that > says that we shouldn't hold multiple buffer locks concurrently doesn't > seem like that much to ask for to get such a clear benefit. Having looked into it some more, I now have my doubts about this patch. REDO routine code like btree_xlog_split() and btree_xlog_unlink_page() feel entitled to only lock one page at a time, which invalidates the assumption that we can couple locks on the leaf level to verify mutual agreement in left and right sibling links (with only an AccessShareLock on bt_index_check()'s target index relation). It would definitely be safe for bt_index_check() to so this were it not running in recovery mode, but that doesn't seem very useful on its own. I tried to come up with a specific example of how this could be unsafe, but my explanation was all over the place (this could have had something to do with it being Friday evening). Even still, it's up to the patch to justify why it's safe, and that seems even more difficult. -- Peter Geoghegan
On Fri, Jan 17, 2020 at 5:43 PM Peter Geoghegan <pg@bowt.ie> wrote: > I tried to come up with a specific example of how this could be > unsafe, but my explanation was all over the place (this could have had > something to do with it being Friday evening). Even still, it's up to > the patch to justify why it's safe, and that seems even more > difficult. I can't see a way around this problem, so I'm marking the patch rejected. Thanks -- Peter Geoghegan
Hi! > 23 янв. 2020 г., в 00:59, Peter Geoghegan <pg@bowt.ie> написал(а): > > On Fri, Jan 17, 2020 at 5:43 PM Peter Geoghegan <pg@bowt.ie> wrote: >> I tried to come up with a specific example of how this could be >> unsafe, but my explanation was all over the place (this could have had >> something to do with it being Friday evening). Even still, it's up to >> the patch to justify why it's safe, and that seems even more >> difficult. > > I can't see a way around this problem, so I'm marking the patch rejected. In this thread [0] we decided that lock coupling is necessary for btree_xlog_unlink_page(). So, maybe let's revive this patch? Thanks! Best regards, Andrey Borodin. [0] https://www.postgresql.org/message-id/flat/CAH2-Wzm7T_O%2BVUeo7%3D_NGPncs08z3JEybEwVLZGaASnbfg5vDA%40mail.gmail.com#a4ef597251fed0eb5c2896937bdbd0cc
On Mon, Jul 20, 2020 at 11:46 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > In this thread [0] we decided that lock coupling is necessary for btree_xlog_unlink_page(). > So, maybe let's revive this patch? Yes, let's do that. Can you resubmit it, please? Peter Geoghegan
> 4 авг. 2020 г., в 03:44, Peter Geoghegan <pg@bowt.ie> написал(а): > > On Mon, Jul 20, 2020 at 11:46 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: >> In this thread [0] we decided that lock coupling is necessary for btree_xlog_unlink_page(). >> So, maybe let's revive this patch? > > Yes, let's do that. Can you resubmit it, please? PFA v3. Changes: fixed few typos in comments. BTW, reviewing this patch again I cannot understand why we verify link coherence only on leaf level but not for internalpages? I do not see any actual problems here. Corruption detection power of leftlink-rightlinks on internal pages is diminishingly small compared to leaf pages. But thereseems to be no reason to exclude internal pages? Thanks! Best regards, Andrey Borodin.
Attachment
On Tue, Aug 4, 2020 at 9:33 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > BTW, reviewing this patch again I cannot understand why we verify link coherence only on leaf level but not for internalpages? > I do not see any actual problems here. Well, I thought that it might be a good idea to limit it to the leaf level, based on the theory that we rarely couple locks on internal page levels in general. But yeah, that's probably not a good enough reason to avoid lock coupling on internal pages. It's probably better to do it everywhere than to explain why we don't do it on the internal level -- the explanation will probably be confusing. And even if there was a performance issue, it could only happen when there are concurrent internal page splits -- but those are supposed to be rare. Attached is v4, which now checks internal pages (just like leaf pages). The main other change in this revised version is that we make the error raised by bt_index_check() match the error used in the old bt_index_parent_check() case -- we always want to blame the current target page when amcheck complains (technically the page we blame when the invariant fails isn't strictly guaranteed to be quite the same thing as the target, but it's close enough to not really matter in reality). Other adjustments: * Added _bt_checkpage() calls for buffers, as is standard practice in nbtree. * Added protection against locking the same page a second time in the event of a faulty sibling link -- we should avoid a self-deadlock in the event of a page that is corrupt in just the wrong way. * Updated obsolescent comments that claimed that we never couple buffer locks in amcheck. I would like to commit something like this in the next day or two. Thoughts? -- Peter Geoghegan
Attachment
> 6 авг. 2020 г., в 04:25, Peter Geoghegan <pg@bowt.ie> написал(а): > > * Added _bt_checkpage() calls for buffers, as is standard practice in nbtree. > > * Added protection against locking the same page a second time in the > event of a faulty sibling link -- we should avoid a self-deadlock in > the event of a page that is corrupt in just the wrong way. > > * Updated obsolescent comments that claimed that we never couple > buffer locks in amcheck. Cool, thanks! There's mintor typo: missing space in "of_bt_check_unique". > > I would like to commit something like this in the next day or two. > > Thoughts? Sounds great! Thanks! Best regards, Andrey Borodin.
On Wed, Aug 5, 2020 at 9:50 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > Sounds great! Thanks! I'm afraid that there is another problem, this time with btree_xlog_split(). It's possible to get false positives when running the new test continually on a standby. You can see this by running verification on a standby continually, while the primary runs with a workload that gets many page splits. It's easy to see if you apply this patch: --- a/src/backend/access/nbtree/nbtxlog.c +++ b/src/backend/access/nbtree/nbtxlog.c @@ -435,6 +435,9 @@ btree_xlog_split(bool newitemonleft, XLogReaderState *record) UnlockReleaseBuffer(lbuf); UnlockReleaseBuffer(rbuf); + /* trick */ + pg_usleep(10 * 1000L); + The only thing that we can do is adjust the locking in btree_xlog_split() to match the primary (kind of like commit 9a9db08a, except with page splits instead of page deletion). Attached is a revised version of the patch, along with the changes that we'd need to REDO to make the amcheck patch really work. I'm not sure if this change to the REDO routine is worth the overhead or trouble, though. I have to think about it some more. BTW, the first patch in the series now has a new check for page deletion -- that was missing from v4. -- Peter Geoghegan
Attachment
> 6 авг. 2020 г., в 21:38, Peter Geoghegan <pg@bowt.ie> написал(а): > > On Wed, Aug 5, 2020 at 9:50 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: >> Sounds great! Thanks! > > I'm afraid that there is another problem, this time with > btree_xlog_split(). It's possible to get false positives when running > the new test continually on a standby. You can see this by running > verification on a standby continually, while the primary runs with a > workload that gets many page splits. Yes, I see the problem... > > The only thing that we can do is adjust the locking in > btree_xlog_split() to match the primary (kind of like commit 9a9db08a, > except with page splits instead of page deletion). Attached is a > revised version of the patch, along with the changes that we'd need to > REDO to make the amcheck patch really work. > > I'm not sure if this change to the REDO routine is worth the overhead > or trouble, though. I have to think about it some more. If we want to check relations between pages we must either apply them together (under locks) or tolerate some fraction offalse positives. I understand that mitigating and tolerating false positives is nonsense in mathematica sense, but frompractical point of view it's just OK. But having complete solution with no false positives seems much better. > > BTW, the first patch in the series now has a new check for page > deletion -- that was missing from v4. Yes, seems like that was a bug.. Thanks! Best regards, Andrey Borodin.
On Thu, Aug 6, 2020 at 10:59 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > But having complete solution with no false positives seems much better. Agreed. I know that you didn't pursue this for no reason -- having the check available makes bt_check_index() a lot more valuable in practice. It detects what is actually a classic example of subtle B-Tree corruption (left link corruption), which appears in Modern B-Tree techniques in its discussion of corruption detection. It's actually the canonical example of how B-Tree corruption can be very subtle in the real world. I pushed a cleaned up version of this patch just now. I added some commentary about this canonical example in header comments for the new function. Thanks -- Peter Geoghegan
> 8 авг. 2020 г., в 23:14, Peter Geoghegan <pg@bowt.ie> написал(а): > > I pushed a cleaned up version of this patch just now. I added some > commentary about this canonical example in header comments for the new > function. Thanks for working on this! Best regards, Andrey Borodin.