Re: Making all nbtree entries unique by having heap TIDs participatein comparisons - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Making all nbtree entries unique by having heap TIDs participatein comparisons |
Date | |
Msg-id | 55d61cee-a13b-5623-737b-5d06774a9c3a@iki.fi Whole thread Raw |
In response to | Re: Making all nbtree entries unique by having heap TIDs participatein comparisons (Peter Geoghegan <pg@bowt.ie>) |
Responses |
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
|
List | pgsql-hackers |
On 16/03/2019 10:51, Peter Geoghegan wrote: > On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote: >>> It would be nice if you could take a look at the amcheck "relocate" >>> patch >> When I started looking at this, I thought that "relocate" means "move". >> So I thought that the new mode would actually move tuples, i.e. that it >> would modify the index. That sounded horrible. Of course, it doesn't >> actually do that. It just checks that each tuple can be re-found, or >> "relocated", by descending the tree from the root. I'd suggest changing >> the language to avoid that confusion. > > Okay. What do you suggest? :-) Hmm. "re-find", maybe? We use that term in a few error messages already, to mean something similar. >> It seems like a nice way to catch all kinds of index corruption issues. >> Although, we already check that the tuples are in order within the page. >> Is it really necessary to traverse the tree for every tuple, as well? >> Maybe do it just for the first and last item? > > It's mainly intended as a developer option. I want it to be possible > to detect any form of corruption, however unlikely. It's an > adversarial mindset that will at least make me less nervous about the > patch. Fair enough. At first, I thought this would be horrendously expensive, but thinking about it a bit more, nearby tuples will always follow the same path through the upper nodes, so it'll all be cached. So maybe it's not quite so bad. >> I don't understand this. Can you give an example of this kind of >> inconsistency? > > The commit message gives an example, but I suggest trying it out for > yourself. Corrupt the least significant key byte of a root page of a > B-Tree using pg_hexedit. Say it's an index on a text column, then > you'd corrupt the last byte to be something slightly wrong. Then, the > only way to catch it is with "relocate" verification. You'll only miss > a few tuples on a cousin page at the leaf level (those on either side > of the high key that the corrupted separator key in the root was > originally copied from). > > The regular checks won't catch this, because the keys are similar > enough one level down. The "minus infinity" item is a kind of a blind > spot, because we cannot do a parent check of its children, because we > don't have the key (it's truncated when the item becomes a right page > minus infinity item, during an internal page split). Hmm. So, the initial situation would be something like this: +-----------+ | 1: root | | | | -inf -> 2 | | 20 -> 3 | | | +-----------+ +-------------+ +-------------+ | 2: internal | | 3: internal | | | | | | -inf -> 4 | | -inf -> 6 | | 10 -> 5 | | 30 -> 7 | | | | | | hi: 20 | | | +-------------+ +-------------+ +---------+ +---------+ +---------+ +---------+ | 4: leaf | | 5: leaf | | 6: leaf | | 7: leaf | | | | | | | | | | 1 | | 11 | | 21 | | 31 | | 5 | | 15 | | 25 | | 35 | | 9 | | 19 | | 29 | | 39 | | | | | | | | | | hi: 10 | | hi: 20 | | hi: 30 | | | +---------+ +---------+ +---------+ +---------+ Then, a cosmic ray changes the 20 on the root page to 18. That causes the the leaf tuple 19 to become non-re-findable; if you descend the for 19, you'll incorrectly land on page 6. But it also causes the high key on page 2 to be different from the downlink on the root page. Wouldn't the existing checks catch this? - Heikki
pgsql-hackers by date: