Home > mailing lists

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons - Mailing list pgsql-hackers

From	Heikki Linnakangas
Subject	Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
Date	March 16, 2019 16:17:21
Msg-id	55d61cee-a13b-5623-737b-5d06774a9c3a@iki.fi Whole thread
In response to	Re: Making all nbtree entries unique by having heap TIDs participatein comparisons (Peter Geoghegan <pg@bowt.ie>)
Responses	Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
List	pgsql-hackers

Tree view

On 16/03/2019 10:51, Peter Geoghegan wrote:
> On Sat, Mar 16, 2019 at 1:44 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>>> It would be nice if you could take a look at the amcheck "relocate"
>>> patch
>> When I started looking at this, I thought that "relocate" means "move".
>> So I thought that the new mode would actually move tuples, i.e. that it
>> would modify the index. That sounded horrible. Of course, it doesn't
>> actually do that. It just checks that each tuple can be re-found, or
>> "relocated", by descending the tree from the root. I'd suggest changing
>> the language to avoid that confusion.
> 
> Okay. What do you suggest? :-)

Hmm. "re-find", maybe? We use that term in a few error messages already, 
to mean something similar.

>> It seems like a nice way to catch all kinds of index corruption issues.
>> Although, we already check that the tuples are in order within the page.
>> Is it really necessary to traverse the tree for every tuple, as well?
>> Maybe do it just for the first and last item?
> 
> It's mainly intended as a developer option. I want it to be possible
> to detect any form of corruption, however unlikely. It's an
> adversarial mindset that will at least make me less nervous about the
> patch.

Fair enough.

At first, I thought this would be horrendously expensive, but thinking 
about it a bit more, nearby tuples will always follow the same path 
through the upper nodes, so it'll all be cached. So maybe it's not quite 
so bad.

>> I don't understand this. Can you give an example of this kind of
>> inconsistency?
> 
> The commit message gives an example, but I suggest trying it out for
> yourself. Corrupt the least significant key byte of a root page of a
> B-Tree using pg_hexedit. Say it's an index on a text column, then
> you'd corrupt the last byte to be something slightly wrong. Then, the
> only way to catch it is with "relocate" verification. You'll only miss
> a few tuples on a cousin page at the leaf level (those on either side
> of the high key that the corrupted separator key in the root was
> originally copied from).
>
> The regular checks won't catch this, because the keys are similar
> enough one level down. The "minus infinity" item is a kind of a blind
> spot, because we cannot do a parent check of its children, because we
> don't have the key (it's truncated when the item becomes a right page
> minus infinity item, during an internal page split).

Hmm. So, the initial situation would be something like this:

                  +-----------+
                  | 1: root   |
                  |           |
                  | -inf -> 2 |
                  | 20   -> 3 |
                  |           |
                  +-----------+

         +-------------+ +-------------+
         | 2: internal | | 3: internal |
         |             | |             |
         | -inf -> 4   | | -inf -> 6   |
         | 10   -> 5   | | 30   -> 7   |
         |             | |             |
         | hi: 20      | |             |
         +-------------+ +-------------+

+---------+ +---------+ +---------+ +---------+
| 4: leaf | | 5: leaf | | 6: leaf | | 7: leaf |
|         | |         | |         | |         |
| 1       | | 11      | | 21      | | 31      |
| 5       | | 15      | | 25      | | 35      |
| 9       | | 19      | | 29      | | 39      |
|         | |         | |         | |         |
| hi: 10  | | hi: 20  | | hi: 30  | |         |
+---------+ +---------+ +---------+ +---------+

Then, a cosmic ray changes the 20 on the root page to 18. That causes 
the the leaf tuple 19 to become non-re-findable; if you descend the for 
19, you'll incorrectly land on page 6. But it also causes the high key 
on page 2 to be different from the downlink on the root page. Wouldn't 
the existing checks catch this?

- Heikki

pgsql-hackers by date:

From: Dmitry Dolgov
Date: 16 March 2019, 16:14:20
Subject: Re: Index Skip Scan

From: Tom Lane
Date: 16 March 2019, 16:20:01
Subject: Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?

Re: Making all nbtree entries unique by having heap TIDs participatein comparisons - Mailing list pgsql-hackers

Previous

Next