Re: Making all nbtree entries unique by having heap TIDs participatein comparisons - Mailing list pgsql-hackers

From Peter Geoghegan
Subject Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
Date
Msg-id CAH2-Wzko7dc9+7nzfKOJxYy00PXS3NsQc5XDt+dSeRgAYLDBkg@mail.gmail.com
Whole thread Raw
In response to Re: Making all nbtree entries unique by having heap TIDs participatein comparisons  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
Re: Making all nbtree entries unique by having heap TIDs participatein comparisons
List pgsql-hackers
On Sat, Mar 16, 2019 at 9:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Hmm. "re-find", maybe? We use that term in a few error messages already,
> to mean something similar.

WFM.

> At first, I thought this would be horrendously expensive, but thinking
> about it a bit more, nearby tuples will always follow the same path
> through the upper nodes, so it'll all be cached. So maybe it's not quite
> so bad.

That's deliberate, though you could call bt_relocate_from_root() from
anywhere else if you wanted to. It's a bit like a big nested loop
join, where the inner side has locality.

> Then, a cosmic ray changes the 20 on the root page to 18. That causes
> the the leaf tuple 19 to become non-re-findable; if you descend the for
> 19, you'll incorrectly land on page 6. But it also causes the high key
> on page 2 to be different from the downlink on the root page. Wouldn't
> the existing checks catch this?

No, the existing checks will not check that. I suppose something
closer to the existing approach *could* detect this issue, by making
sure that the "seam of identical high keys" from the root to the leaf
are a match, but we don't use the high key outside of its own page.
Besides, there is something useful about having the code actually rely
on _bt_search().

There are other similar cases, where it's not obvious how you can do
verification without either 1) crossing multiple levels, or 2)
retaining a "low key" as well as a high key (this is what Goetz Graefe
calls "retaining fence keys to solve the cousin verification problem"
in Modern B-Tree Techniques). What if the corruption was in the leaf
page 6 from your example, which had a 20 at the start? We wouldn't
have compared the downlink from the parent to the child, because leaf
page 6 is the leftmost child, and so we only have "-inf". The lower
bound actually comes from the root page, because we truncate "-inf"
attributes during page splits, even though we don't have to. Most of
the time they're not "absolute minus infinity" -- they're "minus
infinity in this subtree".

Maybe you could actually do something with the high key from leaf page
5 to detect the stray value "20" in leaf page 6, but again, we don't
do anything like that right now.

-- 
Peter Geoghegan


pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Feature improvement: can we add queryId for pg_catalog.pg_stat_activity view?
Next
From: Peter Geoghegan
Date:
Subject: Re: Making all nbtree entries unique by having heap TIDs participatein comparisons