Re: HOT chain validation in verify_heapam() - Mailing list pgsql-hackers

From Robert Haas
Subject Re: HOT chain validation in verify_heapam()
Date
Msg-id CA+TgmoY+d7ZUr4+vFVYj6a836Npp3t=2NZn+RbOF3OFKr1S7nA@mail.gmail.com
Whole thread Raw
In response to Re: HOT chain validation in verify_heapam()  (Andres Freund <andres@anarazel.de>)
Responses Re: HOT chain validation in verify_heapam()
Re: HOT chain validation in verify_heapam()
List pgsql-hackers
On Wed, Nov 9, 2022 at 5:08 PM Andres Freund <andres@anarazel.de> wrote:
> I don't really understand this logic - why can't we populate the predecessor
> array, if we can construct a successor entry?

This whole thing was my idea, so let me try to explain. I think the
naming and comments need work, but I believe the fundamental idea may
be sound.

successor[x] = y means that when we looked at line pointer x, we saw
that it was either a redirect to line pointer y, or else it had
storage and the associated tuple's CTID pointed to line pointer y. At
this point, we do not have any idea whether y is at all sane, nor we
do we know anything about which of x and y is larger. Furthermore, it
is possible that successor[x] = successor[x'] since the page might be
corrupted and we haven't checked otherwise.

predecessor[y] = x means that successor[x] = y but in addition we've
checked that y is sane, and that x.xmax=y.xmin. If there are multiple
tuples for which these conditions hold, we've issued complaints about
all but one and entered the last into the predecessor array.

An earlier version of the algorithm had only a predecessor[] array but
the code to try to populate in a single pass was complex and looked
ugly and error-prone. To set a predecessor entry in one step, we had
to sanity-check each of x and y but only if that hadn't yet been done,
which was quite awkward. For example, imagine line pointers 1 and 2
both point to 3, and line pointer 3 points backward to line pointer 1
(because of corruption, since it shouldn't ever be circular). We can't
reason about the relationship between 1 and 3 without first making
sure that each one is sane in isolation. But if we do that when we're
at line pointer 1, then when we get to 2, we need to check 2 but don't
need to recheck 3, and when we get to 3 we need to recheck neither 3
nor 1. This algorithm lets us run through and do all the basic sanity
checks first, while populating the successor array, and then check
relationships in later stages.

Part of the motivation here is also driven by trying to figure out how
to word the complaints. We have a dedicated field in the amcheck that
can hold one tuple offset or the other, but if we're checking the
relationships between tuples, what do we put there? I feel it will be
easiest to understand if we put the offset of the older tuple in that
field and then phrase the complaint as the patch does, e.g.:

tuple with uncommitted xmin %u was updated to produce a tuple at
offset %u with differing xmin %u

We could flip that around and put the newer tuple offset in the field
and then phrase the complaint the other way around, but it seems a bit
awkward, e.g.:

tuple with uncommited xmin %u at offset %u was updated to produce this
tuple with differing xmin %u

I think if we did do it that way around (and figured out how to phrase
the messages) we might not need both arrays any more (though I'm not
positive about that). It seems hard to avoid needing at least one,
else you can't explicitly notice two converging HOT chains, which
seems like a case we probably ought to notice. But the "to produce
this tuple" phrasing is just confusing, I think, and removing "this"
doesn't help. You need to somehow get people to understand that the
offset they probably saw in another field is the second tuple, not the
first one. Maybe:

xmin %u does not match xmax %u of prior tuple at offset %u

Hmm.

Anyway, whether it was the right idea or not, the desire to have the
earlier tuple be the focus of the error messages was part of the
motivation here.

> I'm doubtful it's a good idea to try to validate the 9.4 case. The likelihood
> of getting that right seems low and I don't see us gaining much by even trying.

I agree with Peter. We have to try to get that case right. If we can
eventually eliminate it as a valid case by some mechanism, hooray. But
in the meantime we have to deal with it as best we can.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: Add sub-transaction overflow status in pg_stat_activity
Next
From: Robert Haas
Date:
Subject: Re: Add sub-transaction overflow status in pg_stat_activity