Re: HOT chain validation in verify_heapam() - Mailing list pgsql-hackers

From Robert Haas
Subject Re: HOT chain validation in verify_heapam()
Date
Msg-id CA+TgmoY+QZLsB9m+JgmFHiVonhE_utc+6dWgYZjCwXa0-=SSng@mail.gmail.com
Whole thread Raw
In response to Re: HOT chain validation in verify_heapam()  (Andres Freund <andres@anarazel.de>)
Responses Re: HOT chain validation in verify_heapam()
List pgsql-hackers
On Mon, Nov 14, 2022 at 5:02 PM Andres Freund <andres@anarazel.de> wrote:
> On 2022-11-14 14:27:54 -0500, Robert Haas wrote:
> > On Wed, Nov 9, 2022 at 5:08 PM Andres Freund <andres@anarazel.de> wrote:
> > > I don't really understand this logic - why can't we populate the predecessor
> > > array, if we can construct a successor entry?
> >
> > This whole thing was my idea, so let me try to explain. I think the
> > naming and comments need work, but I believe the fundamental idea may
> > be sound.
> >
> > successor[x] = y means that when we looked at line pointer x, we saw
> > that it was either a redirect to line pointer y, or else it had
> > storage and the associated tuple's CTID pointed to line pointer y.
>
> > At this point, we do not have any idea whether y is at all sane, nor we do
> > we know anything about which of x and y is larger.
>
> What do you mean with "larger" here?

Numerically bigger. As in, a redirect line pointer or CTID will most
commonly point to a tuple that appears later in the line pointer
array, because we assign offset numbers in ascending order. But we
also reuse line pointers, so it's possible for a redirect line pointer
or CTID to point backwards to a lower-number offset.

> > Furthermore, it is
> > possible that successor[x] = successor[x'] since the page might be corrupted
> > and we haven't checked otherwise.
> >
> > predecessor[y] = x means that successor[x] = y but in addition we've
> > checked that y is sane, and that x.xmax=y.xmin. If there are multiple
> > tuples for which these conditions hold, we've issued complaints about
> > all but one and entered the last into the predecessor array.
>
> As shown by the isolationtester test I just posted, this doesn't quite work
> right now. Probably fixable.
>
> I don't think we can follow non-HOT ctid chains if they're older than the xmin
> horizon, including all cases of xmin being frozen. There's just nothing
> guaranteeing that the tuples are actually "related".

Yeah, glad you caught that. I think it's clearly wrong to regard A and
B as related if B.xmin is frozen. It doesn't matter whether it's
"old-style" frozen where we actually put 2 into the xmin field, or
new-style frozen where we set hint bits. Because, even if it's
new-style frozen and the value actually stored in B.xmin is
numerically equal to A.xmax, it could be from a different epoch. If
it's from the same epoch, then something is corrupted, because when we
froze B we should have pruned A. But we have no way of knowing whether
that's the case, and shouldn't assume corruption.

> It seems like we should do a bit more validation within a chain of
> tuples. E.g. that no live tuple can follow an !DidCommit xmin?

I think this check is already present in stronger form. If we see a
!DidCommit xmin, the xmin of the next tuple in the chain not only
can't be committed, but had better be the same. See "tuple with
uncommitted xmin %u was updated to produce a tuple at offset %u with
differing xmin %u".

> I now think that the 9.4 specific reasoning is bogus in the first place. The
> patch says:
>
>                          * Add a line pointer offset to the predecessor array if xmax is
>                          * matching with xmin of next tuple (reaching via its t_ctid).
>                          * Prior to PostgreSQL 9.4, we actually changed the xmin to
>                          * FrozenTransactionId so we must add offset to predecessor
>                          * array(irrespective of xmax-xmin matching) if updated tuple xmin
>                          * is frozen, so that we can later do validation related to frozen
>                          * xmin. Raise corruption if we have two tuples having the same
>                          * predecessor.
>
> but it's simply not correct to iterate through xmin=FrozenTransactionId - as
> shown in the isolationtester test. And that's unrelated to 9.4, because we
> couldn't rely on the raw xmin value either, because even if they match, they
> could be from different epochs.

I agree completely.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: Error-safe user functions
Next
From: Nathan Bossart
Date:
Subject: Re: archive modules