Re: HOT chain validation in verify_heapam() - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: HOT chain validation in verify_heapam() |
Date | |
Msg-id | CA+TgmoY+QZLsB9m+JgmFHiVonhE_utc+6dWgYZjCwXa0-=SSng@mail.gmail.com Whole thread Raw |
In response to | Re: HOT chain validation in verify_heapam() (Andres Freund <andres@anarazel.de>) |
Responses |
Re: HOT chain validation in verify_heapam()
|
List | pgsql-hackers |
On Mon, Nov 14, 2022 at 5:02 PM Andres Freund <andres@anarazel.de> wrote: > On 2022-11-14 14:27:54 -0500, Robert Haas wrote: > > On Wed, Nov 9, 2022 at 5:08 PM Andres Freund <andres@anarazel.de> wrote: > > > I don't really understand this logic - why can't we populate the predecessor > > > array, if we can construct a successor entry? > > > > This whole thing was my idea, so let me try to explain. I think the > > naming and comments need work, but I believe the fundamental idea may > > be sound. > > > > successor[x] = y means that when we looked at line pointer x, we saw > > that it was either a redirect to line pointer y, or else it had > > storage and the associated tuple's CTID pointed to line pointer y. > > > At this point, we do not have any idea whether y is at all sane, nor we do > > we know anything about which of x and y is larger. > > What do you mean with "larger" here? Numerically bigger. As in, a redirect line pointer or CTID will most commonly point to a tuple that appears later in the line pointer array, because we assign offset numbers in ascending order. But we also reuse line pointers, so it's possible for a redirect line pointer or CTID to point backwards to a lower-number offset. > > Furthermore, it is > > possible that successor[x] = successor[x'] since the page might be corrupted > > and we haven't checked otherwise. > > > > predecessor[y] = x means that successor[x] = y but in addition we've > > checked that y is sane, and that x.xmax=y.xmin. If there are multiple > > tuples for which these conditions hold, we've issued complaints about > > all but one and entered the last into the predecessor array. > > As shown by the isolationtester test I just posted, this doesn't quite work > right now. Probably fixable. > > I don't think we can follow non-HOT ctid chains if they're older than the xmin > horizon, including all cases of xmin being frozen. There's just nothing > guaranteeing that the tuples are actually "related". Yeah, glad you caught that. I think it's clearly wrong to regard A and B as related if B.xmin is frozen. It doesn't matter whether it's "old-style" frozen where we actually put 2 into the xmin field, or new-style frozen where we set hint bits. Because, even if it's new-style frozen and the value actually stored in B.xmin is numerically equal to A.xmax, it could be from a different epoch. If it's from the same epoch, then something is corrupted, because when we froze B we should have pruned A. But we have no way of knowing whether that's the case, and shouldn't assume corruption. > It seems like we should do a bit more validation within a chain of > tuples. E.g. that no live tuple can follow an !DidCommit xmin? I think this check is already present in stronger form. If we see a !DidCommit xmin, the xmin of the next tuple in the chain not only can't be committed, but had better be the same. See "tuple with uncommitted xmin %u was updated to produce a tuple at offset %u with differing xmin %u". > I now think that the 9.4 specific reasoning is bogus in the first place. The > patch says: > > * Add a line pointer offset to the predecessor array if xmax is > * matching with xmin of next tuple (reaching via its t_ctid). > * Prior to PostgreSQL 9.4, we actually changed the xmin to > * FrozenTransactionId so we must add offset to predecessor > * array(irrespective of xmax-xmin matching) if updated tuple xmin > * is frozen, so that we can later do validation related to frozen > * xmin. Raise corruption if we have two tuples having the same > * predecessor. > > but it's simply not correct to iterate through xmin=FrozenTransactionId - as > shown in the isolationtester test. And that's unrelated to 9.4, because we > couldn't rely on the raw xmin value either, because even if they match, they > could be from different epochs. I agree completely. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: