Re: HOT chain validation in verify_heapam() - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: HOT chain validation in verify_heapam() |
Date | |
Msg-id | CA+TgmoY+d7ZUr4+vFVYj6a836Npp3t=2NZn+RbOF3OFKr1S7nA@mail.gmail.com Whole thread Raw |
In response to | Re: HOT chain validation in verify_heapam() (Andres Freund <andres@anarazel.de>) |
Responses |
Re: HOT chain validation in verify_heapam()
Re: HOT chain validation in verify_heapam() |
List | pgsql-hackers |
On Wed, Nov 9, 2022 at 5:08 PM Andres Freund <andres@anarazel.de> wrote: > I don't really understand this logic - why can't we populate the predecessor > array, if we can construct a successor entry? This whole thing was my idea, so let me try to explain. I think the naming and comments need work, but I believe the fundamental idea may be sound. successor[x] = y means that when we looked at line pointer x, we saw that it was either a redirect to line pointer y, or else it had storage and the associated tuple's CTID pointed to line pointer y. At this point, we do not have any idea whether y is at all sane, nor we do we know anything about which of x and y is larger. Furthermore, it is possible that successor[x] = successor[x'] since the page might be corrupted and we haven't checked otherwise. predecessor[y] = x means that successor[x] = y but in addition we've checked that y is sane, and that x.xmax=y.xmin. If there are multiple tuples for which these conditions hold, we've issued complaints about all but one and entered the last into the predecessor array. An earlier version of the algorithm had only a predecessor[] array but the code to try to populate in a single pass was complex and looked ugly and error-prone. To set a predecessor entry in one step, we had to sanity-check each of x and y but only if that hadn't yet been done, which was quite awkward. For example, imagine line pointers 1 and 2 both point to 3, and line pointer 3 points backward to line pointer 1 (because of corruption, since it shouldn't ever be circular). We can't reason about the relationship between 1 and 3 without first making sure that each one is sane in isolation. But if we do that when we're at line pointer 1, then when we get to 2, we need to check 2 but don't need to recheck 3, and when we get to 3 we need to recheck neither 3 nor 1. This algorithm lets us run through and do all the basic sanity checks first, while populating the successor array, and then check relationships in later stages. Part of the motivation here is also driven by trying to figure out how to word the complaints. We have a dedicated field in the amcheck that can hold one tuple offset or the other, but if we're checking the relationships between tuples, what do we put there? I feel it will be easiest to understand if we put the offset of the older tuple in that field and then phrase the complaint as the patch does, e.g.: tuple with uncommitted xmin %u was updated to produce a tuple at offset %u with differing xmin %u We could flip that around and put the newer tuple offset in the field and then phrase the complaint the other way around, but it seems a bit awkward, e.g.: tuple with uncommited xmin %u at offset %u was updated to produce this tuple with differing xmin %u I think if we did do it that way around (and figured out how to phrase the messages) we might not need both arrays any more (though I'm not positive about that). It seems hard to avoid needing at least one, else you can't explicitly notice two converging HOT chains, which seems like a case we probably ought to notice. But the "to produce this tuple" phrasing is just confusing, I think, and removing "this" doesn't help. You need to somehow get people to understand that the offset they probably saw in another field is the second tuple, not the first one. Maybe: xmin %u does not match xmax %u of prior tuple at offset %u Hmm. Anyway, whether it was the right idea or not, the desire to have the earlier tuple be the focus of the error messages was part of the motivation here. > I'm doubtful it's a good idea to try to validate the 9.4 case. The likelihood > of getting that right seems low and I don't see us gaining much by even trying. I agree with Peter. We have to try to get that case right. If we can eventually eliminate it as a valid case by some mechanism, hooray. But in the meantime we have to deal with it as best we can. -- Robert Haas EDB: http://www.enterprisedb.com
pgsql-hackers by date: