Thread: Re: pgsql: amcheck: Fix verify_heapam for tuples where xmin or xmax is 0.
On Sat, Mar 25, 2023 at 6:25 PM Peter Geoghegan <pg@bowt.ie> wrote: > On Fri, Mar 24, 2023 at 8:13 AM Robert Haas <rhaas@postgresql.org> wrote: > > If we're checking xmin and find that it is invalid (i.e. 0) just > > report that as corruption, similar to what's already done in the > > three cases that seem correct. If we're checking xmax and find > > that's invalid, that's fine: it just means that the tuple hasn't > > been updated or deleted. > > What about aborted speculative insertions? See > heap_abort_speculative(), which directly sets the speculatively > inserted heap tuple's xmin to InvalidTransactionId/zero. Oh, dear. I didn't know about that case. > It probably does make sense to keep something close to this check -- > it just needs to account for speculative insertions to avoid false > positive reports of corruption. We could perform cross-checks against > a tuple whose xmin is InvalidTransactionId/zero to verify that it > really is from an aborted speculative insertion, to the extent that > that's possible. For example, such a tuple can't be a heap-only tuple, > and it can't have any xmax value other than InvalidTransactionId/zero. Since this was back-patched, I think it's probably better to just remove the error. We can introduce new validation if we want, but that should probably be master-only. -- Robert Haas EDB: http://www.enterprisedb.com
Re: pgsql: amcheck: Fix verify_heapam for tuples where xmin or xmax is 0.
From
Peter Geoghegan
Date:
On Mon, Mar 27, 2023 at 10:17 AM Robert Haas <robertmhaas@gmail.com> wrote: > > What about aborted speculative insertions? See > > heap_abort_speculative(), which directly sets the speculatively > > inserted heap tuple's xmin to InvalidTransactionId/zero. > > Oh, dear. I didn't know about that case. A big benefit of having extensive amcheck coverage is that it effectively centralizes information about the on-disk format, in an easy to understand way, and (over time) puts things on a more rigorous footing. Now it'll be a lot harder for somebody else to overlook that case in the future, which is good. Things are trending in the right direction. > > It probably does make sense to keep something close to this check -- > > it just needs to account for speculative insertions to avoid false > > positive reports of corruption. We could perform cross-checks against > > a tuple whose xmin is InvalidTransactionId/zero to verify that it > > really is from an aborted speculative insertion, to the extent that > > that's possible. For example, such a tuple can't be a heap-only tuple, > > and it can't have any xmax value other than InvalidTransactionId/zero. > > Since this was back-patched, I think it's probably better to just > remove the error. We can introduce new validation if we want, but that > should probably be master-only. That makes sense. I don't think that it's particularly likely that having refined aborted speculative insertion amcheck coverage will make a critical difference to any user, at any time. But "amcheck as documentation of the on-disk format" is reason enough to have it. -- Peter Geoghegan
On Mon, Mar 27, 2023 at 2:34 PM Peter Geoghegan <pg@bowt.ie> wrote: > > Since this was back-patched, I think it's probably better to just > > remove the error. We can introduce new validation if we want, but that > > should probably be master-only. > > That makes sense. Patch attached. > I don't think that it's particularly likely that having refined > aborted speculative insertion amcheck coverage will make a critical > difference to any user, at any time. But "amcheck as documentation of > the on-disk format" is reason enough to have it. Sure, if someone feels like writing the code. I have to admit that I have mixed feelings about this whole direction. In concept, I agree with you entirely: a fringe benefit of having checks that tell us whether or not a page is valid is that it helps to make clear what page states we think are valid. In practice, however, the point you raise in your first sentence weighs awfully heavily with me. Spending a lot of energy on checks that are unlikely to catch practical problems feels like it may not be the best use of time. I'm not sure exactly where to draw the line, but it seems highly likely to be that there are things we could deduce about the page that wouldn't be worth the effort. For example, would we bother checking that a tuple with an in-progress xmin does not have a smaller natts value than a tuple with a committed xmin? Or that natts values are non-decreasing across a HOT chain? I suspect there are even more obscure examples of things that should be true but might not really be worth worrying about in the code. -- Robert Haas EDB: http://www.enterprisedb.com
Attachment
Re: pgsql: amcheck: Fix verify_heapam for tuples where xmin or xmax is 0.
From
Peter Geoghegan
Date:
On Mon, Mar 27, 2023 at 1:17 PM Robert Haas <robertmhaas@gmail.com> wrote: > Patch attached. This is fine, as far as it goes. Obviously it fixes the immediate problem. > > I don't think that it's particularly likely that having refined > > aborted speculative insertion amcheck coverage will make a critical > > difference to any user, at any time. But "amcheck as documentation of > > the on-disk format" is reason enough to have it. > > Sure, if someone feels like writing the code. I have to admit that I > have mixed feelings about this whole direction. In concept, I agree > with you entirely: a fringe benefit of having checks that tell us > whether or not a page is valid is that it helps to make clear what > page states we think are valid. I don't think that it's a fringe benefit; it's just not necessarily of direct benefit to amcheck users. Before the HOT chain validation patch went in, it was unclear whether certain conceivable on-disk states should constitute corruption. In particular, it wasn't clear to anybody whether or not it was okay for an LP_REDIRECT to point to an LP_DEAD until recently (and probably other things besides that). I don't think that we should assume that the easy part is abstractly defining corruption, while the hard part is writing the tool to check for the corruption. Sometimes it is, but I think that it's often the other way around. > In practice, however, the point you > raise in your first sentence weighs awfully heavily with me. Spending > a lot of energy on checks that are unlikely to catch practical > problems feels like it may not be the best use of time. That definitely could be true, but I don't think that it's terribly much extra effort in most cases. > I'm not sure > exactly where to draw the line, but it seems highly likely to be that > there are things we could deduce about the page that wouldn't be worth > the effort. For example, would we bother checking that a tuple with an > in-progress xmin does not have a smaller natts value than a tuple with > a committed xmin? Or that natts values are non-decreasing across a HOT > chain? I suspect there are even more obscure examples of things that > should be true but might not really be worth worrying about in the > code. A related way of looking at it (that I also find appealing) is that it's often easier (far easier) to just have the check, and be done with it. Of course there is bound to be uncertainty about how useful any given check might be; we're looking for something that is theoretically never supposed to happen. Why not just assume that it might matter if it's not costing very much to check for it? This is quite a different mentality than the one we bring to core heapam code, where it's quite natural to just avoid strange corner cases in the on-disk format like the plague. The risk profile is totally different for amcheck code. Within amcheck, I'd rather go too far than not go far enough. -- Peter Geoghegan
On Mon, Mar 27, 2023 at 4:52 PM Peter Geoghegan <pg@bowt.ie> wrote: > This is fine, as far as it goes. Obviously it fixes the immediate problem. OK, I've committed and back-patched this fix to v14, just like the erroneous commit that created the issue. -- Robert Haas EDB: http://www.enterprisedb.com