Thread: Re: pgsql: amcheck: Fix verify_heapam for tuples where xmin or xmax is 0.

Re: pgsql: amcheck: Fix verify_heapam for tuples where xmin or xmax is 0.

From: Robert Haas
On Sat, Mar 25, 2023 at 6:25 PM Peter Geoghegan <pg@bowt.ie> wrote:
> On Fri, Mar 24, 2023 at 8:13 AM Robert Haas <rhaas@postgresql.org> wrote:
> > If we're checking xmin and find that it is invalid (i.e. 0) just
> > report that as corruption, similar to what's already done in the
> > three cases that seem correct. If we're checking xmax and find
> > that's invalid, that's fine: it just means that the tuple hasn't
> > been updated or deleted.
>
> What about aborted speculative insertions? See
> heap_abort_speculative(), which directly sets the speculatively
> inserted heap tuple's xmin to InvalidTransactionId/zero.

Oh, dear. I didn't know about that case.

> It probably does make sense to keep something close to this check --
> it just needs to account for speculative insertions to avoid false
> positive reports of corruption. We could perform cross-checks against
> a tuple whose xmin is InvalidTransactionId/zero to verify that it
> really is from an aborted speculative insertion, to the extent that
> that's possible. For example, such a tuple can't be a heap-only tuple,
> and it can't have any xmax value other than InvalidTransactionId/zero.

Since this was back-patched, I think it's probably better to just
remove the error. We can introduce new validation if we want, but that
should probably be master-only.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: pgsql: amcheck: Fix verify_heapam for tuples where xmin or xmax is 0.

From: Peter Geoghegan
On Mon, Mar 27, 2023 at 10:17 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > What about aborted speculative insertions? See
> > heap_abort_speculative(), which directly sets the speculatively
> > inserted heap tuple's xmin to InvalidTransactionId/zero.
>
> Oh, dear. I didn't know about that case.

A big benefit of having extensive amcheck coverage is that it
effectively centralizes information about the on-disk format, in an
easy to understand way, and (over time) puts things on a more rigorous
footing. Now it'll be a lot harder for somebody else to overlook that
case in the future, which is good. Things are trending in the right
direction.

> > It probably does make sense to keep something close to this check --
> > it just needs to account for speculative insertions to avoid false
> > positive reports of corruption. We could perform cross-checks against
> > a tuple whose xmin is InvalidTransactionId/zero to verify that it
> > really is from an aborted speculative insertion, to the extent that
> > that's possible. For example, such a tuple can't be a heap-only tuple,
> > and it can't have any xmax value other than InvalidTransactionId/zero.
>
> Since this was back-patched, I think it's probably better to just
> remove the error. We can introduce new validation if we want, but that
> should probably be master-only.

That makes sense.

I don't think that it's particularly likely that having refined
aborted speculative insertion amcheck coverage will make a critical
difference to any user, at any time. But "amcheck as documentation of
the on-disk format" is reason enough to have it.

--
Peter Geoghegan



Re: pgsql: amcheck: Fix verify_heapam for tuples where xmin or xmax is 0.

From: Robert Haas
On Mon, Mar 27, 2023 at 2:34 PM Peter Geoghegan <pg@bowt.ie> wrote:
> > Since this was back-patched, I think it's probably better to just
> > remove the error. We can introduce new validation if we want, but that
> > should probably be master-only.
>
> That makes sense.

Patch attached.

> I don't think that it's particularly likely that having refined
> aborted speculative insertion amcheck coverage will make a critical
> difference to any user, at any time. But "amcheck as documentation of
> the on-disk format" is reason enough to have it.

Sure, if someone feels like writing the code. I have to admit that I
have mixed feelings about this whole direction. In concept, I agree
with you entirely: a fringe benefit of having checks that tell us
whether or not a page is valid is that it helps to make clear what
page states we think are valid. In practice, however, the point you
raise in your first sentence weighs awfully heavily with me. Spending
a lot of energy on checks that are unlikely to catch practical
problems feels like it may not be the best use of time. I'm not sure
exactly where to draw the line, but it seems highly likely that
there are things we could deduce about the page that wouldn't be worth
the effort. For example, would we bother checking that a tuple with an
in-progress xmin does not have a smaller natts value than a tuple with
a committed xmin? Or that natts values are non-decreasing across a HOT
chain? I suspect there are even more obscure examples of things that
should be true but might not really be worth worrying about in the
code.
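
To make the second example concrete, a check that natts is
non-decreasing across a HOT chain might look roughly like the sketch
below. It is illustrative only: the function name is hypothetical, and
it assumes the chain's natts values have already been collected into an
array in chain order, which real code would have to do by following the
chain on the page.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the first chain member whose natts is smaller
 * than its predecessor's, or -1 if natts never decreases.
 * natts[0] belongs to the root of the (hypothetical) chain.
 */
long
sketch_first_natts_decrease(const unsigned *natts, size_t len)
{
    assert(natts != NULL || len == 0);

    for (size_t i = 1; i < len; i++)
    {
        if (natts[i] < natts[i - 1])
            return (long) i;
    }
    return -1;
}
```

The loop itself is cheap; whether the surrounding bookkeeping is worth
carrying is the question being raised here.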

--
Robert Haas
EDB: http://www.enterprisedb.com


Re: pgsql: amcheck: Fix verify_heapam for tuples where xmin or xmax is 0.

From: Peter Geoghegan
On Mon, Mar 27, 2023 at 1:17 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Patch attached.

This is fine, as far as it goes. Obviously it fixes the immediate problem.

> > I don't think that it's particularly likely that having refined
> > aborted speculative insertion amcheck coverage will make a critical
> > difference to any user, at any time. But "amcheck as documentation of
> > the on-disk format" is reason enough to have it.
>
> Sure, if someone feels like writing the code. I have to admit that I
> have mixed feelings about this whole direction. In concept, I agree
> with you entirely: a fringe benefit of having checks that tell us
> whether or not a page is valid is that it helps to make clear what
> page states we think are valid.

I don't think that it's a fringe benefit; it's just not necessarily of
direct benefit to amcheck users.

Before the HOT chain validation patch went in, it was unclear whether
certain conceivable on-disk states should constitute corruption. In
particular, it wasn't clear to anybody whether or not it was okay for
an LP_REDIRECT to point to an LP_DEAD until recently (and probably
other things besides that). I don't think that we should assume that
the easy part is abstractly defining corruption, while the hard part
is writing the tool to check for the corruption. Sometimes it is, but
I think that it's often the other way around.
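
As a rough illustration of how small the checking code can be once the
rule has been decided, here is a sketch of a redirect-target check. It
is not the verify_heapam implementation. The types and names are
hypothetical, and it assumes a policy in which a redirect must point at
an in-range LP_NORMAL item, so that a redirect to a dead item counts as
corruption. This thread does not itself state how that question was
settled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified line pointer states; stand-ins, not PostgreSQL's macros. */
typedef enum SketchLpState
{
    SKETCH_LP_UNUSED,
    SKETCH_LP_NORMAL,
    SKETCH_LP_REDIRECT,
    SKETCH_LP_DEAD
} SketchLpState;

typedef struct SketchItemId
{
    SketchLpState state;
    size_t        target;   /* 1-based offset; used only for redirects */
} SketchItemId;

/*
 * Return true if every redirect on the page points at an in-range
 * item whose state is SKETCH_LP_NORMAL (assumed policy, see above).
 */
bool
sketch_redirects_ok(const SketchItemId *items, size_t nitems)
{
    assert(items != NULL || nitems == 0);

    for (size_t i = 0; i < nitems; i++)
    {
        if (items[i].state != SKETCH_LP_REDIRECT)
            continue;

        size_t t = items[i].target;

        if (t < 1 || t > nitems)
            return false;       /* redirect points off the page */
        if (items[t - 1].state != SKETCH_LP_NORMAL)
            return false;       /* unused, dead, or another redirect */
    }
    return true;
}
```

The code is a handful of lines; deciding which target states are legal
is the part that needed discussion.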

> In practice, however, the point you
> raise in your first sentence weighs awfully heavily with me. Spending
> a lot of energy on checks that are unlikely to catch practical
> problems feels like it may not be the best use of time.

That definitely could be true, but I don't think that it's terribly
much extra effort in most cases.

> I'm not sure
> exactly where to draw the line, but it seems highly likely that
> there are things we could deduce about the page that wouldn't be worth
> the effort. For example, would we bother checking that a tuple with an
> in-progress xmin does not have a smaller natts value than a tuple with
> a committed xmin? Or that natts values are non-decreasing across a HOT
> chain? I suspect there are even more obscure examples of things that
> should be true but might not really be worth worrying about in the
> code.

A related way of looking at it (that I also find appealing) is that
it's often easier (far easier) to just have the check, and be done
with it. Of course there is bound to be uncertainty about how useful
any given check might be; we're looking for something that is
theoretically never supposed to happen. Why not just assume that it
might matter if it's not costing very much to check for it?

This is quite a different mentality than the one we bring to core
heapam code, where it's quite natural to just avoid strange corner
cases in the on-disk format like the plague. The risk profile is
totally different for amcheck code. Within amcheck, I'd rather go too
far than not go far enough.

--
Peter Geoghegan



Re: pgsql: amcheck: Fix verify_heapam for tuples where xmin or xmax is 0.

From: Robert Haas
On Mon, Mar 27, 2023 at 4:52 PM Peter Geoghegan <pg@bowt.ie> wrote:
> This is fine, as far as it goes. Obviously it fixes the immediate problem.

OK, I've committed and back-patched this fix to v14, just like the
erroneous commit that created the issue.

--
Robert Haas
EDB: http://www.enterprisedb.com