Re: amcheck assert failure - Mailing list pgsql-bugs

From Peter Geoghegan
Subject Re: amcheck assert failure
Date
Msg-id CAH2-WzmkurhCqnyLHxk0VkOZqd49+ZZsp1xAJOg7j2x7dmp_XQ@mail.gmail.com
Whole thread Raw
In response to Re: amcheck assert failure  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: amcheck assert failure  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: amcheck assert failure  (Grigory Smolkin <g.smolkin@postgrespro.ru>)
List pgsql-bugs
On Mon, Apr 22, 2019 at 12:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Nonetheless ... do we really want an assertion failure rather than
> some more-mundane error report?  Seems like the point of amcheck
> is to detect data corruption, so I'd rather get an ERROR than an
> assertion (which would not fire in production builds anyway).

(Apologies for sending this a second time, Tom -- somehow failed to CC
the list the first time)

I think that the size cross-check must have failed to catch the problem:

        /*
         * lp_len should match the IndexTuple reported length exactly, since
         * lp_len is completely redundant in indexes, and both sources of
         * tuple length are MAXALIGN()'d.  nbtree does not use lp_len all that
         * frequently, and is surprisingly tolerant of corrupt lp_len fields.
         */
        if (tupsize != ItemIdGetLength(itemid))
            ereport(ERROR,
                    (errcode(ERRCODE_INDEX_CORRUPTED),
                     errmsg("index tuple size does not equal lp_len in
index \"%s\"",
                            RelationGetRelationName(state->rel)),
                     errdetail_internal("Index tid=(%u,%u) tuple
size=%zu lp_len=%u page lsn=%X/%X.",
                                        state->targetblock, offset,
                                        tupsize, ItemIdGetLength(itemid),
                                        (uint32) (state->targetlsn >> 32),
                                        (uint32) state->targetlsn),
                     errhint("This could be a torn page problem.")));

It seems like the length from tupsize (IndexTupleSize()) happened to
match the redundant ItemIdGetLength()-reported size from the line
pointer. No idea how that could actually be possible, since the value
returned by ItemIdGetLength() would be 12592. What are the chances of
that accidentally matching what the wild itup pointer reported as its
IndexTupleSize()?

The only explanation I can think of is that the assertion failure was
from the core code -- _bt_check_natts() is itself called within core
code assertions, so it's just about possible that that was how this
happened, before amcheck was even called. Doesn't seem like a very
good explanation. Do you think that that's possible, Grigory?

In any case, you're clearly right that amcheck could be more defensive
here. My pg_hexedit tool detected that there was an LP_UNUSED item
pointer with storage (i.e. whose lp_len > 0), which is a check that
amcheck can and should perform, but doesn't. Especially because it
would prevent a deference of a likely-wild pointer (IndexTuple).

We could:

* Throw an error when any line item reports lp_len == 0.

* Throw an error when there is a line item that's either LP_UNUSED, or
LP_REDIRECT. The corrupt page sent by Grigory had both.

What do you think of the idea of committing some simple checks to do
this with contrib/amcheck on v12?

-- 
Peter Geoghegan



pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: amcheck assert failure
Next
From: Tom Lane
Date:
Subject: Re: amcheck assert failure