Re: new heapcheck contrib module - Mailing list pgsql-hackers

From Andres Freund
Subject Re: new heapcheck contrib module
Msg-id 20200420194210.sevnpi4lyghoanxi@alap3.anarazel.de
In response to new heapcheck contrib module  (Mark Dilger <mark.dilger@enterprisedb.com>)
Responses Re: new heapcheck contrib module  (Robert Haas <robertmhaas@gmail.com>)
Re: new heapcheck contrib module  (Peter Geoghegan <pg@bowt.ie>)
Re: new heapcheck contrib module  (Mark Dilger <mark.dilger@enterprisedb.com>)
List pgsql-hackers

Hi,

On 2020-04-20 10:59:28 -0700, Mark Dilger wrote:
> I have been talking with Robert about table corruption that occurs
> from time to time. The page checksum feature seems sufficient to
> detect most random corruption problems, but it can't detect "logical"
> corruption, where the page is valid but inconsistent with the rest of
> the database cluster. This can happen due to faulty or ill-conceived
> backup and restore tools, or bad storage, or user error, or bugs in
> the server itself. (Also, not everyone enables checksums.)

This is something we really really really need. I'm very excited to see
progress!


> From 2a1bc0bb9fa94bd929adc1a408900cb925ebcdd5 Mon Sep 17 00:00:00 2001
> From: Mark Dilger <mark.dilger@enterprisedb.com>
> Date: Mon, 20 Apr 2020 08:05:58 -0700
> Subject: [PATCH v2] Adding heapcheck contrib module.
> 
> The heapcheck module introduces a new function for checking a heap
> relation and associated toast relation, if any, for corruption.

Why not add it to amcheck?


I wonder if a mode where heapcheck optionally would only check
non-frozen (perhaps also non-all-visible) regions of a table would be a
good idea? It would make it a lot more viable to run this regularly on
bigger databases, even if that leaves a window for some data to go
unchecked (because it's frozen before the next heapcheck run).
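
If we went that route, the block loop could consult the visibility
map. A rough sketch; "skip_frozen" and the surrounding loop/vmbuffer
variables are mine, not the patch's:

    /* in the per-block loop; vmbuffer starts out InvalidBuffer and is
     * released once the scan finishes */
    if (skip_frozen)
    {
        uint8    mapbits = visibilitymap_get_status(ctx->rel, ctx->blkno,
                                                    &vmbuffer);

        /* an all-frozen page contains no xids that still need checking */
        if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
            continue;
    }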


> The attached module provides the means to scan a relation and sanity
> check it. Currently, it checks xmin and xmax values against
> relfrozenxid and relminmxid, and also validates TOAST pointers. If
> people like this, it could be expanded to perform additional checks.


> The postgres backend already defends against certain forms of
> corruption, by checking the page header of each page before allowing
> it into the page cache, and by checking the page checksum, if enabled.
> Experience shows that broken or ill-conceived backup and restore
> mechanisms can result in a page, or an entire file, being overwritten
> with an earlier version of itself, restored from backup.  Pages thus
> overwritten will appear to have valid page headers and checksums,
> while potentially containing xmin, xmax, and toast pointers that are
> invalid.

We also had a *lot* of bugs that we'd have found a lot earlier, possibly
even during development, if we had a way to easily perform these checks.


> contrib/heapcheck introduces a function, heapcheck_relation, that
> takes a regclass argument, scans the given heap relation, and returns
> rows containing information about corruption found within the table.
> The main focus of the scan is to find invalid xmin, xmax, and toast
> pointer values.  It also checks for structural corruption within the
> page (such as invalid t_hoff values) that could lead to the backend
> aborting should the function blindly trust the data as it finds it.


> +typedef struct CorruptionInfo
> +{
> +    BlockNumber blkno;
> +    OffsetNumber offnum;
> +    int16        lp_off;
> +    int16        lp_flags;
> +    int16        lp_len;
> +    int32        attnum;
> +    int32        chunk;
> +    char       *msg;
> +}            CorruptionInfo;

Adding a short comment explaining what this is for would be good.


> +/* Internal implementation */
> +void        record_corruption(HeapCheckContext * ctx, char *msg);
> +TupleDesc    heapcheck_relation_tupdesc(void);
> +
> +void        beginRelBlockIteration(HeapCheckContext * ctx);
> +bool        relBlockIteration_next(HeapCheckContext * ctx);
> +void        endRelBlockIteration(HeapCheckContext * ctx);
> +
> +void        beginPageTupleIteration(HeapCheckContext * ctx);
> +bool        pageTupleIteration_next(HeapCheckContext * ctx);
> +void        endPageTupleIteration(HeapCheckContext * ctx);
> +
> +void        beginTupleAttributeIteration(HeapCheckContext * ctx);
> +bool        tupleAttributeIteration_next(HeapCheckContext * ctx);
> +void        endTupleAttributeIteration(HeapCheckContext * ctx);
> +
> +void        beginToastTupleIteration(HeapCheckContext * ctx,
> +                                     struct varatt_external *toast_pointer);
> +void        endToastTupleIteration(HeapCheckContext * ctx);
> +bool        toastTupleIteration_next(HeapCheckContext * ctx);
> +
> +bool        TransactionIdStillValid(TransactionId xid, FullTransactionId *fxid);
> +bool        HeapTupleIsVisible(HeapTupleHeader tuphdr, HeapCheckContext * ctx);
> +void        check_toast_tuple(HeapCheckContext * ctx);
> +bool        check_tuple_attribute(HeapCheckContext * ctx);
> +void        check_tuple(HeapCheckContext * ctx);
> +
> +List       *check_relation(Oid relid);
> +void        check_relation_relkind(Relation rel);

Why aren't these static?


> +/*
> + * record_corruption
> + *
> + *   Record a message about corruption, including information
> + *   about where in the relation the corruption was found.
> + */
> +void
> +record_corruption(HeapCheckContext * ctx, char *msg)
> +{

Given that you went through the trouble of adding prototypes for all of
these, I'd start with the most important functions, not the unimportant
details.


> +/*
> + * Helper function to construct the TupleDesc needed by heapcheck_relation.
> + */
> +TupleDesc
> +heapcheck_relation_tupdesc()

Missing (void) (it's our style, even though you could theoretically not
have it as long as you have a prototype).


> +{
> +    TupleDesc    tupdesc;
> +    AttrNumber    maxattr = 8;

This 8 appears in multiple places, I'd add a define for it.
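
Something like (name invented):

    /* number of output columns of heapcheck_relation() */
    #define HEAPCHECK_RELATION_COLS 8

and then use it both in CreateTemplateTupleDesc() and in the trailing
Assert().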

> +    AttrNumber    a = 0;
> +
> +    tupdesc = CreateTemplateTupleDesc(maxattr);
> +    TupleDescInitEntry(tupdesc, ++a, "blkno", INT8OID, -1, 0);
> +    TupleDescInitEntry(tupdesc, ++a, "offnum", INT4OID, -1, 0);
> +    TupleDescInitEntry(tupdesc, ++a, "lp_off", INT2OID, -1, 0);
> +    TupleDescInitEntry(tupdesc, ++a, "lp_flags", INT2OID, -1, 0);
> +    TupleDescInitEntry(tupdesc, ++a, "lp_len", INT2OID, -1, 0);
> +    TupleDescInitEntry(tupdesc, ++a, "attnum", INT4OID, -1, 0);
> +    TupleDescInitEntry(tupdesc, ++a, "chunk", INT4OID, -1, 0);
> +    TupleDescInitEntry(tupdesc, ++a, "msg", TEXTOID, -1, 0);
> +    Assert(a == maxattr);
> +
> +    return BlessTupleDesc(tupdesc);
> +}


> +/*
> + * heapcheck_relation
> + *
> + *   Scan and report corruption in heap pages or in associated toast relation.
> + */
> +Datum
> +heapcheck_relation(PG_FUNCTION_ARGS)
> +{
> +    FuncCallContext *funcctx;
> +    CheckRelCtx *ctx;
> +
> +    if (SRF_IS_FIRSTCALL())
> +    {

I think it'd be good to have a version that just returned a boolean. For
one, in many cases that's all we care about when scripting things. But
also, on a large relation, there could be a lot of errors.
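
On top of the patch's existing internals that could be as cheap as the
following sketch (function name invented; a real version should
presumably stop at the first problem instead of collecting all of them):

    PG_FUNCTION_INFO_V1(heapcheck_relation_ok);

    /* returns true iff no corruption was found */
    Datum
    heapcheck_relation_ok(PG_FUNCTION_ARGS)
    {
        Oid        relid = PG_GETARG_OID(0);

        PG_RETURN_BOOL(check_relation(relid) == NIL);
    }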


> +        Oid            relid = PG_GETARG_OID(0);
> +        MemoryContext oldcontext;
> +
> +        /*
> +         * Scan the entire relation, building up a list of corruption found in
> +         * ctx->corruption, for returning later.  The scan must be performed
> +         * in a memory context that will survive until after all rows are
> +         * returned.
> +         */
> +        funcctx = SRF_FIRSTCALL_INIT();
> +        oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
> +        funcctx->tuple_desc = heapcheck_relation_tupdesc();
> +        ctx = (CheckRelCtx *) palloc0(sizeof(CheckRelCtx));
> +        ctx->corruption = check_relation(relid);
> +        ctx->idx = 0;            /* start the iterator at the beginning */
> +        funcctx->user_fctx = (void *) ctx;
> +        MemoryContextSwitchTo(oldcontext);

Hm. This builds up all the errors in memory. Is that a good idea? For a
large relation, having one returned row for each corrupt tuple could be
a heck of a lot of data.

I think it'd be better to use the spilling SRF protocol here.  It's not
like you're benefiting from deferring the tuple construction to the
return phase currently.
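
I.e. roughly this shape (sketch only; the per-row values/nulls arrays
and the allowedModes checks are omitted):

    ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
    TupleDesc    tupdesc = heapcheck_relation_tupdesc();
    Tuplestorestate *tupstore;
    MemoryContext oldcontext;

    /* the tuplestore lives in the per-query context and spills to disk
     * once it exceeds work_mem, so a badly corrupted relation can't
     * exhaust backend memory */
    oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
    tupstore = tuplestore_begin_heap(true, false, work_mem);
    rsinfo->returnMode = SFRM_Materialize;
    rsinfo->setResult = tupstore;
    rsinfo->setDesc = tupdesc;
    MemoryContextSwitchTo(oldcontext);

    /* ... and each time the scan finds corruption: */
    tuplestore_putvalues(tupstore, tupdesc, values, nulls);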


> +/*
> + * beginRelBlockIteration
> + *
> + *   For the given heap relation being checked, as recorded in ctx, sets up
> + *   variables for iterating over the heap's pages.
> + *
> + *   The caller should have already opened the heap relation, ctx->rel
> + */
> +void
> +beginRelBlockIteration(HeapCheckContext * ctx)
> +{
> +    ctx->nblocks = RelationGetNumberOfBlocks(ctx->rel);
> +    ctx->blkno = InvalidBlockNumber;
> +    ctx->bstrategy = GetAccessStrategy(BAS_BULKREAD);
> +    ctx->buffer = InvalidBuffer;
> +    ctx->page = NULL;
> +}
> +
> +/*
> + * endRelBlockIteration
> + *
> + *   Releases resources that were reserved by either beginRelBlockIteration or
> + *   relBlockIteration_next.
> + */
> +void
> +endRelBlockIteration(HeapCheckContext * ctx)
> +{
> +    /*
> +     * Clean up.  If the caller iterated to the end, the final call to
> +     * relBlockIteration_next will already have released the buffer, but if
> +     * the caller is bailing out early, we have to release it ourselves.
> +     */
> +    if (InvalidBuffer != ctx->buffer)
> +        UnlockReleaseBuffer(ctx->buffer);
> +}

These seem mighty granular and generically named to me.


> + * pageTupleIteration_next
> + *
> + *   Advances the state tracked in ctx to the next tuple on the page.
> + *
> + *   Caller should have already set up the iteration via
> + *   beginPageTupleIteration, and should stop calling when this function
> + *   returns false.
> + */
> +bool
> +pageTupleIteration_next(HeapCheckContext * ctx)

I don't think this is a naming scheme we use anywhere in postgres. I
don't think it's a good idea to add yet more of those.


> +{
> +    /*
> +     * Iterate to the next interesting line pointer, if any. Unused, dead and
> +     * redirect line pointers are of no interest.
> +     */
> +    do
> +    {
> +        ctx->offnum = OffsetNumberNext(ctx->offnum);
> +        if (ctx->offnum > ctx->maxoff)
> +            return false;
> +        ctx->itemid = PageGetItemId(ctx->page, ctx->offnum);
> +    } while (!ItemIdIsUsed(ctx->itemid) ||
> +             ItemIdIsDead(ctx->itemid) ||
> +             ItemIdIsRedirected(ctx->itemid));

This is an odd loop. Part of the test is in the body, part in the
loop header.
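
E.g. with the whole test in one place:

    /* returns false when past the last line pointer on the page */
    for (;;)
    {
        ctx->offnum = OffsetNumberNext(ctx->offnum);
        if (ctx->offnum > ctx->maxoff)
            return false;
        ctx->itemid = PageGetItemId(ctx->page, ctx->offnum);

        /* unused, dead and redirect line pointers are of no interest */
        if (ItemIdIsUsed(ctx->itemid) &&
            !ItemIdIsDead(ctx->itemid) &&
            !ItemIdIsRedirected(ctx->itemid))
            return true;
    }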


> +/*
> + * Given a TransactionId, attempt to interpret it as a valid
> + * FullTransactionId, neither in the future nor overlong in
> + * the past.  Stores the inferred FullTransactionId in *fxid.
> + *
> + * Returns whether the xid is newer than the oldest clog xid.
> + */
> +bool
> +TransactionIdStillValid(TransactionId xid, FullTransactionId *fxid)

I don't at all like the naming of this function. This isn't a reliable
check. As before, it obviously also should be static.


> +{
> +    FullTransactionId fnow;
> +    uint32        epoch;
> +
> +    /* Initialize fxid; we'll overwrite this later if needed */
> +    *fxid = FullTransactionIdFromEpochAndXid(0, xid);

> +    /* Special xids can quickly be turned into invalid fxids */
> +    if (!TransactionIdIsValid(xid))
> +        return false;
> +    if (!TransactionIdIsNormal(xid))
> +        return true;
> +
> +    /*
> +     * Charitably infer the full transaction id as being within one epoch ago
> +     */
> +    fnow = ReadNextFullTransactionId();
> +    epoch = EpochFromFullTransactionId(fnow);
> +    *fxid = FullTransactionIdFromEpochAndXid(epoch, xid);

So now you're overwriting the fxid value from above unconditionally?


> +    if (!FullTransactionIdPrecedes(*fxid, fnow))
> +        *fxid = FullTransactionIdFromEpochAndXid(epoch - 1, xid);


I think it'd be better to do the conversion the following way:

    *fxid = FullTransactionIdFromU64(U64FromFullTransactionId(fnow)
                                     + (int32) (xid - XidFromFullTransactionId(fnow)));

i.e. take the signed 32-bit distance from the next xid and apply it to
the full 64-bit value, which stays correct across epoch boundaries.


> +    if (!FullTransactionIdPrecedes(*fxid, fnow))
> +        return false;
> +    /* The oldestClogXid is protected by CLogTruncationLock */
> +    Assert(LWLockHeldByMe(CLogTruncationLock));
> +    if (TransactionIdPrecedes(xid, ShmemVariableCache->oldestClogXid))
> +        return false;
> +    return true;
> +}

Why is this testing oldestClogXid instead of oldestXid?


> +/*
> + * HeapTupleIsVisible
> + *
> + *    Determine whether tuples are visible for heapcheck.  Similar to
> + *  HeapTupleSatisfiesVacuum, but with critical differences.
> + *
> + *  1) Does not touch hint bits.  It seems imprudent to write hint bits
> + *     to a table during a corruption check.
> + *  2) Gracefully handles xids that are too old by calling
> + *     TransactionIdStillValid before TransactionLogFetch, thus avoiding
> + *     a backend abort.

I think it'd be better to protect against this by avoiding checks for
xids that are older than relfrozenxid. And ones that are newer than
ReadNextTransactionId().  But all of those cases should be errors
anyway, so it doesn't seem like that should be handled within the
visibility routine.
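
Roughly (sketch; helper name invented):

    /* An xid we can safely inquire about must fall in
     * [relfrozenxid, next xid); everything else is corruption and
     * should be reported before any clog lookup happens. */
    static bool
    xid_in_checkable_range(TransactionId xid, HeapCheckContext *ctx)
    {
        if (!TransactionIdIsNormal(xid))
            return true;        /* invalid/bootstrap/frozen: no lookup needed */

        return !TransactionIdPrecedes(xid, ctx->relfrozenxid) &&
            TransactionIdPrecedes(xid, ReadNextTransactionId());
    }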


> + *  3) Only makes a boolean determination of whether heapcheck should
> + *     see the tuple, rather than doing extra work for vacuum-related
> + *     categorization.
> + */
> +bool
> +HeapTupleIsVisible(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
> +{

> +    FullTransactionId fxmin,
> +                fxmax;
> +    uint16        infomask = tuphdr->t_infomask;
> +    TransactionId xmin = HeapTupleHeaderGetXmin(tuphdr);
> +
> +    if (!HeapTupleHeaderXminCommitted(tuphdr))
> +    {

Hm. I wonder if it'd be good to crosscheck the xid committed hint bits
with clog?
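
E.g. (sketch; assumes out-of-range xids were already rejected, since
TransactionIdDidCommit() isn't safe on them):

    /* a set XMIN_COMMITTED hint bit that clog contradicts is corruption
     * (or a hint bit bug) */
    if (HeapTupleHeaderXminCommitted(tuphdr) &&
        TransactionIdIsNormal(xmin) &&
        !TransactionIdDidCommit(xmin))
        record_corruption(ctx, _("tuple has XMIN_COMMITTED set, but xmin "
                                 "did not commit according to clog"));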


> +        else if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuphdr)))
> +        {
> +            LWLockRelease(CLogTruncationLock);
> +            return false;        /* HEAPTUPLE_DEAD */
> +        }

Note that this actually can error out, if xmin is a subtransaction xid,
because pg_subtrans is truncated a lot more aggressively than anything
else. I think you'd need to filter against subtransactions older than
RecentXmin before here, and treat that as an error.


> +    if (!(infomask & HEAP_XMAX_INVALID) && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
> +    {
> +        if (infomask & HEAP_XMAX_IS_MULTI)
> +        {
> +            TransactionId xmax = HeapTupleGetUpdateXid(tuphdr);
> +
> +            /* not LOCKED_ONLY, so it has to have an xmax */
> +            if (!TransactionIdIsValid(xmax))
> +            {
> +                record_corruption(ctx, _("heap tuple with XMAX_IS_MULTI is "
> +                                         "neither LOCKED_ONLY nor has a "
> +                                         "valid xmax"));
> +                return false;
> +            }

I think it's bad to have code like this in a routine that's named like a
generic visibility check routine.


> +            if (TransactionIdIsInProgress(xmax))
> +                return false;    /* HEAPTUPLE_DELETE_IN_PROGRESS */
> +
> +            LWLockAcquire(CLogTruncationLock, LW_SHARED);
> +            if (!TransactionIdStillValid(xmax, &fxmax))
> +            {
> +                LWLockRelease(CLogTruncationLock);
> +                record_corruption(ctx, psprintf("tuple xmax = %u (interpreted "
> +                                                "as " UINT64_FORMAT
> +                                                ") not or no longer valid",
> +                                                xmax, fxmax.value));
> +                return false;
> +            }
> +            else if (TransactionIdDidCommit(xmax))
> +            {
> +                LWLockRelease(CLogTruncationLock);
> +                return false;    /* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */
> +            }
> +            LWLockRelease(CLogTruncationLock);
> +            /* Ok, the tuple is live */

I don't think random interspersed uses of CLogTruncationLock are a good
idea. If you move to only checking visibility after verifying that the
tuple's xids fit into [relfrozenxid, nextXid), then you don't need to
take any locks here, as long as a lock against vacuum is taken (which I
think this should do anyway).


> +/*
> + * check_tuple
> + *
> + *   Checks the current tuple as tracked in ctx for corruption.  Records any
> + *   corruption found in ctx->corruption.
> + *
> + *   The caller should have iterated to a tuple via pageTupleIteration_next.
> + */
> +void
> +check_tuple(HeapCheckContext * ctx)
> +{
> +    bool        fatal = false;

Wait, aren't some checks here duplicates of ones in
HeapTupleIsVisible()?


> +    /* Check relminmxid against mxid, if any */
> +    if (ctx->infomask & HEAP_XMAX_IS_MULTI &&
> +        MultiXactIdPrecedes(ctx->xmax, ctx->relminmxid))
> +    {
> +        record_corruption(ctx, psprintf("tuple xmax = %u precedes relation "
> +                                        "relminmxid = %u",
> +                                        ctx->xmax, ctx->relminmxid));
> +    }

It's pretty weird that the routines here access xmin/xmax/... via
HeapCheckContext, but HeapTupleIsVisible() doesn't.


> +    /* Check xmin against relfrozenxid */
> +    if (TransactionIdIsNormal(ctx->relfrozenxid) &&
> +        TransactionIdIsNormal(ctx->xmin) &&
> +        TransactionIdPrecedes(ctx->xmin, ctx->relfrozenxid))
> +    {
> +        record_corruption(ctx, psprintf("tuple xmin = %u precedes relation "
> +                                        "relfrozenxid = %u",
> +                                        ctx->xmin, ctx->relfrozenxid));
> +    }
> +
> +    /* Check xmax against relfrozenxid */
> +    if (TransactionIdIsNormal(ctx->relfrozenxid) &&
> +        TransactionIdIsNormal(ctx->xmax) &&
> +        TransactionIdPrecedes(ctx->xmax, ctx->relfrozenxid))
> +    {
> +        record_corruption(ctx, psprintf("tuple xmax = %u precedes relation "
> +                                        "relfrozenxid = %u",
> +                                        ctx->xmax, ctx->relfrozenxid));
> +    }

These all should be fatal. You definitely cannot just continue
afterwards, given the justification quoted below (a sketch of what I
mean follows the hunk):


> +    /*
> +     * Iterate over the attributes looking for broken toast values. This
> +     * roughly follows the logic of heap_deform_tuple, except that it doesn't
> +     * bother building up isnull[] and values[] arrays, since nobody wants
> +     * them, and it unrolls anything that might trip over an Assert when
> +     * processing corrupt data.
> +     */
> +    beginTupleAttributeIteration(ctx);
> +    while (tupleAttributeIteration_next(ctx) &&
> +           check_tuple_attribute(ctx))
> +        ;
> +    endTupleAttributeIteration(ctx);
> +}
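
For the xmin/relfrozenxid case, being fatal would look roughly like this
(sketch, reusing the 'fatal' flag that check_tuple() already declares):

    if (TransactionIdIsNormal(ctx->relfrozenxid) &&
        TransactionIdIsNormal(ctx->xmin) &&
        TransactionIdPrecedes(ctx->xmin, ctx->relfrozenxid))
    {
        record_corruption(ctx, psprintf("tuple xmin = %u precedes relation "
                                        "relfrozenxid = %u",
                                        ctx->xmin, ctx->relfrozenxid));
        fatal = true;
    }

    /* likewise for xmax/relminmxid; then, before deforming attributes: */
    if (fatal)
        return;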

I really don't find these helpers helpful.


> +/*
> + * check_relation
> + *
> + *   Checks the relation given by relid for corruption, returning a list of all
> + *   it finds.
> + *
> + *   The caller should set up the memory context as desired before calling.
> + *   The returned list belongs to the caller.
> + */
> +List *
> +check_relation(Oid relid)
> +{
> +    HeapCheckContext ctx;
> +
> +    memset(&ctx, 0, sizeof(HeapCheckContext));
> +
> +    /* Open the relation */
> +    ctx.relid = relid;
> +    ctx.corruption = NIL;
> +    ctx.rel = relation_open(relid, AccessShareLock);

I think you need to protect at least against concurrent schema changes
given some of your checks. But I think it'd be better to also conflict
with vacuum here.
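
E.g. (sketch):

    /* ShareUpdateExclusiveLock conflicts with VACUUM and with concurrent
     * schema changes, while still allowing normal reads and writes */
    ctx.rel = relation_open(relid, ShareUpdateExclusiveLock);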


> +    check_relation_relkind(ctx.rel);

I think you also need to ensure that the table is actually using heap
AM, not another tableam. Oh - you're doing that inside the check. But
that's confusing, because that's not 'relkind'.
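
I.e. something like this, separate from the relkind check (sketch):

    /* the checks below only understand heapam's on-disk format */
    if (ctx.rel->rd_rel->relam != HEAP_TABLE_AM_OID)
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("\"%s\" does not use the heap table AM",
                        RelationGetRelationName(ctx.rel))));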


> +    ctx.relDesc = RelationGetDescr(ctx.rel);
> +    ctx.rel_natts = RelationGetDescr(ctx.rel)->natts;
> +    ctx.relfrozenxid = ctx.rel->rd_rel->relfrozenxid;
> +    ctx.relminmxid = ctx.rel->rd_rel->relminmxid;

three naming schemes in three lines...



> +    /* check all blocks of the relation */
> +    beginRelBlockIteration(&ctx);
> +    while (relBlockIteration_next(&ctx))
> +    {
> +        /* Perform tuple checks */
> +        beginPageTupleIteration(&ctx);
> +        while (pageTupleIteration_next(&ctx))
> +            check_tuple(&ctx);
> +        endPageTupleIteration(&ctx);
> +    }
> +    endRelBlockIteration(&ctx);

I again do not find this helper stuff helpful.


> +    /* Close the associated toast table and indexes, if any. */
> +    if (ctx.has_toastrel)
> +    {
> +        toast_close_indexes(ctx.toast_indexes, ctx.num_toast_indexes,
> +                            AccessShareLock);
> +        table_close(ctx.toastrel, AccessShareLock);
> +    }
> +
> +    /* Close the main relation */
> +    relation_close(ctx.rel, AccessShareLock);

Why the closing here?



> +# This regression test demonstrates that the heapcheck_relation() function
> +# supplied with this contrib module correctly identifies specific kinds of
> +# corruption within pages.  To test this, we need a mechanism to create corrupt
> +# pages with predictable, repeatable corruption.  The postgres backend cannot be
> +# expected to help us with this, as its design is not consistent with the goal
> +# of intentionally corrupting pages.
> +#
> +# Instead, we create a table to corrupt, and with careful consideration of how
> +# postgresql lays out heap pages, we seek to offsets within the page and
> +# overwrite deliberately chosen bytes with specific values calculated to
> +# corrupt the page in expected ways.  We then verify that heapcheck_relation
> +# reports the corruption, and that it runs without crashing.  Note that the
> +# backend cannot simply be started to run queries against the corrupt table, as
> +# the backend will crash, at least for some of the corruption types we
> +# generate.
> +#
> +# Autovacuum potentially touching the table in the background makes the exact
> +# behavior of this test harder to reason about.  We turn it off to keep things
> +# simpler.  We use a "belt and suspenders" approach, turning it off for the
> +# system generally in postgresql.conf, and turning it off specifically for the
> +# test table.
> +#
> +# This test depends on the table being written to the heap file exactly as we
> +# expect it to be, so we take care to arrange the columns of the table, and
> +# insert rows of the table, that give predictable sizes and locations within
> +# the table page.

I have a hard time believing this is going to be really
reliable. E.g. the alignment requirements will vary between platforms,
leading to different layouts. In particular, MAXALIGN differs between
platforms.

Also, it's supported to compile postgres with a different pagesize.


Greetings,

Andres Freund


