Thread: new heapcheck contrib module

new heapcheck contrib module

From: Mark Dilger

Hackers,

I have been talking with Robert about table corruption that occurs from time to time. The page checksum feature seems
sufficient to detect most random corruption problems, but it can't detect "logical" corruption, where the page is valid
but inconsistent with the rest of the database cluster. This can happen due to faulty or ill-conceived backup and
restore tools, or bad storage, or user error, or bugs in the server itself. (Also, not everyone enables checksums.)

The attached module provides the means to scan a relation and sanity check it. Currently, it checks xmin and xmax
values against relfrozenxid and relminmxid, and also validates TOAST pointers. If people like this, it could be expanded
to perform additional checks.
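
To give a flavor of the checks, here is a minimal sketch of the xmin-vs-relfrozenxid test (illustrative only; the
names and reporting are not the exact code from the patch):

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "access/transam.h"

    /*
     * Sketch: a normal xmin that precedes the relation's relfrozenxid should
     * be impossible, so report it as corruption.  The real patch records
     * block/offset details alongside the message.
     */
    static bool
    xmin_precedes_relfrozenxid(HeapTupleHeader tuphdr, TransactionId relfrozenxid)
    {
        TransactionId xmin = HeapTupleHeaderGetXmin(tuphdr);

        return TransactionIdIsNormal(relfrozenxid) &&
               TransactionIdIsNormal(xmin) &&
               TransactionIdPrecedes(xmin, relfrozenxid);
    }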

A prior v1 patch was discussed off-list with Robert and was not posted.  Here is v2:



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From: Peter Geoghegan

On Mon, Apr 20, 2020 at 10:59 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> The attached module provides the means to scan a relation and sanity check it. Currently, it checks xmin and xmax
> values against relfrozenxid and relminmxid, and also validates TOAST pointers. If people like this, it could be expanded
> to perform additional checks.

Cool. Why not make it part of contrib/amcheck?

We talked about the kinds of checks that we'd like to have for a tool
like this before:

https://postgr.es/m/20161017014605.GA1220186@tornado.leadboat.com

--
Peter Geoghegan



Re: new heapcheck contrib module

From: Robert Haas

On Mon, Apr 20, 2020 at 2:09 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Cool. Why not make it part of contrib/amcheck?

I wondered if people would suggest that. Didn't take long.

The documentation would need some updating, but that's doable.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From: Peter Geoghegan

On Mon, Apr 20, 2020 at 11:19 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I wondered if people would suggest that. Didn't take long.

You were the one that pointed out that my first version of
contrib/amcheck, which was called "contrib/btreecheck", should have
a more general name. And rightly so!

The basic interface used for the heap checker functions seems very
similar to what amcheck already offers for B-Tree indexes, so it seems
very natural to distribute them together.

IMV, the problem that we have with amcheck is that it's too hard to
use in a top down kind of way. Perhaps there is an opportunity to
provide a more top-down interface to an expanded version of amcheck
that does heap checking. Something with a high level practical focus,
in addition to the low level functions. I'm not saying that Mark
should be required to solve that problem, but it certainly seems worth
considering now.

> The documentation would need some updating, but that's doable.

It would also probably need a bit of renaming, so that analogous
function names are used.


--
Peter Geoghegan



Re: new heapcheck contrib module

From: Mark Dilger

> On Apr 20, 2020, at 11:31 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>
> IMV, the problem that we have with amcheck is that it's too hard to
> use in a top down kind of way. Perhaps there is an opportunity to
> provide a more top-down interface to an expanded version of amcheck
> that does heap checking. Something with a high level practical focus,
> in addition to the low level functions. I'm not saying that Mark
> should be required to solve that problem, but it certainly seems worth
> considering now.

Thanks for your quick response and interest in this submission!

Can you elaborate on "top-down"?  I'm not sure what that means in this context.

I don't mind going further with this project if I understand what you are suggesting.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From: Peter Geoghegan

I mean an interface that's friendly to DBAs, that verifies an entire database. No custom sql query required. Something that provides a reasonable mix of verification options based on high level directives. All verification methods can be combined in a granular, possibly randomized fashion. Maybe we can make this run in parallel. 

For example, maybe your heap checker code sometimes does index probes for a subset of indexes and heap tuples. It's not hard to combine it with the rootdescend stuff from amcheck. It should be composable. 

The interface you've chosen is a good starting point. But let's not miss an opportunity to make everything work together. 

Peter Geoghegan
(Sent from my phone)

Re: new heapcheck contrib module

From: Mark Dilger

> On Apr 20, 2020, at 12:37 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>
> I mean an interface that's friendly to DBAs, that verifies an entire database. No custom sql query required.
> Something that provides a reasonable mix of verification options based on high level directives. All verification
> methods can be combined in a granular, possibly randomized fashion. Maybe we can make this run in parallel.
>
> For example, maybe your heap checker code sometimes does index probes for a subset of indexes and heap tuples. It's
> not hard to combine it with the rootdescend stuff from amcheck. It should be composable.
>
> The interface you've chosen is a good starting point. But let's not miss an opportunity to make everything work
> together.

Ok, I'll work in that direction and repost when I have something along those lines.

Thanks again for your input.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From: Andres Freund

Hi,

On 2020-04-20 10:59:28 -0700, Mark Dilger wrote:
> I have been talking with Robert about table corruption that occurs
> from time to time. The page checksum feature seems sufficient to
> detect most random corruption problems, but it can't detect "logical"
> corruption, where the page is valid but inconsistent with the rest of
> the database cluster. This can happen due to faulty or ill-conceived
> backup and restore tools, or bad storage, or user error, or bugs in
> the server itself. (Also, not everyone enables checksums.)

This is something we really really really need. I'm very excited to see
progress!


> From 2a1bc0bb9fa94bd929adc1a408900cb925ebcdd5 Mon Sep 17 00:00:00 2001
> From: Mark Dilger <mark.dilger@enterprisedb.com>
> Date: Mon, 20 Apr 2020 08:05:58 -0700
> Subject: [PATCH v2] Adding heapcheck contrib module.
> 
> The heapcheck module introduces a new function for checking a heap
> relation and associated toast relation, if any, for corruption.

Why not add it to amcheck?


I wonder if a mode where heapcheck optionally would only check
non-frozen (perhaps also non-all-visible) regions of a table would be a
good idea? Would make it a lot more viable to run this regularly on
bigger databases. Even if there's a window to not check some data
(because it's frozen before the next heapcheck run).
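
Roughly what I have in mind, as a sketch (the skip options are hypothetical, not something the patch has today):

    #include "postgres.h"
    #include "access/visibilitymap.h"

    /*
     * Sketch: consult the visibility map and skip blocks that are marked
     * all-frozen (or, optionally, all-visible), so that routine runs on big
     * tables only look at recently modified data.
     */
    static bool
    should_skip_block(Relation rel, BlockNumber blkno, Buffer *vmbuffer,
                      bool skip_frozen, bool skip_all_visible)
    {
        uint8       vmstatus = visibilitymap_get_status(rel, blkno, vmbuffer);

        if (skip_frozen && (vmstatus & VISIBILITYMAP_ALL_FROZEN) != 0)
            return true;
        if (skip_all_visible && (vmstatus & VISIBILITYMAP_ALL_VISIBLE) != 0)
            return true;
        return false;
    }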


> The attached module provides the means to scan a relation and sanity
> check it. Currently, it checks xmin and xmax values against
> relfrozenxid and relminmxid, and also validates TOAST pointers. If
> people like this, it could be expanded to perform additional checks.


> The postgres backend already defends against certain forms of
> corruption, by checking the page header of each page before allowing
> it into the page cache, and by checking the page checksum, if enabled.
> Experience shows that broken or ill-conceived backup and restore
> mechanisms can result in a page, or an entire file, being overwritten
> with an earlier version of itself, restored from backup.  Pages thus
> overwritten will appear to have valid page headers and checksums,
> while potentially containing xmin, xmax, and toast pointers that are
> invalid.

We also had a *lot* of bugs that we'd have found a lot earlier, possibly
even during development, if we had a way to easily perform these checks.


> contrib/heapcheck introduces a function, heapcheck_relation, that
> takes a regclass argument, scans the given heap relation, and returns
> rows containing information about corruption found within the table.
> The main focus of the scan is to find invalid xmin, xmax, and toast
> pointer values.  It also checks for structural corruption within the
> page (such as invalid t_hoff values) that could lead to the backend
> aborting should the function blindly trust the data as it finds it.


> +typedef struct CorruptionInfo
> +{
> +    BlockNumber blkno;
> +    OffsetNumber offnum;
> +    int16        lp_off;
> +    int16        lp_flags;
> +    int16        lp_len;
> +    int32        attnum;
> +    int32        chunk;
> +    char       *msg;
> +}            CorruptionInfo;

Adding a short comment explaining what this is for would be good.


> +/* Internal implementation */
> +void        record_corruption(HeapCheckContext * ctx, char *msg);
> +TupleDesc    heapcheck_relation_tupdesc(void);
> +
> +void        beginRelBlockIteration(HeapCheckContext * ctx);
> +bool        relBlockIteration_next(HeapCheckContext * ctx);
> +void        endRelBlockIteration(HeapCheckContext * ctx);
> +
> +void        beginPageTupleIteration(HeapCheckContext * ctx);
> +bool        pageTupleIteration_next(HeapCheckContext * ctx);
> +void        endPageTupleIteration(HeapCheckContext * ctx);
> +
> +void        beginTupleAttributeIteration(HeapCheckContext * ctx);
> +bool        tupleAttributeIteration_next(HeapCheckContext * ctx);
> +void        endTupleAttributeIteration(HeapCheckContext * ctx);
> +
> +void        beginToastTupleIteration(HeapCheckContext * ctx,
> +                                     struct varatt_external *toast_pointer);
> +void        endToastTupleIteration(HeapCheckContext * ctx);
> +bool        toastTupleIteration_next(HeapCheckContext * ctx);
> +
> +bool        TransactionIdStillValid(TransactionId xid, FullTransactionId *fxid);
> +bool        HeapTupleIsVisible(HeapTupleHeader tuphdr, HeapCheckContext * ctx);
> +void        check_toast_tuple(HeapCheckContext * ctx);
> +bool        check_tuple_attribute(HeapCheckContext * ctx);
> +void        check_tuple(HeapCheckContext * ctx);
> +
> +List       *check_relation(Oid relid);
> +void        check_relation_relkind(Relation rel);

Why aren't these static?


> +/*
> + * record_corruption
> + *
> + *   Record a message about corruption, including information
> + *   about where in the relation the corruption was found.
> + */
> +void
> +record_corruption(HeapCheckContext * ctx, char *msg)
> +{

Given that you went through the trouble of adding prototypes for all of
these, I'd start with the most important functions, not the unimportant
details.


> +/*
> + * Helper function to construct the TupleDesc needed by heapcheck_relation.
> + */
> +TupleDesc
> +heapcheck_relation_tupdesc()

Missing (void) (it's our style, even though you could theoretically not
have it as long as you have a prototype).


> +{
> +    TupleDesc    tupdesc;
> +    AttrNumber    maxattr = 8;

This 8 is in multiple places, I'd add a define for it.
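
Something like (name invented here):

    #define HEAPCHECK_RELATION_COLS    8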

> +    AttrNumber    a = 0;
> +
> +    tupdesc = CreateTemplateTupleDesc(maxattr);
> +    TupleDescInitEntry(tupdesc, ++a, "blkno", INT8OID, -1, 0);
> +    TupleDescInitEntry(tupdesc, ++a, "offnum", INT4OID, -1, 0);
> +    TupleDescInitEntry(tupdesc, ++a, "lp_off", INT2OID, -1, 0);
> +    TupleDescInitEntry(tupdesc, ++a, "lp_flags", INT2OID, -1, 0);
> +    TupleDescInitEntry(tupdesc, ++a, "lp_len", INT2OID, -1, 0);
> +    TupleDescInitEntry(tupdesc, ++a, "attnum", INT4OID, -1, 0);
> +    TupleDescInitEntry(tupdesc, ++a, "chunk", INT4OID, -1, 0);
> +    TupleDescInitEntry(tupdesc, ++a, "msg", TEXTOID, -1, 0);
> +    Assert(a == maxattr);
> +
> +    return BlessTupleDesc(tupdesc);
> +}


> +/*
> + * heapcheck_relation
> + *
> + *   Scan and report corruption in heap pages or in associated toast relation.
> + */
> +Datum
> +heapcheck_relation(PG_FUNCTION_ARGS)
> +{
> +    FuncCallContext *funcctx;
> +    CheckRelCtx *ctx;
> +
> +    if (SRF_IS_FIRSTCALL())
> +    {

I think it'd be good to have a version that just returned a boolean. For
one, in many cases that's all we care about when scripting things. But
also, on a large relation, there could be a lot of errors.
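
E.g. a thin boolean wrapper over the same checks, roughly (sketch only; it leans on the patch's check_relation()
returning a list of corruption records):

    #include "postgres.h"
    #include "fmgr.h"
    #include "nodes/pg_list.h"

    extern List *check_relation(Oid relid);     /* from the patch */

    PG_FUNCTION_INFO_V1(heapcheck_relation_ok);

    /* Returns true iff no corruption was found; handy for scripting. */
    Datum
    heapcheck_relation_ok(PG_FUNCTION_ARGS)
    {
        Oid         relid = PG_GETARG_OID(0);

        PG_RETURN_BOOL(check_relation(relid) == NIL);
    }

A better version would of course bail out at the first problem instead of building the whole list just to test it.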


> +        Oid            relid = PG_GETARG_OID(0);
> +        MemoryContext oldcontext;
> +
> +        /*
> +         * Scan the entire relation, building up a list of corruption found in
> +         * ctx->corruption, for returning later.  The scan must be performed
> +         * in a memory context that will survive until after all rows are
> +         * returned.
> +         */
> +        funcctx = SRF_FIRSTCALL_INIT();
> +        oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
> +        funcctx->tuple_desc = heapcheck_relation_tupdesc();
> +        ctx = (CheckRelCtx *) palloc0(sizeof(CheckRelCtx));
> +        ctx->corruption = check_relation(relid);
> +        ctx->idx = 0;            /* start the iterator at the beginning */
> +        funcctx->user_fctx = (void *) ctx;
> +        MemoryContextSwitchTo(oldcontext);

Hm. This builds up all the errors in memory. Is that a good idea? I mean
for a large relation having one returned value for each tuple could be a
heck of a lot of data.

I think it'd be better to use the spilling SRF protocol here.  It's not
like you're benefitting from deferring the tuple construction to the
return currently.
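
Concretely, I mean the usual materialize-mode pattern, roughly like this (error handling and the allowedModes checks
omitted; just a sketch of the shape, not the exact code):

    #include "postgres.h"
    #include "funcapi.h"
    #include "miscadmin.h"
    #include "utils/tuplestore.h"

    extern TupleDesc heapcheck_relation_tupdesc(void);  /* from the patch */

    Datum
    heapcheck_relation(PG_FUNCTION_ARGS)
    {
        ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
        Tuplestorestate *tupstore;
        TupleDesc   tupdesc;
        MemoryContext oldcontext;

        oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
        tupdesc = heapcheck_relation_tupdesc();
        tupstore = tuplestore_begin_heap(true, false, work_mem);
        rsinfo->returnMode = SFRM_Materialize;
        rsinfo->setResult = tupstore;
        rsinfo->setDesc = tupdesc;
        MemoryContextSwitchTo(oldcontext);

        /*
         * Then, while scanning, each corruption record is emitted as it is
         * found, spilling to disk if needed:
         *
         *     tuplestore_putvalues(tupstore, tupdesc, values, nulls);
         */

        return (Datum) 0;
    }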


> +/*
> + * beginRelBlockIteration
> + *
> + *   For the given heap relation being checked, as recorded in ctx, sets up
> + *   variables for iterating over the heap's pages.
> + *
> + *   The caller should have already opened the heap relation, ctx->rel
> + */
> +void
> +beginRelBlockIteration(HeapCheckContext * ctx)
> +{
> +    ctx->nblocks = RelationGetNumberOfBlocks(ctx->rel);
> +    ctx->blkno = InvalidBlockNumber;
> +    ctx->bstrategy = GetAccessStrategy(BAS_BULKREAD);
> +    ctx->buffer = InvalidBuffer;
> +    ctx->page = NULL;
> +}
> +
> +/*
> + * endRelBlockIteration
> + *
> + *   Releases resources that were reserved by either beginRelBlockIteration or
> + *   relBlockIteration_next.
> + */
> +void
> +endRelBlockIteration(HeapCheckContext * ctx)
> +{
> +    /*
> +     * Clean up.  If the caller iterated to the end, the final call to
> +     * relBlockIteration_next will already have released the buffer, but if
> +     * the caller is bailing out early, we have to release it ourselves.
> +     */
> +    if (InvalidBuffer != ctx->buffer)
> +        UnlockReleaseBuffer(ctx->buffer);
> +}

These seem mighty granular and generically named to me.


> + * pageTupleIteration_next
> + *
> + *   Advances the state tracked in ctx to the next tuple on the page.
> + *
> + *   Caller should have already set up the iteration via
> + *   beginPageTupleIteration, and should stop calling when this function
> + *   returns false.
> + */
> +bool
> +pageTupleIteration_next(HeapCheckContext * ctx)

I don't think this is a naming scheme we use anywhere in postgres. I
don't think it's a good idea to add yet more of those.


> +{
> +    /*
> +     * Iterate to the next interesting line pointer, if any. Unused, dead and
> +     * redirect line pointers are of no interest.
> +     */
> +    do
> +    {
> +        ctx->offnum = OffsetNumberNext(ctx->offnum);
> +        if (ctx->offnum > ctx->maxoff)
> +            return false;
> +        ctx->itemid = PageGetItemId(ctx->page, ctx->offnum);
> +    } while (!ItemIdIsUsed(ctx->itemid) ||
> +             ItemIdIsDead(ctx->itemid) ||
> +             ItemIdIsRedirected(ctx->itemid));

This is an odd loop. Part of the test is in the body, part in the
loop header.
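
For instance (a sketch of the same logic, just rearranged so the advance and the filter read in one place):

    for (;;)
    {
        ctx->offnum = OffsetNumberNext(ctx->offnum);
        if (ctx->offnum > ctx->maxoff)
            return false;
        ctx->itemid = PageGetItemId(ctx->page, ctx->offnum);

        /* Unused, dead and redirect line pointers are of no interest. */
        if (ItemIdIsUsed(ctx->itemid) &&
            !ItemIdIsDead(ctx->itemid) &&
            !ItemIdIsRedirected(ctx->itemid))
            break;
    }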


> +/*
> + * Given a TransactionId, attempt to interpret it as a valid
> + * FullTransactionId, neither in the future nor overlong in
> + * the past.  Stores the inferred FullTransactionId in *fxid.
> + *
> + * Returns whether the xid is newer than the oldest clog xid.
> + */
> +bool
> +TransactionIdStillValid(TransactionId xid, FullTransactionId *fxid)

I don't at all like the naming of this function. This isn't a reliable
check. As before, it obviously also should be static.


> +{
> +    FullTransactionId fnow;
> +    uint32        epoch;
> +
> +    /* Initialize fxid; we'll overwrite this later if needed */
> +    *fxid = FullTransactionIdFromEpochAndXid(0, xid);

> +    /* Special xids can quickly be turned into invalid fxids */
> +    if (!TransactionIdIsValid(xid))
> +        return false;
> +    if (!TransactionIdIsNormal(xid))
> +        return true;
> +
> +    /*
> +     * Charitably infer the full transaction id as being within one epoch ago
> +     */
> +    fnow = ReadNextFullTransactionId();
> +    epoch = EpochFromFullTransactionId(fnow);
> +    *fxid = FullTransactionIdFromEpochAndXid(epoch, xid);

So now you're overwriting the fxid value from above unconditionally?


> +    if (!FullTransactionIdPrecedes(*fxid, fnow))
> +        *fxid = FullTransactionIdFromEpochAndXid(epoch - 1, xid);


I think it'd be better to do the conversion the following way:

    *fxid = FullTransactionIdFromU64(U64FromFullTransactionId(fnow)
                                    + (int32) (XidFromFullTransactionId(fnow) - xid));


> +    if (!FullTransactionIdPrecedes(*fxid, fnow))
> +        return false;
> +    /* The oldestClogXid is protected by CLogTruncationLock */
> +    Assert(LWLockHeldByMe(CLogTruncationLock));
> +    if (TransactionIdPrecedes(xid, ShmemVariableCache->oldestClogXid))
> +        return false;
> +    return true;
> +}

Why is this testing oldestClogXid instead of oldestXid?


> +/*
> + * HeapTupleIsVisible
> + *
> + *    Determine whether tuples are visible for heapcheck.  Similar to
> + *  HeapTupleSatisfiesVacuum, but with critical differences.
> + *
> + *  1) Does not touch hint bits.  It seems imprudent to write hint bits
> + *     to a table during a corruption check.
> + *  2) Gracefully handles xids that are too old by calling
> + *     TransactionIdStillValid before TransactionLogFetch, thus avoiding
> + *     a backend abort.

I think it'd be better to protect against this by avoiding checks for
xids that are older than relfrozenxid. And ones that are newer than
ReadNextTransactionId().  But all of those cases should be errors
anyway, so it doesn't seem like that should be handled within the
visibility routine.
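
In other words, classify out-of-range xids as corruption before ever consulting clog, roughly (sketch only; how
next_xid is obtained and cached is a separate question):

    /*
     * Sketch: a normal xid is only worth a visibility lookup if it falls in
     * [relfrozenxid, next_xid); anything else is reported as corruption by
     * the caller.
     */
    static bool
    xid_in_expected_range(TransactionId xid, TransactionId relfrozenxid,
                          TransactionId next_xid)
    {
        if (!TransactionIdIsNormal(xid))
            return true;        /* special xids are handled separately */
        if (TransactionIdPrecedes(xid, relfrozenxid))
            return false;       /* too old */
        if (!TransactionIdPrecedes(xid, next_xid))
            return false;       /* in the future */
        return true;
    }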


> + *  3) Only makes a boolean determination of whether heapcheck should
> + *     see the tuple, rather than doing extra work for vacuum-related
> + *     categorization.
> + */
> +bool
> +HeapTupleIsVisible(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
> +{

> +    FullTransactionId fxmin,
> +                fxmax;
> +    uint16        infomask = tuphdr->t_infomask;
> +    TransactionId xmin = HeapTupleHeaderGetXmin(tuphdr);
> +
> +    if (!HeapTupleHeaderXminCommitted(tuphdr))
> +    {

Hm. I wonder if it'd be good to crosscheck the xid committed hint bits
with clog?


> +        else if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuphdr)))
> +        {
> +            LWLockRelease(CLogTruncationLock);
> +            return false;        /* HEAPTUPLE_DEAD */
> +        }

Note that this actually can error out, if xmin is a subtransaction xid,
because pg_subtrans is truncated a lot more aggressively than anything
else. I think you'd need to filter against subtransactions older than
RecentXmin before here, and treat that as an error.


> +    if (!(infomask & HEAP_XMAX_INVALID) && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
> +    {
> +        if (infomask & HEAP_XMAX_IS_MULTI)
> +        {
> +            TransactionId xmax = HeapTupleGetUpdateXid(tuphdr);
> +
> +            /* not LOCKED_ONLY, so it has to have an xmax */
> +            if (!TransactionIdIsValid(xmax))
> +            {
> +                record_corruption(ctx, _("heap tuple with XMAX_IS_MULTI is "
> +                                         "neither LOCKED_ONLY nor has a "
> +                                         "valid xmax"));
> +                return false;
> +            }

I think it's bad to have code like this in a routine that's named like a
generic visibility check routine.


> +            if (TransactionIdIsInProgress(xmax))
> +                return false;    /* HEAPTUPLE_DELETE_IN_PROGRESS */
> +
> +            LWLockAcquire(CLogTruncationLock, LW_SHARED);
> +            if (!TransactionIdStillValid(xmax, &fxmax))
> +            {
> +                LWLockRelease(CLogTruncationLock);
> +                record_corruption(ctx, psprintf("tuple xmax = %u (interpreted "
> +                                                "as " UINT64_FORMAT
> +                                                ") not or no longer valid",
> +                                                xmax, fxmax.value));
> +                return false;
> +            }
> +            else if (TransactionIdDidCommit(xmax))
> +            {
> +                LWLockRelease(CLogTruncationLock);
> +                return false;    /* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */
> +            }
> +            LWLockRelease(CLogTruncationLock);
> +            /* Ok, the tuple is live */

I don't think random interspersed uses of CLogTruncationLock are a good
idea. If you move to only checking visibility after tuple fits into
[relfrozenxid, nextXid), then you don't need to take any locks here, as
long as a lock against vacuum is taken (which I think this should do
anyway).


> +/*
> + * check_tuple
> + *
> + *   Checks the current tuple as tracked in ctx for corruption.  Records any
> + *   corruption found in ctx->corruption.
> + *
> + *   The caller should have iterated to a tuple via pageTupleIteration_next.
> + */
> +void
> +check_tuple(HeapCheckContext * ctx)
> +{
> +    bool        fatal = false;

Wait, aren't some checks here duplicate with ones in
HeapTupleIsVisible()?


> +    /* Check relminmxid against mxid, if any */
> +    if (ctx->infomask & HEAP_XMAX_IS_MULTI &&
> +        MultiXactIdPrecedes(ctx->xmax, ctx->relminmxid))
> +    {
> +        record_corruption(ctx, psprintf("tuple xmax = %u precedes relation "
> +                                        "relminmxid = %u",
> +                                        ctx->xmax, ctx->relminmxid));
> +    }

It's pretty weird that the routines here access xmin/xmax/... via
HeapCheckContext, but HeapTupleIsVisible() doesn't.


> +    /* Check xmin against relfrozenxid */
> +    if (TransactionIdIsNormal(ctx->relfrozenxid) &&
> +        TransactionIdIsNormal(ctx->xmin) &&
> +        TransactionIdPrecedes(ctx->xmin, ctx->relfrozenxid))
> +    {
> +        record_corruption(ctx, psprintf("tuple xmin = %u precedes relation "
> +                                        "relfrozenxid = %u",
> +                                        ctx->xmin, ctx->relfrozenxid));
> +    }
> +
> +    /* Check xmax against relfrozenxid */
> +    if (TransactionIdIsNormal(ctx->relfrozenxid) &&
> +        TransactionIdIsNormal(ctx->xmax) &&
> +        TransactionIdPrecedes(ctx->xmax, ctx->relfrozenxid))
> +    {
> +        record_corruption(ctx, psprintf("tuple xmax = %u precedes relation "
> +                                        "relfrozenxid = %u",
> +                                        ctx->xmax, ctx->relfrozenxid));
> +    }

these all should be fatal. You definitely cannot just continue
afterwards given the justification below:


> +    /*
> +     * Iterate over the attributes looking for broken toast values. This
> +     * roughly follows the logic of heap_deform_tuple, except that it doesn't
> +     * bother building up isnull[] and values[] arrays, since nobody wants
> +     * them, and it unrolls anything that might trip over an Assert when
> +     * processing corrupt data.
> +     */
> +    beginTupleAttributeIteration(ctx);
> +    while (tupleAttributeIteration_next(ctx) &&
> +           check_tuple_attribute(ctx))
> +        ;
> +    endTupleAttributeIteration(ctx);
> +}

I really don't find these helpers helpful.


> +/*
> + * check_relation
> + *
> + *   Checks the relation given by relid for corruption, returning a list of all
> + *   it finds.
> + *
> + *   The caller should set up the memory context as desired before calling.
> + *   The returned list belongs to the caller.
> + */
> +List *
> +check_relation(Oid relid)
> +{
> +    HeapCheckContext ctx;
> +
> +    memset(&ctx, 0, sizeof(HeapCheckContext));
> +
> +    /* Open the relation */
> +    ctx.relid = relid;
> +    ctx.corruption = NIL;
> +    ctx.rel = relation_open(relid, AccessShareLock);

I think you need to protect at least against concurrent schema changes
given some of your checks. But I think it'd be better to also conflict
with vacuum here.


> +    check_relation_relkind(ctx.rel);

I think you also need to ensure that the table is actually using heap
AM, not another tableam. Oh - you're doing that inside the check. But
that's confusing, because that's not 'relkind'.


> +    ctx.relDesc = RelationGetDescr(ctx.rel);
> +    ctx.rel_natts = RelationGetDescr(ctx.rel)->natts;
> +    ctx.relfrozenxid = ctx.rel->rd_rel->relfrozenxid;
> +    ctx.relminmxid = ctx.rel->rd_rel->relminmxid;

three naming schemes in three lines...



> +    /* check all blocks of the relation */
> +    beginRelBlockIteration(&ctx);
> +    while (relBlockIteration_next(&ctx))
> +    {
> +        /* Perform tuple checks */
> +        beginPageTupleIteration(&ctx);
> +        while (pageTupleIteration_next(&ctx))
> +            check_tuple(&ctx);
> +        endPageTupleIteration(&ctx);
> +    }
> +    endRelBlockIteration(&ctx);

I again do not find this helper stuff helpful.


> +    /* Close the associated toast table and indexes, if any. */
> +    if (ctx.has_toastrel)
> +    {
> +        toast_close_indexes(ctx.toast_indexes, ctx.num_toast_indexes,
> +                            AccessShareLock);
> +        table_close(ctx.toastrel, AccessShareLock);
> +    }
> +
> +    /* Close the main relation */
> +    relation_close(ctx.rel, AccessShareLock);

Why the closing here?



> +# This regression test demonstrates that the heapcheck_relation() function
> +# supplied with this contrib module correctly identifies specific kinds of
> +# corruption within pages.  To test this, we need a mechanism to create corrupt
> +# pages with predictable, repeatable corruption.  The postgres backend cannot be
> +# expected to help us with this, as its design is not consistent with the goal
> +# of intentionally corrupting pages.
> +#
> +# Instead, we create a table to corrupt, and with careful consideration of how
> +# postgresql lays out heap pages, we seek to offsets within the page and
> +# overwrite deliberately chosen bytes with specific values calculated to
> +# corrupt the page in expected ways.  We then verify that heapcheck_relation
> +# reports the corruption, and that it runs without crashing.  Note that the
> +# backend cannot simply be started to run queries against the corrupt table, as
> +# the backend will crash, at least for some of the corruption types we
> +# generate.
> +#
> +# Autovacuum potentially touching the table in the background makes the exact
> +# behavior of this test harder to reason about.  We turn it off to keep things
> +# simpler.  We use a "belt and suspenders" approach, turning it off for the
> +# system generally in postgresql.conf, and turning it off specifically for the
> +# test table.
> +#
> +# This test depends on the table being written to the heap file exactly as we
> +# expect it to be, so we take care to arrange the columns of the table, and
> +# insert rows of the table, that give predictable sizes and locations within
> +# the table page.

I have a hard time believing this is going to be really
reliable. E.g. the alignment requirements will vary between platforms,
leading to different layouts. In particular, MAXALIGN differs between
platforms.

Also, it's supported to compile postgres with a different pagesize.


Greetings,

Andres Freund



Re: new heapcheck contrib module

From: Robert Haas

[ retrying from the email address I intended to use ]

On Mon, Apr 20, 2020 at 3:42 PM Andres Freund <andres@anarazel.de> wrote:
> I don't think random interspersed uses of CLogTruncationLock are a good
> idea. If you move to only checking visibility after tuple fits into
> [relfrozenxid, nextXid), then you don't need to take any locks here, as
> long as a lock against vacuum is taken (which I think this should do
> anyway).

I think it would be *really* good to avoid ShareUpdateExclusiveLock
here. Running with only AccessShareLock would be a big advantage. I
agree that any use of CLogTruncationLock should not be "random", but I
don't see why the same method we use to make txid_status() safe to
expose to SQL shouldn't also be used here.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From: Andres Freund

Hi,

On 2020-04-20 15:59:49 -0400, Robert Haas wrote:
> On Mon, Apr 20, 2020 at 3:42 PM Andres Freund <andres@anarazel.de> wrote:
> > I don't think random interspersed uses of CLogTruncationLock are a good
> > idea. If you move to only checking visibility after tuple fits into
> > [relfrozenxid, nextXid), then you don't need to take any locks here, as
> > long as a lock against vacuum is taken (which I think this should do
> > anyway).
> 
> I think it would be *really* good to avoid ShareUpdateExclusiveLock
> here. Running with only AccessShareLock would be a big advantage. I
> agree that any use of CLogTruncationLock should not be "random", but I
> don't see why the same method we use to make txid_status() safe to
> expose to SQL shouldn't also be used here.

A few billion CLogTruncationLock acquisitions in short order will likely
have at least as big an impact as ShareUpdateExclusiveLock held for the
duration of the check. That's not really a relevant concern for
txid_status().  Per-tuple lock acquisitions aren't great.

I think it might be doable to not need either. E.g. we could set the
checking backend's xmin to relfrozenxid, and set something like
PROC_IN_VACUUM. That should, I think, prevent clog from being truncated
in a problematic way (clog truncations look at PROC_IN_VACUUM backends),
while not blocking vacuum.

The similar concern for ReadNewTransactionId() can probably more easily
be addressed, by only calling ReadNewTransactionId() when encountering
an xid that's newer than the last value read.


I think it'd be good to set PROC_IN_VACUUM (or maybe a separate version
of it) while checking anyway. Reading the full relation can take quite a
while, and we shouldn't prevent hot pruning while doing so.


There's some things we'd need to figure out to be able to use
PROC_IN_VACUUM, as that's really only safe in some
circumstances. Possibly it'd be easiest to address that if we'd make the
check a procedure...

Greetings,

Andres Freund



Re: new heapcheck contrib module

From: Peter Geoghegan

On Mon, Apr 20, 2020 at 12:42 PM Andres Freund <andres@anarazel.de> wrote:
> This is something we really really really need. I'm very excited to see
> progress!

+1

My experience with amcheck was that the requirement that we document
and verify pretty much every invariant (the details of which differ
slightly based on the B-Tree version in use) has had intangible
benefits. It helped me come up with a simpler, better design in the
first place. Also, many of the benchmarks that I perform get to be a
stress-test of the feature itself. It saves quite a lot of testing
work in the long run.

> I wonder if a mode where heapcheck optionally would only check
> non-frozen (perhaps also non-all-visible) regions of a table would be a
> good idea? Would make it a lot more viable to run this regularly on
> bigger databases. Even if there's a window to not check some data
> (because it's frozen before the next heapcheck run).

That's a great idea. It could also make it practical to use the
rootdescend verification option to verify indexes selectively -- if
you don't have too many blocks to check on average, the overhead is
tolerable. This is the kind of thing that naturally belongs in the
higher level interface that I sketched already.

> We also had a *lot* of bugs that we'd have found a lot earlier, possibly
> even during development, if we had a way to easily perform these checks.

I can think of a case where it was quite unclear what the invariants
for the heap even were, at least temporarily. And this was in the
context of fixing a bug that was really quite nasty. Formally defining
the invariants in one place, and taking a position on exactly what
correct looks like seems like a very valuable exercise. Even without
the tool catching a single bug.

> I have a hard time believing this is going to be really
> reliable. E.g. the alignment requirements will vary between platforms,
> leading to different layouts. In particular, MAXALIGN differs between
> platforms.

Over on another thread, I suggested that Mark might want to have a
corruption test framework that exposes some of the bufpage.c routines.
The idea is that you can destructively manipulate a page using the
logical page interface. Something that works one level below the
access method, but one level above the raw page image. It probably
wouldn't test everything that Mark wants to test, but it would test
some things in a way that seems maintainable to me.
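
For example, a test helper could corrupt a line pointer through the page API instead of seeking to hard-coded byte
offsets, something like (purely illustrative):

    #include "postgres.h"
    #include "storage/bufpage.h"
    #include "storage/itemid.h"

    /* Clobber a line pointer's length field with a deliberately bogus value. */
    static void
    corrupt_lp_len(Page page, OffsetNumber offnum, uint16 bogus_len)
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        itemid->lp_len = bogus_len;
    }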

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From: Robert Haas

On Mon, Apr 20, 2020 at 4:30 PM Andres Freund <andres@anarazel.de> wrote:
> A few billion CLogTruncationLock acquisitions in short order will likely
> have at least as big an impact as ShareUpdateExclusiveLock held for the
> duration of the check. That's not really a relevant concern for
> txid_status().  Per-tuple lock acquisitions aren't great.

Yeah, that's true. Doing it for every tuple is going to be too much, I
think. I was hoping we could avoid that.

> I think it might be doable to not need either. E.g. we could set the
> checking backend's xmin to relfrozenxid, and set something like
> PROC_IN_VACUUM. That should, I think, prevent clog from being truncated
> in a problematic way (clog truncations look at PROC_IN_VACUUM backends),
> while not blocking vacuum.

Hmm, OK, I don't know if that would be OK or not.

> The similar concern for ReadNewTransactionId() can probably more easily
> be addressed, by only calling ReadNewTransactionId() when encountering
> an xid that's newer than the last value read.

Yeah, if we can cache some things to avoid repetitive calls, that would be good.

> I think it'd be good to set PROC_IN_VACUUM (or maybe a separate version
> of it) while checking anyway. Reading the full relation can take quite a
> while, and we shouldn't prevent hot pruning while doing so.
>
> There's some things we'd need to figure out to be able to use
> PROC_IN_VACUUM, as that's really only safe in some
> circumstances. Possibly it'd be easiest to address that if we'd make the
> check a procedure...

I think we sure want to set things up so that we do this check without
holding a snapshot, if we can. Not sure exactly how to get there.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From: Peter Geoghegan

On Mon, Apr 20, 2020 at 12:40 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Ok, I'll work in that direction and repost when I have something along those lines.

Great, thanks!

It also occurs to me that the B-Tree checks that amcheck already has
have one remaining blindspot: While the heapallindexed verification
option has the ability to detect the absence of an index tuple that
the dummy CREATE INDEX that we perform under the hood says should be
in the index, it cannot do the opposite: It cannot detect the presence
of a malformed tuple that shouldn't be there at all, unless the index
tuple itself is corrupt. That could miss an inconsistent page image
when a few tuples have been VACUUMed away, but still appear in the
index.

In order to do that, we'd have to have something a bit like the
validate_index() heap scan that CREATE INDEX CONCURRENTLY uses. We'd
have to get a list of heap TIDs that any index tuple might be pointing
to, and then make sure that there were no TIDs in the index that were
not in that list -- tuples that were pointing to nothing in the heap
at all. This could use the index_bulk_delete() interface. This is the
kind of verification option that I might work on for debugging
purposes, but not the kind of thing I could really recommend to
ordinary users outside of exceptional cases. This is the kind of thing
that argues for more or less providing all of the verification
functionality we have through both high level and low level
interfaces. This isn't likely to be all that valuable most of the
time, and users shouldn't have to figure that out for themselves the
hard way. (BTW, I think this could be implemented in an
index-AM-agnostic way, so perhaps you can consider adding it
too, if you have time.)
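
As a sketch, the index_bulk_delete() callback could be used purely for reporting, along these lines (the TID-set
lookup and the state struct here are hypothetical):

    #include "postgres.h"
    #include "storage/itemptr.h"

    /* hypothetical state built by a prior scan that collected live heap TIDs */
    typedef struct IndexVerifyState
    {
        struct TidSet *heap_tids;   /* hypothetical set of heap TIDs */
        int64       dangling_tids;  /* index entries pointing to nothing */
    } IndexVerifyState;

    extern bool tid_in_set(struct TidSet *set, ItemPointer tid);    /* hypothetical */

    static bool
    verify_index_tid_callback(ItemPointer itemptr, void *state)
    {
        IndexVerifyState *vstate = (IndexVerifyState *) state;

        if (!tid_in_set(vstate->heap_tids, itemptr))
            vstate->dangling_tids++;

        return false;           /* report only; never ask the AM to delete */
    }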

One last thing for now: take a look at amcheck's
bt_tuple_present_callback() function. It has comments about HOT chain
corruption that you may find interesting. Note that this check played
a role in the "freeze the dead" corruption bug [1] -- it detected that
our initial fix for that was broken. It seems like it would be a good
idea to go back through the reproducers we've seen for some of the
more memorable corruption bugs, and actually make sure that your tool
detects them where that isn't clear. History doesn't repeat itself,
but it often rhymes.

[1] https://postgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com
-- 
Peter Geoghegan



Re: new heapcheck contrib module

From: Peter Geoghegan

On Mon, Apr 20, 2020 at 1:40 PM Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Apr 20, 2020 at 4:30 PM Andres Freund <andres@anarazel.de> wrote:
> > A few billion CLogTruncationLock acquisitions in short order will likely
> > have at least as big an impact as ShareUpdateExclusiveLock held for the
> > duration of the check. That's not really a relevant concern for
> > txid_status().  Per-tuple lock acquisitions aren't great.
>
> Yeah, that's true. Doing it for every tuple is going to be too much, I
> think. I was hoping we could avoid that.

What about the visibility map? It would be nice if pg_visibility was
merged into amcheck, since it mostly provides integrity checking for
the visibility map. Maybe we could just merge the functions that
perform verification, and leave other functions (like
pg_truncate_visibility_map()) where they are. We could keep the
current interface for functions like pg_check_visible(), but also
allow the same verification to occur in passing, as part of a higher
level check.

It wouldn't be so bad if pg_visibility was an expert-only tool. But
ISTM that the verification performed by code like
collect_corrupt_items() could easily take place at the same time as
the new checks that Mark proposes. Possibly only some of the time. It
can work in a totally additive way. (Though like Andres I don't really
like the current "helper" functions used to iterate through a heap
relation; they seem like they'd make this harder.)

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From: Mark Dilger

> On Apr 20, 2020, at 12:42 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-04-20 10:59:28 -0700, Mark Dilger wrote:
>> I have been talking with Robert about table corruption that occurs
>> from time to time. The page checksum feature seems sufficient to
>> detect most random corruption problems, but it can't detect "logical"
>> corruption, where the page is valid but inconsistent with the rest of
>> the database cluster. This can happen due to faulty or ill-conceived
>> backup and restore tools, or bad storage, or user error, or bugs in
>> the server itself. (Also, not everyone enables checksums.)
>
> This is something we really really really need. I'm very excited to see
> progress!

Thanks for the review!

>> From 2a1bc0bb9fa94bd929adc1a408900cb925ebcdd5 Mon Sep 17 00:00:00 2001
>> From: Mark Dilger <mark.dilger@enterprisedb.com>
>> Date: Mon, 20 Apr 2020 08:05:58 -0700
>> Subject: [PATCH v2] Adding heapcheck contrib module.
>>
>> The heapcheck module introduces a new function for checking a heap
>> relation and associated toast relation, if any, for corruption.
>
> Why not add it to amcheck?

That seems to be the general consensus.  The functionality has been moved there, renamed as "verify_heapam", as that
seems more in line with the "verify_nbtree" name already present in that module.  The docs have also been moved there,
although not very gracefully.  It seems premature to polish the documentation given that the interface will likely
change at least one more time, to incorporate more of Peter's suggestions.  There are still design differences between
the two implementations that need to be harmonized.  The verify_heapam function returns rows detailing the corruption
found, which is inconsistent with how verify_nbtree does things.

> I wonder if a mode where heapcheck optionally would only check
> non-frozen (perhaps also non-all-visible) regions of a table would be a
> good idea? Would make it a lot more viable to run this regularly on
> bigger databases. Even if there's a window to not check some data
> (because it's frozen before the next heapcheck run).

Perhaps we should come back to that.  Version 3 of this patch addresses concerns about the v2 patch without adding too
many new features.

>> The attached module provides the means to scan a relation and sanity
>> check it. Currently, it checks xmin and xmax values against
>> relfrozenxid and relminmxid, and also validates TOAST pointers. If
>> people like this, it could be expanded to perform additional checks.
>
>
>> The postgres backend already defends against certain forms of
>> corruption, by checking the page header of each page before allowing
>> it into the page cache, and by checking the page checksum, if enabled.
>> Experience shows that broken or ill-conceived backup and restore
>> mechanisms can result in a page, or an entire file, being overwritten
>> with an earlier version of itself, restored from backup.  Pages thus
>> overwritten will appear to have valid page headers and checksums,
>> while potentially containing xmin, xmax, and toast pointers that are
>> invalid.
>
> We also had a *lot* of bugs that we'd have found a lot earlier, possibly
> even during development, if we had a way to easily perform these checks.

I certainly hope this is useful for testing.

>> contrib/heapcheck introduces a function, heapcheck_relation, that
>> takes a regclass argument, scans the given heap relation, and returns
>> rows containing information about corruption found within the table.
>> The main focus of the scan is to find invalid xmin, xmax, and toast
>> pointer values.  It also checks for structural corruption within the
>> page (such as invalid t_hoff values) that could lead to the backend
>> aborting should the function blindly trust the data as it finds it.
>
>
>> +typedef struct CorruptionInfo
>> +{
>> +    BlockNumber blkno;
>> +    OffsetNumber offnum;
>> +    int16        lp_off;
>> +    int16        lp_flags;
>> +    int16        lp_len;
>> +    int32        attnum;
>> +    int32        chunk;
>> +    char       *msg;
>> +}            CorruptionInfo;
>
> Adding a short comment explaining what this is for would be good.

This struct has been removed.

>> +/* Internal implementation */
>> +void        record_corruption(HeapCheckContext * ctx, char *msg);
>> +TupleDesc    heapcheck_relation_tupdesc(void);
>> +
>> +void        beginRelBlockIteration(HeapCheckContext * ctx);
>> +bool        relBlockIteration_next(HeapCheckContext * ctx);
>> +void        endRelBlockIteration(HeapCheckContext * ctx);
>> +
>> +void        beginPageTupleIteration(HeapCheckContext * ctx);
>> +bool        pageTupleIteration_next(HeapCheckContext * ctx);
>> +void        endPageTupleIteration(HeapCheckContext * ctx);
>> +
>> +void        beginTupleAttributeIteration(HeapCheckContext * ctx);
>> +bool        tupleAttributeIteration_next(HeapCheckContext * ctx);
>> +void        endTupleAttributeIteration(HeapCheckContext * ctx);
>> +
>> +void        beginToastTupleIteration(HeapCheckContext * ctx,
>> +                                     struct varatt_external *toast_pointer);
>> +void        endToastTupleIteration(HeapCheckContext * ctx);
>> +bool        toastTupleIteration_next(HeapCheckContext * ctx);
>> +
>> +bool        TransactionIdStillValid(TransactionId xid, FullTransactionId *fxid);
>> +bool        HeapTupleIsVisible(HeapTupleHeader tuphdr, HeapCheckContext * ctx);
>> +void        check_toast_tuple(HeapCheckContext * ctx);
>> +bool        check_tuple_attribute(HeapCheckContext * ctx);
>> +void        check_tuple(HeapCheckContext * ctx);
>> +
>> +List       *check_relation(Oid relid);
>> +void        check_relation_relkind(Relation rel);
>
> Why aren't these static?

They are now, except for the iterator style functions, which are gone.

>> +/*
>> + * record_corruption
>> + *
>> + *   Record a message about corruption, including information
>> + *   about where in the relation the corruption was found.
>> + */
>> +void
>> +record_corruption(HeapCheckContext * ctx, char *msg)
>> +{
>
> Given that you went through the trouble of adding prototypes for all of
> these, I'd start with the most important functions, not the unimportant
> details.

Yeah, good idea.  The most important functions are now at the top.

>> +/*
>> + * Helper function to construct the TupleDesc needed by heapcheck_relation.
>> + */
>> +TupleDesc
>> +heapcheck_relation_tupdesc()
>
> Missing (void) (it's our style, even though you could theoretically not
> have it as long as you have a prototype).

That was unintentional, and is now fixed.

>> +{
>> +    TupleDesc    tupdesc;
>> +    AttrNumber    maxattr = 8;
>
> This 8 is in multiple places, I'd add a define for it.

Done.

>> +    AttrNumber    a = 0;
>> +
>> +    tupdesc = CreateTemplateTupleDesc(maxattr);
>> +    TupleDescInitEntry(tupdesc, ++a, "blkno", INT8OID, -1, 0);
>> +    TupleDescInitEntry(tupdesc, ++a, "offnum", INT4OID, -1, 0);
>> +    TupleDescInitEntry(tupdesc, ++a, "lp_off", INT2OID, -1, 0);
>> +    TupleDescInitEntry(tupdesc, ++a, "lp_flags", INT2OID, -1, 0);
>> +    TupleDescInitEntry(tupdesc, ++a, "lp_len", INT2OID, -1, 0);
>> +    TupleDescInitEntry(tupdesc, ++a, "attnum", INT4OID, -1, 0);
>> +    TupleDescInitEntry(tupdesc, ++a, "chunk", INT4OID, -1, 0);
>> +    TupleDescInitEntry(tupdesc, ++a, "msg", TEXTOID, -1, 0);
>> +    Assert(a == maxattr);
>> +
>> +    return BlessTupleDesc(tupdesc);
>> +}
>
>
>> +/*
>> + * heapcheck_relation
>> + *
>> + *   Scan and report corruption in heap pages or in associated toast relation.
>> + */
>> +Datum
>> +heapcheck_relation(PG_FUNCTION_ARGS)
>> +{
>> +    FuncCallContext *funcctx;
>> +    CheckRelCtx *ctx;
>> +
>> +    if (SRF_IS_FIRSTCALL())
>> +    {
>
> I think it'd be good to have a version that just returned a boolean. For
> one, in many cases that's all we care about when scripting things. But
> also, on a large relation, there could be a lot of errors.

There is now a second parameter to the function, "stop_on_error".  The function performs exactly the same checks, but
returns after the first page that contains corruption.

>> +        Oid            relid = PG_GETARG_OID(0);
>> +        MemoryContext oldcontext;
>> +
>> +        /*
>> +         * Scan the entire relation, building up a list of corruption found in
>> +         * ctx->corruption, for returning later.  The scan must be performed
>> +         * in a memory context that will survive until after all rows are
>> +         * returned.
>> +         */
>> +        funcctx = SRF_FIRSTCALL_INIT();
>> +        oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);
>> +        funcctx->tuple_desc = heapcheck_relation_tupdesc();
>> +        ctx = (CheckRelCtx *) palloc0(sizeof(CheckRelCtx));
>> +        ctx->corruption = check_relation(relid);
>> +        ctx->idx = 0;            /* start the iterator at the beginning */
>> +        funcctx->user_fctx = (void *) ctx;
>> +        MemoryContextSwitchTo(oldcontext);
>
> Hm. This builds up all the errors in memory. Is that a good idea? I mean
> for a large relation having one returned value for each tuple could be a
> heck of a lot of data.
>
> I think it'd be better to use the spilling SRF protocol here.  It's not
> like you're benefitting from deferring the tuple construction to the
> return currently.

Done.

>> +/*
>> + * beginRelBlockIteration
>> + *
>> + *   For the given heap relation being checked, as recorded in ctx, sets up
>> + *   variables for iterating over the heap's pages.
>> + *
>> + *   The caller should have already opened the heap relation, ctx->rel
>> + */
>> +void
>> +beginRelBlockIteration(HeapCheckContext * ctx)
>> +{
>> +    ctx->nblocks = RelationGetNumberOfBlocks(ctx->rel);
>> +    ctx->blkno = InvalidBlockNumber;
>> +    ctx->bstrategy = GetAccessStrategy(BAS_BULKREAD);
>> +    ctx->buffer = InvalidBuffer;
>> +    ctx->page = NULL;
>> +}
>> +
>> +/*
>> + * endRelBlockIteration
>> + *
>> + *   Releases resources that were reserved by either beginRelBlockIteration or
>> + *   relBlockIteration_next.
>> + */
>> +void
>> +endRelBlockIteration(HeapCheckContext * ctx)
>> +{
>> +    /*
>> +     * Clean up.  If the caller iterated to the end, the final call to
>> +     * relBlockIteration_next will already have released the buffer, but if
>> +     * the caller is bailing out early, we have to release it ourselves.
>> +     */
>> +    if (InvalidBuffer != ctx->buffer)
>> +        UnlockReleaseBuffer(ctx->buffer);
>> +}
>
> These seem mighty granular and generically named to me.

Removed.

>> + * pageTupleIteration_next
>> + *
>> + *   Advances the state tracked in ctx to the next tuple on the page.
>> + *
>> + *   Caller should have already set up the iteration via
>> + *   beginPageTupleIteration, and should stop calling when this function
>> + *   returns false.
>> + */
>> +bool
>> +pageTupleIteration_next(HeapCheckContext * ctx)
>
> I don't think this is a naming scheme we use anywhere in postgres. I
> don't think it's a good idea to add yet more of those.

Removed.

>> +{
>> +    /*
>> +     * Iterate to the next interesting line pointer, if any. Unused, dead and
>> +     * redirect line pointers are of no interest.
>> +     */
>> +    do
>> +    {
>> +        ctx->offnum = OffsetNumberNext(ctx->offnum);
>> +        if (ctx->offnum > ctx->maxoff)
>> +            return false;
>> +        ctx->itemid = PageGetItemId(ctx->page, ctx->offnum);
>> +    } while (!ItemIdIsUsed(ctx->itemid) ||
>> +             ItemIdIsDead(ctx->itemid) ||
>> +             ItemIdIsRedirected(ctx->itemid));
>
> This is an odd loop. Part of the test is in the body, part of in the
> loop header.

Refactored.

>> +/*
>> + * Given a TransactionId, attempt to interpret it as a valid
>> + * FullTransactionId, neither in the future nor overlong in
>> + * the past.  Stores the inferred FullTransactionId in *fxid.
>> + *
>> + * Returns whether the xid is newer than the oldest clog xid.
>> + */
>> +bool
>> +TransactionIdStillValid(TransactionId xid, FullTransactionId *fxid)
>
> I don't at all like the naming of this function. This isn't a reliable
> check. As before, it obviously also should be static.

Renamed and refactored.

>> +{
>> +    FullTransactionId fnow;
>> +    uint32        epoch;
>> +
>> +    /* Initialize fxid; we'll overwrite this later if needed */
>> +    *fxid = FullTransactionIdFromEpochAndXid(0, xid);
>
>> +    /* Special xids can quickly be turned into invalid fxids */
>> +    if (!TransactionIdIsValid(xid))
>> +        return false;
>> +    if (!TransactionIdIsNormal(xid))
>> +        return true;
>> +
>> +    /*
>> +     * Charitably infer the full transaction id as being within one epoch ago
>> +     */
>> +    fnow = ReadNextFullTransactionId();
>> +    epoch = EpochFromFullTransactionId(fnow);
>> +    *fxid = FullTransactionIdFromEpochAndXid(epoch, xid);
>
> So now you're overwriting the fxid value from above unconditionally?
>
>
>> +    if (!FullTransactionIdPrecedes(*fxid, fnow))
>> +        *fxid = FullTransactionIdFromEpochAndXid(epoch - 1, xid);
>
>
> I think it'd be better to do the conversion the following way:
>
>    *fxid = FullTransactionIdFromU64(U64FromFullTransactionId(fnow)
>                                    + (int32) (XidFromFullTransactionId(fnow) - xid));

This has been refactored to the point that these review comments cannot be directly replied to.

>> +    if (!FullTransactionIdPrecedes(*fxid, fnow))
>> +        return false;
>> +    /* The oldestClogXid is protected by CLogTruncationLock */
>> +    Assert(LWLockHeldByMe(CLogTruncationLock));
>> +    if (TransactionIdPrecedes(xid, ShmemVariableCache->oldestClogXid))
>> +        return false;
>> +    return true;
>> +}
>
> Why is this testing oldestClogXid instead of oldestXid?

References to clog have been refactored out of this module.

>> +/*
>> + * HeapTupleIsVisible
>> + *
>> + *    Determine whether tuples are visible for heapcheck.  Similar to
>> + *  HeapTupleSatisfiesVacuum, but with critical differences.
>> + *
>> + *  1) Does not touch hint bits.  It seems imprudent to write hint bits
>> + *     to a table during a corruption check.
>> + *  2) Gracefully handles xids that are too old by calling
>> + *     TransactionIdStillValid before TransactionLogFetch, thus avoiding
>> + *     a backend abort.
>
> I think it'd be better to protect against this by avoiding checks for
> xids that are older than relfrozenxid. And ones that are newer than
> ReadNextTransactionId().  But all of those cases should be errors
> anyway, so it doesn't seem like that should be handled within the
> visibility routine.

The new implementation caches a range of expected xids.  With the relation locked against concurrent vacuum runs, it
can trust that the old end of the range won't move during the course of the scan.  The newest end may move, but it only
has to check for that when it encounters a newer-than-expected xid, and it updates the cache with the new maximum.
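
Roughly, the refresh works like this (a sketch with hypothetical field names, not the exact v3 code):

    /*
     * Sketch: xids at or beyond the cached next-xid are rechecked against a
     * freshly read value before being treated as corrupt, since nextXid may
     * have advanced while the scan was running.
     */
    static bool
    xid_within_cached_range(HeapCheckContext *ctx, TransactionId xid)
    {
        if (TransactionIdPrecedes(xid, XidFromFullTransactionId(ctx->next_fxid)))
            return true;

        ctx->next_fxid = ReadNextFullTransactionId();
        return TransactionIdPrecedes(xid, XidFromFullTransactionId(ctx->next_fxid));
    }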

>
>> + *  3) Only makes a boolean determination of whether heapcheck should
>> + *     see the tuple, rather than doing extra work for vacuum-related
>> + *     categorization.
>> + */
>> +bool
>> +HeapTupleIsVisible(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
>> +{
>
>> +    FullTransactionId fxmin,
>> +                fxmax;
>> +    uint16        infomask = tuphdr->t_infomask;
>> +    TransactionId xmin = HeapTupleHeaderGetXmin(tuphdr);
>> +
>> +    if (!HeapTupleHeaderXminCommitted(tuphdr))
>> +    {
>
> Hm. I wonder if it'd be good to crosscheck the xid committed hint bits
> with clog?

This is not done in v3, as it no longer checks clog.

>> +        else if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuphdr)))
>> +        {
>> +            LWLockRelease(CLogTruncationLock);
>> +            return false;        /* HEAPTUPLE_DEAD */
>> +        }
>
> Note that this actually can error out, if xmin is a subtransaction xid,
> because pg_subtrans is truncated a lot more aggressively than anything
> else. I think you'd need to filter against subtransactions older than
> RecentXmin before here, and treat that as an error.

Calls to TransactionIdDidCommit are now preceded by checks that the xid argument is not too old.

>> +    if (!(infomask & HEAP_XMAX_INVALID) && !HEAP_XMAX_IS_LOCKED_ONLY(infomask))
>> +    {
>> +        if (infomask & HEAP_XMAX_IS_MULTI)
>> +        {
>> +            TransactionId xmax = HeapTupleGetUpdateXid(tuphdr);
>> +
>> +            /* not LOCKED_ONLY, so it has to have an xmax */
>> +            if (!TransactionIdIsValid(xmax))
>> +            {
>> +                record_corruption(ctx, _("heap tuple with XMAX_IS_MULTI is "
>> +                                         "neither LOCKED_ONLY nor has a "
>> +                                         "valid xmax"));
>> +                return false;
>> +            }
>
> I think it's bad to have code like this in a routine that's named like a
> generic visibility check routine.

Renamed.

>> +            if (TransactionIdIsInProgress(xmax))
>> +                return false;    /* HEAPTUPLE_DELETE_IN_PROGRESS */
>> +
>> +            LWLockAcquire(CLogTruncationLock, LW_SHARED);
>> +            if (!TransactionIdStillValid(xmax, &fxmax))
>> +            {
>> +                LWLockRelease(CLogTruncationLock);
>> +                record_corruption(ctx, psprintf("tuple xmax = %u (interpreted "
>> +                                                "as " UINT64_FORMAT
>> +                                                ") not or no longer valid",
>> +                                                xmax, fxmax.value));
>> +                return false;
>> +            }
>> +            else if (TransactionIdDidCommit(xmax))
>> +            {
>> +                LWLockRelease(CLogTruncationLock);
>> +                return false;    /* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */
>> +            }
>> +            LWLockRelease(CLogTruncationLock);
>> +            /* Ok, the tuple is live */
>
> I don't think random interspersed uses of CLogTruncationLock are a good
> idea. If you move to only checking visibility after tuple fits into
> [relfrozenxid, nextXid), then you don't need to take any locks here, as
> long as a lock against vacuum is taken (which I think this should do
> anyway).

Done.

>> +/*
>> + * check_tuple
>> + *
>> + *   Checks the current tuple as tracked in ctx for corruption.  Records any
>> + *   corruption found in ctx->corruption.
>> + *
>> + *   The caller should have iterated to a tuple via pageTupleIteration_next.
>> + */
>> +void
>> +check_tuple(HeapCheckContext * ctx)
>> +{
>> +    bool        fatal = false;
>
> Wait, aren't some checks here duplicate with ones in
> HeapTupleIsVisible()?

Yeah, there was some overlap.  That should be better now.

>> +    /* Check relminmxid against mxid, if any */
>> +    if (ctx->infomask & HEAP_XMAX_IS_MULTI &&
>> +        MultiXactIdPrecedes(ctx->xmax, ctx->relminmxid))
>> +    {
>> +        record_corruption(ctx, psprintf("tuple xmax = %u precedes relation "
>> +                                        "relminmxid = %u",
>> +                                        ctx->xmax, ctx->relminmxid));
>> +    }
>
> It's pretty weird that the routines here access xmin/xmax/... via
> HeapCheckContext, but HeapTupleIsVisible() doesn't.

Fair point.  HeapCheckContext no longer has fields for xmin/xmax after the refactoring.

>> +    /* Check xmin against relfrozenxid */
>> +    if (TransactionIdIsNormal(ctx->relfrozenxid) &&
>> +        TransactionIdIsNormal(ctx->xmin) &&
>> +        TransactionIdPrecedes(ctx->xmin, ctx->relfrozenxid))
>> +    {
>> +        record_corruption(ctx, psprintf("tuple xmin = %u precedes relation "
>> +                                        "relfrozenxid = %u",
>> +                                        ctx->xmin, ctx->relfrozenxid));
>> +    }
>> +
>> +    /* Check xmax against relfrozenxid */
>> +    if (TransactionIdIsNormal(ctx->relfrozenxid) &&
>> +        TransactionIdIsNormal(ctx->xmax) &&
>> +        TransactionIdPrecedes(ctx->xmax, ctx->relfrozenxid))
>> +    {
>> +        record_corruption(ctx, psprintf("tuple xmax = %u precedes relation "
>> +                                        "relfrozenxid = %u",
>> +                                        ctx->xmax, ctx->relfrozenxid));
>> +    }
>
> these all should be fatal. You definitely cannot just continue
> afterwards given the justification below:

They are now fatal.

>> +    /*
>> +     * Iterate over the attributes looking for broken toast values. This
>> +     * roughly follows the logic of heap_deform_tuple, except that it doesn't
>> +     * bother building up isnull[] and values[] arrays, since nobody wants
>> +     * them, and it unrolls anything that might trip over an Assert when
>> +     * processing corrupt data.
>> +     */
>> +    beginTupleAttributeIteration(ctx);
>> +    while (tupleAttributeIteration_next(ctx) &&
>> +           check_tuple_attribute(ctx))
>> +        ;
>> +    endTupleAttributeIteration(ctx);
>> +}
>
> I really don't find these helpers helpful.

Removed.

>> +/*
>> + * check_relation
>> + *
>> + *   Checks the relation given by relid for corruption, returning a list of all
>> + *   it finds.
>> + *
>> + *   The caller should set up the memory context as desired before calling.
>> + *   The returned list belongs to the caller.
>> + */
>> +List *
>> +check_relation(Oid relid)
>> +{
>> +    HeapCheckContext ctx;
>> +
>> +    memset(&ctx, 0, sizeof(HeapCheckContext));
>> +
>> +    /* Open the relation */
>> +    ctx.relid = relid;
>> +    ctx.corruption = NIL;
>> +    ctx.rel = relation_open(relid, AccessShareLock);
>
> I think you need to protect at least against concurrent schema changes
> given some of your checks. But I think it'd be better to also conflict
> with vacuum here.

The relation is now opened with ShareUpdateExclusiveLock.
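
That is, roughly:

	/*
	 * ShareUpdateExclusiveLock conflicts with VACUUM (which also takes
	 * ShareUpdateExclusiveLock), so relfrozenxid and relminmxid cannot be
	 * advanced under us, while ordinary reads and writes remain possible.
	 */
	ctx.rel = relation_open(relid, ShareUpdateExclusiveLock);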

>
>> +    check_relation_relkind(ctx.rel);
>
> I think you also need to ensure that the table is actually using heap
> AM, not another tableam. Oh - you're doing that inside the check. But
> that's confusing, because that's not 'relkind'.

It is checking both relkind and relam.  The function has been renamed to reflect that.

>> +    ctx.relDesc = RelationGetDescr(ctx.rel);
>> +    ctx.rel_natts = RelationGetDescr(ctx.rel)->natts;
>> +    ctx.relfrozenxid = ctx.rel->rd_rel->relfrozenxid;
>> +    ctx.relminmxid = ctx.rel->rd_rel->relminmxid;
>
> three naming schemes in three lines...

Fixed.

>> +    /* check all blocks of the relation */
>> +    beginRelBlockIteration(&ctx);
>> +    while (relBlockIteration_next(&ctx))
>> +    {
>> +        /* Perform tuple checks */
>> +        beginPageTupleIteration(&ctx);
>> +        while (pageTupleIteration_next(&ctx))
>> +            check_tuple(&ctx);
>> +        endPageTupleIteration(&ctx);
>> +    }
>> +    endRelBlockIteration(&ctx);
>
> I again do not find this helper stuff helpful.

Removed.

>> +    /* Close the associated toast table and indexes, if any. */
>> +    if (ctx.has_toastrel)
>> +    {
>> +        toast_close_indexes(ctx.toast_indexes, ctx.num_toast_indexes,
>> +                            AccessShareLock);
>> +        table_close(ctx.toastrel, AccessShareLock);
>> +    }
>> +
>> +    /* Close the main relation */
>> +    relation_close(ctx.rel, AccessShareLock);
>
> Why the closing here?

As opposed to where...?  It seems fairly standard to close the relation in the function where it was opened.  Do you
prefer that the relation not be closed?  Or that it be closed but the lock retained?

>> +# This regression test demonstrates that the heapcheck_relation() function
>> +# supplied with this contrib module correctly identifies specific kinds of
>> +# corruption within pages.  To test this, we need a mechanism to create corrupt
>> +# pages with predictable, repeatable corruption.  The postgres backend cannot be
>> +# expected to help us with this, as its design is not consistent with the goal
>> +# of intentionally corrupting pages.
>> +#
>> +# Instead, we create a table to corrupt, and with careful consideration of how
>> +# postgresql lays out heap pages, we seek to offsets within the page and
>> +# overwrite deliberately chosen bytes with specific values calculated to
>> +# corrupt the page in expected ways.  We then verify that heapcheck_relation
>> +# reports the corruption, and that it runs without crashing.  Note that the
>> +# backend cannot simply be started to run queries against the corrupt table, as
>> +# the backend will crash, at least for some of the corruption types we
>> +# generate.
>> +#
>> +# Autovacuum potentially touching the table in the background makes the exact
>> +# behavior of this test harder to reason about.  We turn it off to keep things
>> +# simpler.  We use a "belt and suspenders" approach, turning it off for the
>> +# system generally in postgresql.conf, and turning it off specifically for the
>> +# test table.
>> +#
>> +# This test depends on the table being written to the heap file exactly as we
>> +# expect it to be, so we take care to arrange the columns of the table, and
>> +# insert rows of the table, that give predictable sizes and locations within
>> +# the table page.
>
> I have a hard time believing this is going to be really
> reliable. E.g. the alignment requirements will vary between platforms,
> leading to different layouts. In particular, MAXALIGN differs between
> platforms.
>
> Also, it's supported to compile postgres with a different pagesize.

It's simple enough to extend the tap test a little to check for those things.  In v3, the tap test skips tests if the
page size is not 8k, and also if the tuples do not fall on the page where expected (which would happen due to alignment
issues, gremlins, or whatever).  There are other approaches, though.  The HeapFile/HeapPage/HeapTuple perl modules
recently submitted on another thread *could* be used here, but only if those modules are likely to be committed.  This
test *could* be extended to autodetect the page size and alignment issues and calculate at runtime where tuples will be
on the page, but only if folks don't mind the test having that extra complexity in it.  (There is a school of thought
that regression tests should avoid excess complexity.)  Do you have a recommendation about which way to go with this?

Here is the work thus far:


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Mark Dilger
Date:
>> I wonder if a mode where heapcheck optionally would only checks
>> non-frozen (perhaps also non-all-visible) regions of a table would be a
>> good idea?

Version 4 of this patch now includes boolean options skip_all_frozen and skip_all_visible.
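
Roughly speaking, the block loop consults the visibility map before reading each page; a sketch of the idea (not necessarily the exact code in the patch; vmbuffer is a Buffer initialized to InvalidBuffer before the loop and released afterward):

/* requires "access/visibilitymap.h" */
for (blkno = 0; blkno < nblocks; blkno++)
{
	uint8		vmstatus = visibilitymap_get_status(ctx->rel, blkno, &vmbuffer);

	/* an all-frozen page is also all-visible, so this test subsumes the next */
	if (skip_all_visible && (vmstatus & VISIBILITYMAP_ALL_VISIBLE) != 0)
		continue;
	if (skip_all_frozen && (vmstatus & VISIBILITYMAP_ALL_FROZEN) != 0)
		continue;

	/* ... read the block and check its tuples ... */
}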

>> Would make it a lot more viable to run this regularly on
>> bigger databases. Even if there's a window to not check some data
>> (because it's frozen before the next heapcheck run).

Do you think it would make sense to have the amcheck contrib module have, in addition to the SQL queriable functions, a
bgworker based mode that periodically checks your database?  The work along those lines is not included in v4, but if it
were part of v5, would you have specific design preferences?


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Robert Haas
Date:
On Wed, Apr 29, 2020 at 12:30 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Do you think it would make sense to have the amcheck contrib module have, in addition to the SQL queriable functions,
a bgworker based mode that periodically checks your database?  The work along those lines is not included in v4, but if
it were part of v5, would you have specific design preferences?

-1 on that idea from me. That sounds like it's basically building
"cron" into PostgreSQL, but in a way that can only be used by amcheck.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Wed, Apr 22, 2020 at 10:43 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> It's simple enough to extend the tap test a little to check for those things.  In v3, the tap test skips tests if the
page size is not 8k, and also if the tuples do not fall on the page where expected (which would happen due to alignment
issues, gremlins, or whatever).

Skipping the test if the tuple isn't in the expected location sounds
really bad. That will just lead to the tests passing without actually
doing anything. If the tuple isn't in the expected location, the tests
should fail.

> There are other approaches, though.  The HeapFile/HeapPage/HeapTuple perl modules recently submitted on another
thread *could* be used here, but only if those modules are likely to be committed.

Yeah, I don't know if we want that stuff or not.

> This test *could* be extended to autodetect the page size and alignment issues and calculate at runtime where tuples
will be on the page, but only if folks don't mind the test having that extra complexity in it.  (There is a school of
thought that regression tests should avoid excess complexity.)  Do you have a recommendation about which way to go with
this?

How much extra complexity are we talking about? It feels to me like
for a heap page, the only things that are going to affect the position
of the tuples on the page -- supposing we know the tuple size -- are
the page size and, I think, MAXALIGN, and that doesn't sound too bad.
Another possibility is to use pageinspect's heap_page_items() to
determine the position within the page (lp_off), which seems like it
might simplify things considerably. Then, we're entirely relying on
the backend to tell us where the tuples are, and we only need to worry
about the offsets relative to the start of the tuple.

I kind of like that approach, because it doesn't involve having Perl
code that knows how heap pages are laid out; we rely entirely on the C
code for that. I'm not sure if it'd be a problem to have a TAP test
for one contrib module that uses another contrib module, but maybe
there's some way to figure that problem out.
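
For what it's worth, the arithmetic for the page-size/MAXALIGN approach would be something like this (a sketch, assuming a heap page filled from empty with equal-length tuples and no special space):

#include "postgres.h"

/*
 * Sketch: expected byte offset of the n-th tuple (0-based) on a heap page
 * that was filled from empty.  Heap pages have no special space, so tuples
 * are placed downward from the end of the page, each MAXALIGN'd.
 */
static Size
expected_tuple_offset(Size page_size, Size tuple_len, int n)
{
	return page_size - (n + 1) * MAXALIGN(tuple_len);
}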

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Wed, Apr 29, 2020 at 12:30 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Version 4 of this patch now includes boolean options skip_all_frozen and skip_all_visible.

I'm not sure, but maybe there should just be one argument with
three possible values, because skip_all_frozen = true and
skip_all_visible = false seems nonsensical. On the other hand, if we
used a text argument with three possible values, I'm not sure what
we'd call the argument or what strings we'd use as the values.
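
Just thinking out loud, one hypothetical spelling (names invented purely for illustration):

/* Hypothetical single option replacing the two booleans */
typedef enum SkipPages
{
	SKIP_PAGES_NONE,			/* skip_pages => 'none': check every page */
	SKIP_PAGES_ALL_FROZEN,		/* skip_pages => 'all-frozen' */
	SKIP_PAGES_ALL_VISIBLE		/* skip_pages => 'all-visible' (skips all-frozen too) */
} SkipPages;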

Also, what do people -- either those who have already responded, or
others -- think about the idea of putting a command-line tool around
this? I know that there were some rumblings about this in respect to
pg_verifybackup, but I think a pg_amcheck binary would be
well-received. It could do some interesting things, too. For instance,
it could query pg_class for a list of relations that amcheck would
know how to check, and then issue a separate query for each relation,
which would avoid holding a snapshot or heavyweight locks across the
whole operation. It could do parallelism across relations by opening
multiple connections, or even within a single relation if -- as I
think would be a good idea -- we extended heapcheck to take a range of
block numbers after the style of pg_prewarm.

Apart from allowing for client-driven parallelism, accepting block
number ranges would have the advantage -- IMHO pretty significant --
of making it far easier to use this on a relation where some blocks
are entirely unreadable. You could specify ranges to check out the
remaining blocks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Apr 29, 2020, at 11:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Wed, Apr 22, 2020 at 10:43 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> It's simple enough to extend the tap test a little to check for those things.  In v3, the tap test skips tests if
the page size is not 8k, and also if the tuples do not fall on the page where expected (which would happen due to
alignment issues, gremlins, or whatever).
>
> Skipping the test if the tuple isn't in the expected location sounds
> really bad. That will just lead to the tests passing without actually
> doing anything. If the tuple isn't in the expected location, the tests
> should fail.
>
>> There are other approaches, though.  The HeapFile/HeapPage/HeapTuple perl modules recently submitted on another
thread *could* be used here, but only if those modules are likely to be committed.
>
> Yeah, I don't know if we want that stuff or not.
>
>> This test *could* be extended to autodetect the page size and alignment issues and calculate at runtime where tuples
will be on the page, but only if folks don't mind the test having that extra complexity in it.  (There is a school of
thought that regression tests should avoid excess complexity.)  Do you have a recommendation about which way to go with
this?
>
> How much extra complexity are we talking about?

The page size is easy to query, and the test already does so, skipping if the answer isn't 8k.  The test could
recalculate offsets based on the page size rather than skipping the test easily enough, but the MAXALIGN stuff is a
little harder.  I don't know (perhaps someone would share?) how to easily query that from within a perl test.  So the
test could guess all possible alignments that occur in the real world, read from the page at the offset that alignment
would create, and check if the expected datum is there.  The test would have to be careful to avoid false positives, by
placing data before and after the datum being checked with bit patterns that cannot be misinterpreted as a match.  That
level of complexity seems unappealing, at least to me.  It's not hard to write, but maintaining stuff like that is an
unwelcome burden.

> It feels to me like
> for a heap page, the only things that are going to affect the position
> of the tuples on the page -- supposing we know the tuple size -- are
> the page size and, I think, MAXALIGN, and that doesn't sound too bad.
> Another possibility is to use pageinspect's heap_page_items() to
> determine the position within the page (lp_off), which seems like it
> might simplify things considerably. Then, we're entirely relying on
> the backend to tell us where the tuples are, and we only need to worry
> about the offsets relative to the start of the tuple.
>
> I kind of like that approach, because it doesn't involve having Perl
> code that knows how heap pages are laid out; we rely entirely on the C
> code for that. I'm not sure if it'd be a problem to have a TAP test
> for one contrib module that uses another contrib module, but maybe
> there's some way to figure that problem out.

Yeah, I'll give this a try.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Mark Dilger
Date:
Here is v5 of the patch.  Major changes in this version include:

1) A new module, pg_amcheck, which includes a command line client for checking a database or subset of a database.
Internally it functions by querying the database for a list of tables which are appropriate given the command line
switches, and then calls amcheck's functions to validate each table and/or index.  The options for selecting/excluding
tables and schemas are patterned on pg_dump, on the assumption that interface is already familiar to users.

2) amcheck's btree checking functions have been refactored to be able to operate in two modes; the original mode in
which all errors are reported via ereport, and a new mode for returning errors as rows from a set returning function.
The new mode is used by a new function verify_btreeam(), analogous to verify_heapam(), both of which are used by the
pg_amcheck command line tool.  (A rough sketch of the two-mode idea appears just after this list.)

3) The regression test which generates corruption within a table uses the pageinspect module to determine the location
of each tuple on disk for corrupting.  This was suggested upthread.
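
To make point 2 a little more concrete, here is a rough sketch of the two-mode idea (illustrative names only, not the exact code in the patch):

#include "postgres.h"
#include "utils/builtins.h"
#include "utils/tuplestore.h"

/* Illustrative state shared by the checking functions. */
typedef struct BtreeCheckReport
{
	bool			 use_ereport;	/* original behavior: raise an error */
	Tuplestorestate *tupstore;		/* SRF output, when not using ereport */
	TupleDesc		 tupdesc;
} BtreeCheckReport;

static void
report_corruption(BtreeCheckReport *report, const char *msg)
{
	if (report->use_ereport)
		ereport(ERROR,
				(errcode(ERRCODE_INDEX_CORRUPTED),
				 errmsg("%s", msg)));
	else
	{
		Datum	values[1] = {CStringGetTextDatum(msg)};
		bool	nulls[1] = {false};

		/* corruption becomes a result row, and checking continues */
		tuplestore_putvalues(report->tupstore, report->tupdesc, values, nulls);
	}
}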

Testing on the command line shows that the pre-existing btree checking code could use some hardening, as it currently
crashes the backend on certain corruptions.  When I corrupt relation files for tables and indexes in the backend and
then use pg_amcheck to check all objects in the database, I keep getting assertions from the btree checking code.  I
think I need to harden this code, but wanted to post an updated patch and solicit opinions before doing so.  Here are
some example problems I'm seeing.  Note the stack trace when calling from the command line tool includes the new
verify_btreeam function, but you can get the same crashes using the old interface via psql:

From psql, first error:

test=# select bt_index_parent_check('corrupted_idx', true, true);
TRAP: FailedAssertion("_bt_check_natts(rel, key->heapkeyspace, page, offnum)", File: "nbtsearch.c", Line: 663)
0   postgres                            0x0000000106872977 ExceptionalCondition + 103
1   postgres                            0x00000001063a33e2 _bt_compare + 1090
2   amcheck.so                          0x0000000106d62921 bt_target_page_check + 6033
3   amcheck.so                          0x0000000106d5fd2f bt_index_check_internal + 2847
4   amcheck.so                          0x0000000106d60433 bt_index_parent_check + 67
5   postgres                            0x00000001064d6762 ExecInterpExpr + 1634
6   postgres                            0x000000010650d071 ExecResult + 321
7   postgres                            0x00000001064ddc3d standard_ExecutorRun + 301
8   postgres                            0x00000001066600c5 PortalRunSelect + 389
9   postgres                            0x000000010665fc7f PortalRun + 527
10  postgres                            0x000000010665ed59 exec_simple_query + 1641
11  postgres                            0x000000010665c99d PostgresMain + 3661
12  postgres                            0x00000001065d6a8a BackendRun + 410
13  postgres                            0x00000001065d61c4 ServerLoop + 3044
14  postgres                            0x00000001065d2fe9 PostmasterMain + 3769
15  postgres                            0x000000010652e3b0 help + 0
16  libdyld.dylib                       0x00007fff6725fcc9 start + 1
server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.
The connection to the server was lost. Attempting reset: 2020-05-11 10:11:47.394 PDT [41091] LOG:  server process (PID
41309) was terminated by signal 6: Abort trap: 6



From the command line, second error:

pgtest % pg_amcheck -i test
(relname=corrupted,blkno=0,offnum=16,lp_off=7680,lp_flags=1,lp_len=31,attnum=,chunk=)
tuple xmin = 3289393 is in the future
(relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
tuple xmax = 0 precedes relation relminmxid = 1
(relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
tuple xmin = 12593 is in the future
(relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)

<snip>

(relname=corrupted,blkno=107,offnum=20,lp_off=7392,lp_flags=1,lp_len=34,attnum=,chunk=)
tuple xmin = 306 precedes relation relfrozenxid = 487
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
tuple xmax = 0 precedes relation relminmxid = 1
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
tuple xmin = 305 precedes relation relfrozenxid = 487
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
t_hoff > lp_len (54 > 34)
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
t_hoff not max-aligned (54)
TRAP: FailedAssertion("TransactionIdIsValid(xmax)", File: "heapam_visibility.c", Line: 1319)
0   postgres                            0x0000000105b22977 ExceptionalCondition + 103
1   postgres                            0x0000000105636e86 HeapTupleSatisfiesVacuum + 1158
2   postgres                            0x0000000105634aa1 heapam_index_build_range_scan + 1089
3   amcheck.so                          0x00000001060100f3 bt_index_check_internal + 3811
4   amcheck.so                          0x000000010601057c verify_btreeam + 316
5   postgres                            0x0000000105796266 ExecMakeTableFunctionResult + 422
6   postgres                            0x00000001057a8c35 FunctionNext + 101
7   postgres                            0x00000001057bbf3e ExecNestLoop + 478
8   postgres                            0x000000010578dc3d standard_ExecutorRun + 301
9   postgres                            0x00000001059100c5 PortalRunSelect + 389
10  postgres                            0x000000010590fc7f PortalRun + 527
11  postgres                            0x000000010590ed59 exec_simple_query + 1641
12  postgres                            0x000000010590c99d PostgresMain + 3661
13  postgres                            0x0000000105886a8a BackendRun + 410
14  postgres                            0x00000001058861c4 ServerLoop + 3044
15  postgres                            0x0000000105882fe9 PostmasterMain + 3769
16  postgres                            0x00000001057de3b0 help + 0
17  libdyld.dylib                       0x00007fff6725fcc9 start + 1
pg_amcheck: error: query failed: server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Mon, May 11, 2020 at 10:21 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> 2) amcheck's btree checking functions have been refactored to be able to operate in two modes; the original mode in
which all errors are reported via ereport, and a new mode for returning errors as rows from a set returning function.
 

Somebody suggested that I make amcheck work in this way during its
initial development. I rejected that idea at the time, though. It
seems hard to make it work because the B-Tree index scan is a logical
order index scan. It's quite possible that a corrupt index will have
circular sibling links, and things like that. Making everything an
error removes that concern. There are clearly some failures that we
could just soldier on from, but the distinction gets rather blurred.

I understand why you want to do it this way. It makes sense that the
heap stuff would report all inconsistencies together, at the end. I
don't think that that's really workable (or even desirable) in the
case of B-Tree indexes, though. When an index is corrupt, the solution
is always to do root cause analysis, to make sure that the issue does
not recur, and then to REINDEX. There isn't really a question about
doing data recovery of the index structure.

Would it be possible to log the first B-Tree inconsistency, and then
move on to the next high-level phase of verification? You don't have
to throw an error, but it seems like a good idea for amcheck to still
give up on further verification of the index.

The assertion failure that you reported happens because of a generic
assertion made from _bt_compare(). It doesn't have anything to do with
amcheck (you'll see the same thing from regular index scans), really.
I think that removing that assertion would be the opposite of
hardening. Even if you removed it, the backend will still crash once
you come up with a slightly more evil index tuple. Maybe *that* could
be mostly avoided with widespread hardening; we could in principle
perform cross-checks of varlena headers against the tuple or page
layout at any point reachable from _bt_compare(). That seems like
something that would have unacceptable overhead, because the cost
would be imposed on everything. And even then you've only ameliorated
the problem.

Code like amcheck's PageGetItemIdCareful() goes further than the
equivalent backend macro (PageGetItemId()) to avoid assertion failures
and crashes with corrupt data. I doubt that it is practical to take it
much further than that, though. It's subject to diminishing returns.
In general, _bt_compare() calls user-defined code that is usually
written in C. This C code could in principle feel entitled to do any
number of scary things when you corrupt the input data. The amcheck
module's dependency on user-defined operator code is totally
unavoidable -- it is the single source of truth for the nbtree checks.

It boils down to this: I think that regression tests that run on the
buildfarm and actually corrupt data are not practical, at least in the
case of the index checks -- though probably in all cases. Look at the
pageinspect "btree.out" test output file -- it's very limited, because
we have to work around a bunch of implementation details. It's no
accident that the bt_page_items() test shows a palindrome value in the
data column (the value is "01 00 00 00 00 00 00 01"). That's an
endianness workaround.

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On May 12, 2020, at 5:34 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, May 11, 2020 at 10:21 AM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> 2) amcheck's btree checking functions have been refactored to be able to operate in two modes; the original mode in
which all errors are reported via ereport, and a new mode for returning errors as rows from a set returning function.

Thank you yet again for reviewing.  I really appreciate the feedback!

> Somebody suggested that I make amcheck work in this way during its
> initial development. I rejected that idea at the time, though. It
> seems hard to make it work because the B-Tree index scan is a logical
> order index scan. It's quite possible that a corrupt index will have
> circular sibling links, and things like that. Making everything an
> error removes that concern. There are clearly some failures that we
> could just soldier on from, but the distinction gets rather blurred.

Ok, I take your point that the code cannot soldier on after the first error is returned.  I'll change that for v6 of
the patch, moving on to the next relation after hitting the first corruption in any particular index.  Do you mind that
I refactored the code to return the error rather than ereporting?  If it offends your sensibilities, I could rip that
back out, at the expense of having to use try/catch logic in some other places.  I prefer to avoid the try/catch stuff,
but I'm not going to put up a huge fuss.

> I understand why you want to do it this way. It makes sense that the
> heap stuff would report all inconsistencies together, at the end. I
> don't think that that's really workable (or even desirable) in the
> case of B-Tree indexes, though. When an index is corrupt, the solution
> is always to do root cause analysis, to make sure that the issue does
> not recur, and then to REINDEX. There isn't really a question about
> doing data recovery of the index structure.

Yes, I agree that reindexing is the most sensible remedy.  I certainly have no plans to implement some pg_fsck_index
type tool.  Even for tables, I'm not interested in creating such a tool. I just want a good tool for finding out what
the nature of the corruption is, as that might make it easier to debug what went wrong.  It's not just for debugging
production systems, but also for chasing down problems in half-baked code prior to release.

> Would it be possible to log the first B-Tree inconsistency, and then
> move on to the next high-level phase of verification? You don't have
> to throw an error, but it seems like a good idea for amcheck to still
> give up on further verification of the index.

Ok, good, it sounds like we're converging on the same idea.  I'm happy to do so.

> The assertion failure that you reported happens because of a generic
> assertion made from _bt_compare(). It doesn't have anything to do with
> amcheck (you'll see the same thing from regular index scans), really.

Oh, I know that already.  I could see that easily enough in the backtrace.  But if you look at the way I implemented
verify_heapam, you might notice this:

/*
 * check_tuphdr_xids
 *
 *  Determine whether tuples are visible for verification.  Similar to
 *  HeapTupleSatisfiesVacuum, but with critical differences.
 *
 *  1) Does not touch hint bits.  It seems imprudent to write hint bits
 *     to a table during a corruption check.
 *  2) Only makes a boolean determination of whether verification should
 *     see the tuple, rather than doing extra work for vacuum-related
 *     categorization.
 *
 *  The caller should already have checked that xmin and xmax are not out of
 *  bounds for the relation.
 */

The point is that when checking the table for corruption I avoid calling anything that might assert (or segfault, or
whatever). I was talking about refactoring the btree checking code to be similarly careful. 

> I think that removing that assertion would be the opposite of
> hardening. Even if you removed it, the backend will still crash once
> you come up with a slightly more evil index tuple. Maybe *that* could
> be mostly avoided with widespread hardening; we could in principle
> perform cross-checks of varlena headers against the tuple or page
> layout at any point reachable from _bt_compare(). That seems like
> something that would have unacceptable overhead, because the cost
> would be imposed on everything. And even then you've only ameliorated
> the problem.

I think we may have different mental models of how this all works in practice.  I am (or was) envisioning that the
backend, during regular table and index scans, cannot afford to check for corruption at all steps along the way, and
therefore does not, but that a corruption checking tool has a fundamentally different purpose, and can and should choose
to operate in a way that won't blow up when checking a corrupt relation.  It's the difference between a car designed to
drive down the highway at high speed vs. a military vehicle designed to drive over a minefield with a guy on the front
bumper scanning for landmines, the whole while going half a mile an hour.

I'm starting to infer from your comments that you see the landmine detection vehicle as also driving at high speed,
detecting landmines on occasion by seeing them first, but frequently by failing to see them and just blowing up.

> Code like amcheck's PageGetItemIdCareful() goes further than the
> equivalent backend macro (PageGetItemId()) to avoid assertion failures
> and crashes with corrupt data. I doubt that it is practical to take it
> much further than that, though. It's subject to diminishing returns.

Ok.

> In general, _bt_compare() calls user-defined code that is usually
> written in C. This C code could in principle feel entitled to do any
> number of scary things when you corrupt the input data. The amcheck
> module's dependency on user-defined operator code is totally
> unavoidable -- it is the single source of truth for the nbtree checks.

I don't really understand this argument, since users with buggy user defined operators are not the target audience, but
I also don't think there is any point in arguing it, since I'm already resolved to take your advice about not hardening
the btree stuff any further.

> It boils down to this: I think that regression tests that run on the
> buildfarm and actually corrupt data are not practical, at least in the
> case of the index checks -- though probably in all cases. Look at the
> pageinspect "btree.out" test output file -- it's very limited, because
> we have to work around a bunch of implementation details. It's no
> accident that the bt_page_items() test shows a palindrome value in the
> data column (the value is "01 00 00 00 00 00 00 01"). That's an
> endianness workaround.

One of the delays in submitting the most recent version of the patch is that I was having trouble creating a reliable,
portable btree corrupting regression test.  Ultimately, I submitted v5 without any btree corrupting regression test, as
it proved pretty difficult to write one good enough for submission, and I had already put a couple more days into
developing v5 than I had intended.  So I can't argue too much with your point here.

I did however address (some?) issues that you and others mentioned about the table corrupting regression test.  Perhaps
there are remaining issues that will show up on machines with different endianness than I have thus far tested, but I
don't see that they will be insurmountable.  Are you fundamentally opposed to that test framework?  If you're going to
vote against committing the patch with that test, I'll back down and just remove it from the patch, but it doesn't seem
like a bad regression test to me.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Tue, May 12, 2020 at 7:07 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Thank you yet again for reviewing.  I really appreciate the feedback!

Happy to help. It's important work.

> Ok, I take your point that the code cannot soldier on after the first error is returned.  I'll change that for v6 of
the patch, moving on to the next relation after hitting the first corruption in any particular index.  Do you mind that
I refactored the code to return the error rather than ereporting?

try/catch seems like the way to do it. Not all amcheck errors come
from amcheck -- some are things that the backend code does, that are
known to appear in amcheck from time to time. I'm thinking in
particular of the
table_index_build_scan()/heapam_index_build_range_scan() errors, as
well as the errors from _bt_checkpage().

> Yes, I agree that reindexing is the most sensible remedy.  I certainly have no plans to implement some pg_fsck_index
type tool.  Even for tables, I'm not interested in creating such a tool. I just want a good tool for finding out what
the nature of the corruption is, as that might make it easier to debug what went wrong.  It's not just for debugging
production systems, but also for chasing down problems in half-baked code prior to release.

All good goals.

>  * check_tuphdr_xids

> The point is that when checking the table for corruption I avoid calling anything that might assert (or segfault, or
whatever).

I don't think that you can expect to avoid assertion failures in
general. I'll stick with your example. You're calling
TransactionIdDidCommit() from check_tuphdr_xids(), which will
interrogate the commit log and pg_subtrans. It's just not under your
control. I'm sure that you could get an assertion failure somewhere in
there, and even if you couldn't that could change at any time.

You've quasi-duplicated some sensitive code to do that much, which
seems excessive. But it's also not enough.

> I'm starting to infer from your comments that you see the landmine detection vehicle as also driving at high speed,
detecting landmines on occasion by seeing them first, but frequently by failing to see them and just blowing up.

That's not it. I would certainly prefer if the landmine detector
didn't blow up. Not having that happen is certainly a goal I share --
that's why PageGetItemIdCareful() exists. But not at any cost,
especially not when "blow up" means an assertion failure that users
won't actually see in production. Avoiding assertion failures like the
one you showed is likely to have a high cost (removing defensive
asserts in low level access method code) for a low benefit. Any
attempt to avoid having the checker itself blow up rather than throw
an error message needs to be assessed pragmatically, on a case-by-case
basis.

> One of the delays in submitting the most recent version of the patch is that I was having trouble creating a
reliable, portable btree corrupting regression test.

To be clear, I think that corrupting data is very helpful with ad-hoc
testing during development.

> I did however address (some?) issues that you and others mentioned about the table corrupting regression test.
Perhaps there are remaining issues that will show up on machines with different endianness than I have thus far tested,
but I don't see that they will be insurmountable.  Are you fundamentally opposed to that test framework?

I haven't thought about it enough just yet, but I am certainly suspicious of it.

--
Peter Geoghegan



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Tue, May 12, 2020 at 11:06 PM Peter Geoghegan <pg@bowt.ie> wrote:
> try/catch seems like the way to do it. Not all amcheck errors come
> from amcheck -- some are things that the backend code does, that are
> known to appear in amcheck from time to time. I'm thinking in
> particular of the
> table_index_build_scan()/heapam_index_build_range_scan() errors, as
> well as the errors from _bt_checkpage().

That would require the use of a subtransaction.

> You've quasi-duplicated some sensitive code to do that much, which
> seems excessive. But it's also not enough.

I think this is a good summary of the problems in this area. On the
one hand, I think it's hideous that we sanity check user input to
death, but blindly trust the bytes on disk to the point of seg
faulting if they're wrong. The idea that int4 + int4 has to have
overflow checking because otherwise a user might be sad when they get
a negative result from adding two negative numbers, while at the same
time supposing that the same user will be unwilling to accept the
performance hit to avoid crashing if they have a bad tuple, is quite
suspect in my mind. The overflow checking is also expensive, but we do
it because it's the right thing to do, and then we try to minimize the
overhead. It is unclear to me why we shouldn't also take that approach
with bytes that come from disk. In particular, using Assert() checks
for such things instead of elog() is basically Assert(there is no such
thing as a corrupted database).
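
To illustrate the distinction with an invented example (not code from the patch):

/* The Assert() vanishes in production builds, so corrupt data sails through
 * (or aborts an assert-enabled build); the elog() turns the same corruption
 * into an error the user can actually see and act on. */
Assert(ItemIdGetLength(itemid) <= BLCKSZ);

if (ItemIdGetLength(itemid) > BLCKSZ)
	elog(ERROR, "line pointer length %u exceeds block size %u",
		 (unsigned) ItemIdGetLength(itemid), (unsigned) BLCKSZ);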

On the other hand, that problem is clearly way above this patch's pay
grade. There's a lot of stuff all over the code base that would have
to be changed to fix it. It can't be done as an incidental thing as
part of this patch or any other. It's a massive effort unto itself. We
need to somehow draw a clean line between what this patch does and
what it does not do, such that the scope of this patch remains
something achievable. Otherwise, we'll end up with nothing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Wed, May 13, 2020 at 12:22 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I think this is a good summary of the problems in this area. On the
> one hand, I think it's hideous that we sanity check user input to
> death, but blindly trust the bytes on disk to the point of seg
> faulting if they're wrong. The idea that int4 + int4 has to have
> overflow checking because otherwise a user might be sad when they get
> a negative result from adding two negative numbers, while at the same
> time supposing that the same user will be unwilling to accept the
> performance hit to avoid crashing if they have a bad tuple, is quite
> suspect in my mind. The overflow checking is also expensive, but we do
> it because it's the right thing to do, and then we try to minimize the
> overhead. It is unclear to me why we shouldn't also take that approach
> with bytes that come from disk. In particular, using Assert() checks
> for such things instead of elog() is basically Assert(there is no such
> thing as a corrupted database).

I think that it depends. It's nice to be able to add an Assert()
without really having to worry about the overhead at all. I sometimes
call relatively expensive functions in assertions. For example, there
is an assert that calls _bt_compare() within _bt_check_unique() that I
added at one point -- it caught a real bug a few weeks later. You
could always be doing more.

In general we don't exactly trust the bytes blindly. I've found that
corrupting tuples in a creative way with pg_hexedit doesn't usually
result in a segfault. Sometimes we'll do things like display NULL
values when heap line pointers are corrupt, which isn't as good as an
error but is still okay. We ought to protect against Murphy, not
Machiavelli. ISTM that access method code naturally evolves towards
avoiding the most disruptive errors in the event of real world
corruption, in particular avoiding segfaulting. It's very hard to
prove that, though.

Do you recall seeing corruption resulting in segfaults in production?
I personally don't recall seeing that. If it happened, the segfaults
themselves probably wouldn't be the main concern.

> On the other hand, that problem is clearly way above this patch's pay
> grade. There's a lot of stuff all over the code base that would have
> to be changed to fix it. It can't be done as an incidental thing as
> part of this patch or any other. It's a massive effort unto itself. We
> need to somehow draw a clean line between what this patch does and
> what it does not do, such that the scope of this patch remains
> something achievable. Otherwise, we'll end up with nothing.

I can easily come up with an adversarial input that will segfault a
backend, even amcheck, but it'll be somewhat contrived. It's hard to
fool amcheck currently because it doesn't exactly trust line pointers.
But I'm sure I could get the backend to segfault amcheck if I tried.
I'd probably try to play around with varlena headers. It would require
a certain amount of craftiness.

It's not exactly clear where you draw the line here. And I don't think
that the line will be very clearly defined, in the end. It'll be
something that is subject to change over time, as new information
comes to light. I think that it's necessary to accept a certain amount
of ambiguity here.

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Alvaro Herrera
Date:
On 2020-May-12, Peter Geoghegan wrote:

> > The point is that when checking the table for corruption I avoid
> > calling anything that might assert (or segfault, or whatever).
> 
> I don't think that you can expect to avoid assertion failures in
> general.

Hmm.  I think we should (try to?) write code that avoids all crashes
with production builds, but not extend that to assertion failures.
Sticking again with the provided example,

> I'll stick with your example. You're calling
> TransactionIdDidCommit() from check_tuphdr_xids(), which will
> interrogate the commit log and pg_subtrans. It's just not under your
> control.

in a production build this would just fail with an error that the
pg_xact file cannot be found, which is fine -- if this happens in a
production system, you're not disturbing any other sessions.  Or maybe
the file is there and the byte can be read, in which case you would get
the correct response; but that's fine too.

I don't know to what extent this is possible.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Wed, May 13, 2020 at 3:10 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> Hmm.  I think we should (try to?) write code that avoids all crashes
> with production builds, but not extend that to assertion failures.

Assertions are only a problem at all because Mark would like to write
tests that involve a selection of truly corrupt data. That's a new
requirement, and one that I have my doubts about.

> > I'll stick with your example. You're calling
> > TransactionIdDidCommit() from check_tuphdr_xids(), which will
> > interrogate the commit log and pg_subtrans. It's just not under your
> > control.
>
> in a production build this would just fail with an error that the
> pg_xact file cannot be found, which is fine -- if this happens in a
> production system, you're not disturbing any other sessions.  Or maybe
> the file is there and the byte can be read, in which case you would get
> the correct response; but that's fine too.

I think that this is fine, too, since I don't consider assertion
failures with corrupt data all that important. I'd make some effort to
avoid it, but not too much, and not at the expense of a useful general
purpose assertion that could catch bugs in many different contexts.

I would be willing to make a larger effort to avoid crashing a
backend, since that affects production. I might go to some effort to
not crash with downright adversarial inputs, for example. But it seems
inappropriate to take extreme measures just to avoid a crash with
extremely contrived inputs that will probably never occur. My sense is
that this is subject to sharply diminishing returns. Completely
nailing down hard crashes from corrupt data seems like the wrong
priority, at the very least. Pursuing that objective over other
objectives sounds like zero-risk bias.

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Alvaro Herrera
Date:
On 2020-May-13, Peter Geoghegan wrote:

> On Wed, May 13, 2020 at 3:10 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> > Hmm.  I think we should (try to?) write code that avoids all crashes
> > with production builds, but not extend that to assertion failures.
> 
> Assertions are only a problem at all because Mark would like to write
> tests that involve a selection of truly corrupt data. That's a new
> requirement, and one that I have my doubts about.

I agree that this (a test tool that exercises our code against
arbitrarily corrupted data pages) is not going to work as a test that
all buildfarm members run -- it seems something for specialized
buildfarm members to run, or even something that's run outside of the
buildfarm, like sqlsmith.  Obviously such a tool would not be able to
run against an assertion-enabled build, and we shouldn't even try.

> I would be willing to make a larger effort to avoid crashing a
> backend, since that affects production. I might go to some effort to
> not crash with downright adversarial inputs, for example. But it seems
> inappropriate to take extreme measures just to avoid a crash with
> extremely contrived inputs that will probably never occur. My sense is
> that this is subject to sharply diminishing returns. Completely
> nailing down hard crashes from corrupt data seems like the wrong
> priority, at the very least. Pursuing that objective over other
> objectives sounds like zero-risk bias.

I think my initial approach for this would be to use a fuzzing tool that
generates data blocks semi-randomly, then uses them as Postgres data
pages somehow, and see what happens -- examine any resulting crashes and
make individual judgement calls about the fix(es) necessary to prevent
each of them.  I expect that many such pages would be rejected as
corrupt by page header checks.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Wed, May 13, 2020 at 4:32 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I think my initial approach for this would be to use a fuzzing tool that
> generates data blocks semi-randomly, then uses them as Postgres data
> pages somehow, and see what happens -- examine any resulting crashes and
> make individual judgement calls about the fix(es) necessary to prevent
> each of them.  I expect that many such pages would be rejected as
> corrupt by page header checks.

As I mentioned in my response to Robert earlier, that's more or less
been my experience with adversarial corruption generated using
pg_hexedit. Within nbtree, as well as heapam. I put a lot of work into
that tool, and have used it to simulate all kinds of weird scenarios.
I've done things like corrupt individual tuple header fields, swap
line pointers, create circular sibling links in indexes, corrupt
varlena headers, and corrupt line pointer flags/status bits. Postgres
itself rarely segfaults, and amcheck will only segfault with a truly
contrived input.

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On May 13, 2020, at 3:29 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, May 13, 2020 at 3:10 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>> Hmm.  I think we should (try to?) write code that avoids all crashes
>> with production builds, but not extend that to assertion failures.
>
> Assertions are only a problem at all because Mark would like to write
> tests that involve a selection of truly corrupt data. That's a new
> requirement, and one that I have my doubts about.
>
>>> I'll stick with your example. You're calling
>>> TransactionIdDidCommit() from check_tuphdr_xids(), which will
>>> interrogate the commit log and pg_subtrans. It's just not under your
>>> control.
>>
>> in a production build this would just fail with an error that the
>> pg_xact file cannot be found, which is fine -- if this happens in a
>> production system, you're not disturbing any other sessions.  Or maybe
>> the file is there and the byte can be read, in which case you would get
>> the correct response; but that's fine too.
>
> I think that this is fine, too, since I don't consider assertion
> failures with corrupt data all that important. I'd make some effort to
> avoid it, but not too much, and not at the expense of a useful general
> purpose assertion that could catch bugs in many different contexts.

I am not removing any assertions.  I do not propose to remove any assertions. When I talk about "hardening against
assertions",that is not in any way a proposal to remove assertions from the code.  What I'm talking about is writing
theamcheck contrib module code in such a way that it only calls a function that could assert on bad data after checking
thatthe data is not bad. 

I don't know that hardening against assertions in this manner is worth doing, but this is none the less what I'm
talking about.  You have made decent arguments that it probably isn't worth doing for the btree checking code.  And in
any event, it is probably something that could be addressed in a future patch after getting this patch committed.

There is a separate but related question in the offing about whether the backend code, independently of any amcheck
contrib stuff, should be more paranoid in how it processes tuples to check for corruption.  The heap deform tuple code
in question is on a pretty hot code path, and I don't know that folks would accept the performance hit of more checks
being done in that part of the system, but that's pretty far from relevant to this patch.  That should be hashed out, or
not, at some other time on some other thread.

> I would be willing to make a larger effort to avoid crashing a
> backend, since that affects production. I might go to some effort to
> not crash with downright adversarial inputs, for example. But it seems
> inappropriate to take extreme measures just to avoid a crash with
> extremely contrived inputs that will probably never occur.

I think this is a misrepresentation of the tests that I've been running.  There are two kinds of tests that I have
done:

First, there is the regression test, t/004_verify_heapam.pl, which is obviously contrived.  That was included in the
regression test suite because it needed to be something other developers could read, verify, "yeah, I can see why that
would be corruption, and would give an error message of the sort the test expects", and then could be run to verify that
indeed that expected error message was generated.

The second kind of corruption test I have been running is nothing more than writing random nonsense into randomly
chosen locations within heap files and then running verify_heapam against those heap relations.  It's much more Murphy
than Machiavelli when it's just generated by calling random().  When I initially did this kind of testing, the heapam
checking code had lots of problems.  Now it doesn't.  There's very little contrived about that which I can see. It's the
kind of corruption you'd expect from any number of faulty storage systems.  The one "contrived" aspect of my testing in
this regard is that the script I use to write random nonsense to random locations in heap files is smart enough not to
write random junk to the page headers.  This is because if I corrupt the page headers, the backend never even gets as
far as running the verify_heapam functions, as the page cache rejects loading the page.
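
The gist of that script, rendered as a C sketch rather than the actual code I use:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BLCKSZ			8192
#define PAGE_HEADER_SZ	24		/* sizeof(PageHeaderData), no line pointers */

/* Overwrite random bytes at random offsets in a heap file, but never within
 * the page header of any page, so the page still passes header checks. */
int
main(int argc, char **argv)
{
	FILE   *fp;
	long	filesize;
	int		i;

	if (argc < 2 || (fp = fopen(argv[1], "r+b")) == NULL)
		return 1;
	fseek(fp, 0, SEEK_END);
	filesize = ftell(fp);
	if (filesize < BLCKSZ)
		return 1;
	srand((unsigned) time(NULL));

	for (i = 0; i < 100; i++)
	{
		long			page = rand() % (filesize / BLCKSZ);
		long			off = PAGE_HEADER_SZ + rand() % (BLCKSZ - PAGE_HEADER_SZ);
		unsigned char	junk = (unsigned char) rand();

		fseek(fp, page * BLCKSZ + off, SEEK_SET);
		fwrite(&junk, 1, 1, fp);
	}
	fclose(fp);
	return 0;
}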


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Wed, May 13, 2020 at 5:18 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> I am not removing any assertions.  I do not propose to remove any assertions. When I talk about "hardening against
assertions",that is not in any way a proposal to remove assertions from the code. 

I'm sorry if I seemed to suggest that you wanted to remove assertions,
rather than test more things earlier. I recognize that that could be a
useful thing to do, both in general, and maybe even in the specific
example you gave -- on general robustness grounds. At the same time,
it's something that can only be taken so far. It's probably not going
to make it practical to corrupt data in a regression test or tap test.

> There is a separate but related question in the offing about whether the backend code, independently of any amcheck
contrib stuff, should be more paranoid in how it processes tuples to check for corruption.

I bet that there is something that we could do to be a bit more
defensive. Of course, we do a certain amount of that on general
robustness grounds already. A systematic review of that could be quite
useful. But as you point out, it's not really in scope here.

> > I would be willing to make a larger effort to avoid crashing a
> > backend, since that affects production. I might go to some effort to
> > not crash with downright adversarial inputs, for example. But it seems
> > inappropriate to take extreme measures just to avoid a crash with
> > extremely contrived inputs that will probably never occur.
>
> I think this is a misrepresentation of the tests that I've been running.

I didn't actually mean it that way, but I can see how my words could
reasonably be interpreted that way. I apologize.

> There are two kinds of tests that I have done:
>
> First, there is the regression tests, t/004_verify_heapam.pl, which is obviously contrived.  That was included in the
> regression test suite because it needed to be something other developers could read, verify, "yeah, I can see why that
> would be corruption, and would give an error message of the sort the test expects", and then could be run to verify that
> indeed that expected error message was generated.

I still don't think that this is necessary. It could work for one type
of corruption, that happens to not have any of the problems, but just
testing that one type of corruption seems rather arbitrary to me.

> The second kind of corruption test I have been running is nothing more than writing random nonsense into randomly
> chosen locations within heap files and then running verify_heapam against those heap relations.  It's much more Murphy
> than Machiavelli when it's just generated by calling random().

That sounds like a good initial test case, to guide your intuitions
about how to make the feature robust.

--
Peter Geoghegan



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On May 13, 2020, at 5:36 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, May 13, 2020 at 5:18 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> I am not removing any assertions.  I do not propose to remove any assertions. When I talk about "hardening against
>> assertions", that is not in any way a proposal to remove assertions from the code.
>
> I'm sorry if I seemed to suggest that you wanted to remove assertions

Not a problem at all.  As always, I appreciate your involvement in this code and design review.


>> I think this is a misrepresentation of the tests that I've been running.
>
> I didn't actually mean it that way, but I can see how my words could
> reasonably be interpreted that way. I apologize.

Again, no worries.

>> There are two kinds of tests that I have done:
>>
>> First, there is the regression tests, t/004_verify_heapam.pl, which is obviously contrived.  That was included in
>> the regression test suite because it needed to be something other developers could read, verify, "yeah, I can see why
>> that would be corruption, and would give an error message of the sort the test expects", and then could be run to verify
>> that indeed that expected error message was generated.
>
> I still don't think that this is necessary. It could work for one type
> of corruption, that happens to not have any of the problems, but just
> testing that one type of corruption seems rather arbitrary to me.

As discussed with Robert off list, this probably doesn't matter.  The patch can be committed with or without this
particular TAP test.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Robert Haas
Date:
On Wed, May 13, 2020 at 5:33 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Do you recall seeing corruption resulting in segfaults in production?

I have seen that, I believe. I think it's more common to fail with
errors about not being able to palloc>1GB, not being able to look up
an xid or mxid, etc. but I am pretty sure I've seen multiple cases
involving seg faults, too. Unfortunately for my credibility, I can't
remember the details right now.

> I personally don't recall seeing that. If it happened, the segfaults
> themselves probably wouldn't be the main concern.

I don't really agree. Hypothetically speaking, suppose you corrupt
your only copy of a critical table in such a way that every time you
select from it, the system seg faults. A user in this situation might
ask questions like:

1. How did my table get corrupted?
2. Why do I only have one copy of it?
3. How do I retrieve the non-corrupted portion of my data from that
table and get back up and running?

In the grand scheme of things, #1 and #2 are the most important
questions, but when something like this actually happens, #3 tends to
be the most urgent question, and it's a lot harder to get the
uncorrupted data out if the system keeps crashing.

Also, a seg fault tends to lead customers to think that the database
has a bug, rather than that the database is corrupted.

Slightly off-topic here, but I think our error reporting in this area
is pretty lame. I've learned over the years that when a customer
reports that they get a complaint about a too-large memory allocation
every time they access a table, they've probably got a corrupted
varlena header. However, that's extremely non-obvious to a typical
user. We should try to report errors indicative of corruption in a way
that gives the user some clue that corruption has happened. Peter made
a stab at improving things there by adding
errcode(ERRCODE_DATA_CORRUPTED) in a bunch of places, but a lot of
users will never see the error code, only the message, and a lot of
corruption still produces errors that weren't changed by that
commit.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Wed, May 13, 2020 at 7:32 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I agree that this (a test tool that exercises our code against
> arbitrarily corrupted data pages) is not going to work as a test that
> all buildfarm members run -- it seems something for specialized
> buildfarm members to run, or even something that's run outside of the
> buildfarm, like sqlsmith.  Obviously such a tool would not be able to
> run against an assertion-enabled build, and we shouldn't even try.

I have a question about what you mean here by "arbitrarily."

If you mean that we shouldn't have the buildfarm run the proposed heap
corruption checker against heap pages full of randomly-generated
garbage, I tend to agree. Such a test wouldn't be very stable and
might fail in lots of low-probability ways that could require
unreasonable effort to find and fix.

If you mean that we shouldn't have the buildfarm run the proposed heap
corruption checker against any corrupted heap pages at all, I tend to
disagree. If we did that, then we'd basically be releasing a heap
corruption checker with very limited test coverage. Like, we shouldn't
only have negative test cases, where the absence of corruption
produces no results. We should also have positive test cases, where
the thing finds some problem...

At least, that's what I think.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Thu, May 14, 2020 at 11:33 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I have seen that, I believe. I think it's more common to fail with
> errors about not being able to palloc>1GB, not being able to look up
> an xid or mxid, etc. but I am pretty sure I've seen multiple cases
> involving seg faults, too. Unfortunately for my credibility, I can't
> remember the details right now.

I believe you, both in general, and also because what you're saying
here is plausible, even if it doesn't fit my own experience.

Corruption is by its very nature exceptional. At least, if that isn't
true then something must be seriously wrong, so the idea that it will
be different in some way each time seems like a good working
assumption. Your exceptional cases are not necessarily the same as
mine, especially where hardware problems are concerned. On the other
hand, it's also possible for corruption that originates from very
different sources to exhibit the same basic inconsistencies and
symptoms.

I've noticed that SLRU corruption is often a leading indicator of
general storage problems. The inconsistencies between certain SLRU
state and the heap happen to be far easier to notice in practice,
particularly when VACUUM runs. But it's not fundamentally different to
inconsistencies from pages within one single main fork of some heap
relation.

> > I personally don't recall seeing that. If it happened, the segfaults
> > themselves probably wouldn't be the main concern.
>
> I don't really agree. Hypothetically speaking, suppose you corrupt
> your only copy of a critical table in such a way that every time you
> select from it, the system seg faults. A user in this situation might
> ask questions like:

I agree that that could be a problem. But that's not what I've seen
happen in production systems myself.

Maybe there is some low hanging fruit here. Perhaps we can make the
real PageGetItemId() a little closer to PageGetItemIdCareful() without
noticeable overhead, as I suggested already. Are there any real
generalizations that we can make about why backends segfault with
corrupt data? Maybe there are. That seems important.
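
For a flavor of what that could mean, amcheck's btree code already has a PageGetItemIdCareful() that validates a line pointer before trusting it.  A purely illustrative sketch of that style of check (the function name and error wording here are invented, and this is not a proposed change to the core macro):

    /* Fetch a line pointer, erroring out instead of trusting garbage. */
    static ItemId
    page_get_itemid_careful(Page page, OffsetNumber offnum)
    {
        ItemId      itemid;

        if (offnum > PageGetMaxOffsetNumber(page))
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("line pointer %u past end of line pointer array",
                            (unsigned) offnum)));

        itemid = PageGetItemId(page, offnum);

        /* an item with storage must lie entirely within the page */
        if (ItemIdHasStorage(itemid) &&
            (ItemIdGetOffset(itemid) < SizeOfPageHeaderData ||
             ItemIdGetOffset(itemid) + ItemIdGetLength(itemid) > BLCKSZ))
            ereport(ERROR,
                    (errcode(ERRCODE_DATA_CORRUPTED),
                     errmsg("line pointer %u points outside the page",
                            (unsigned) offnum)));

        return itemid;
    }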

> Slightly off-topic here, but I think our error reporting in this area
> is pretty lame. I've learned over the years that when a customer
> reports that they get a complaint about a too-large memory allocation
> every time they access a table, they've probably got a corrupted
> varlena header.

I certainly learned the same lesson in the same way.

> However, that's extremely non-obvious to a typical
> user. We should try to report errors indicative of corruption in a way
> that gives the user some clue that corruption has happened. Peter made
> a stab at improving things there by adding
> errcode(ERRCODE_DATA_CORRUPTED) in a bunch of places, but a lot of
> users will never see the error code, only the message, and a lot of
> corruption still produces errors that weren't changed by that
> commit.

The theory is that "can't happen" errors have an errcode that should
be considered similar to or equivalent to ERRCODE_DATA_CORRUPTED. I
doubt that it works out that way in practice, though.

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Alvaro Herrera
Date:
On 2020-May-14, Robert Haas wrote:

> I have a question about what you mean here by "arbitrarily."
> 
> If you mean that we shouldn't have the buildfarm run the proposed heap
> corruption checker against heap pages full of randomly-generated
> garbage, I tend to agree. Such a test wouldn't be very stable and
> might fail in lots of low-probability ways that could require
> unreasonable effort to find and fix.

This is what I meant.  I was thinking of blocks generated randomly.

> If you mean that we shouldn't have the buildfarm run the proposed heap
> corruption checker against any corrupted heap pages at all, I tend to
> disagree.

Yeah, IMV those would not be arbitrarily corrupted -- instead they're
crafted to be corrupted in some specific way.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: new heapcheck contrib module

From
Tom Lane
Date:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> On 2020-May-14, Robert Haas wrote:
>> If you mean that we shouldn't have the buildfarm run the proposed heap
>> corruption checker against heap pages full of randomly-generated
>> garbage, I tend to agree. Such a test wouldn't be very stable and
>> might fail in lots of low-probability ways that could require
>> unreasonable effort to find and fix.

> This is what I meant.  I was thinking of blocks generated randomly.

Yeah, -1 for using random data --- when it fails, how you gonna
reproduce the problem?

>> If you mean that we shouldn't have the buildfarm run the proposed heap
>> corruption checker against any corrupted heap pages at all, I tend to
>> disagree.

> Yeah, IMV those would not be arbitrarily corrupted -- instead they're
> crafted to be corrupted in some specific way.

I think there's definitely value in corrupting data in some predictable
(reproducible) way and verifying that the check code catches it and
responds as expected.  Sure, this will not be 100% coverage, but it'll be
a lot better than 0% coverage.

            regards, tom lane



Re: new heapcheck contrib module

From
Peter Eisentraut
Date:
On 2020-05-11 19:21, Mark Dilger wrote:
> 1) A new module, pg_amcheck, which includes a command line client for checking a database or subset of a database.
> Internally it functions by querying the database for a list of tables which are appropriate given the command line
> switches, and then calls amcheck's functions to validate each table and/or index.  The options for selecting/excluding
> tables and schemas is patterned on pg_dump, on the assumption that interface is already familiar to users.
 

Why is this useful over just using the extension's functions via psql?

I suppose you could make an argument for a command-line wrapper around 
almost every admin-focused contrib module (pageinspect, pg_prewarm, 
pgstattuple, ...), but that doesn't seem very sensible.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On May 14, 2020, at 1:02 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
>
> On 2020-05-11 19:21, Mark Dilger wrote:
>> 1) A new module, pg_amcheck, which includes a command line client for checking a database or subset of a database.
>> Internally it functions by querying the database for a list of tables which are appropriate given the command line
>> switches, and then calls amcheck's functions to validate each table and/or index.  The options for selecting/excluding
>> tables and schemas is patterned on pg_dump, on the assumption that interface is already familiar to users.
>
> Why is this useful over just using the extension's functions via psql?

The tool doesn't hold a single snapshot or transaction for the lifetime of checking the entire database.  A future
improvement to the tool might add parallelism.  Users could do all of this in scripts, but having a single tool with the
most commonly useful options avoids duplication of effort.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Dilip Kumar
Date:
On Mon, May 11, 2020 at 10:51 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
> Here is v5 of the patch.  Major changes in this version include:
>
> 1) A new module, pg_amcheck, which includes a command line client for checking a database or subset of a database.
> Internally it functions by querying the database for a list of tables which are appropriate given the command line
> switches, and then calls amcheck's functions to validate each table and/or index.  The options for selecting/excluding
> tables and schemas is patterned on pg_dump, on the assumption that interface is already familiar to users.
>
> 2) amcheck's btree checking functions have been refactored to be able to operate in two modes; the original mode in
> which all errors are reported via ereport, and a new mode for returning errors as rows from a set returning function.
> The new mode is used by a new function verify_btreeam(), analogous to verify_heapam(), both of which are used by the
> pg_amcheck command line tool.
>
> 3) The regression test which generates corruption within a table uses the pageinspect module to determine the
> location of each tuple on disk for corrupting.  This was suggested upthread.
>
> Testing on the command line shows that the pre-existing btree checking code could use some hardening, as it currently
> crashes the backend on certain corruptions.  When I corrupt relation files for tables and indexes in the backend and
> then use pg_amcheck to check all objects in the database, I keep getting assertions from the btree checking code.  I
> think I need to harden this code, but wanted to post an updated patch and solicit opinions before doing so.  Here are
> some example problems I'm seeing.  Note the stack trace when calling from the command line tool includes the new
> verify_btreeam function, but you can get the same crashes using the old interface via psql:
>
> From psql, first error:
>
> test=# select bt_index_parent_check('corrupted_idx', true, true);
> TRAP: FailedAssertion("_bt_check_natts(rel, key->heapkeyspace, page, offnum)", File: "nbtsearch.c", Line: 663)
> 0   postgres                            0x0000000106872977 ExceptionalCondition + 103
> 1   postgres                            0x00000001063a33e2 _bt_compare + 1090
> 2   amcheck.so                          0x0000000106d62921 bt_target_page_check + 6033
> 3   amcheck.so                          0x0000000106d5fd2f bt_index_check_internal + 2847
> 4   amcheck.so                          0x0000000106d60433 bt_index_parent_check + 67
> 5   postgres                            0x00000001064d6762 ExecInterpExpr + 1634
> 6   postgres                            0x000000010650d071 ExecResult + 321
> 7   postgres                            0x00000001064ddc3d standard_ExecutorRun + 301
> 8   postgres                            0x00000001066600c5 PortalRunSelect + 389
> 9   postgres                            0x000000010665fc7f PortalRun + 527
> 10  postgres                            0x000000010665ed59 exec_simple_query + 1641
> 11  postgres                            0x000000010665c99d PostgresMain + 3661
> 12  postgres                            0x00000001065d6a8a BackendRun + 410
> 13  postgres                            0x00000001065d61c4 ServerLoop + 3044
> 14  postgres                            0x00000001065d2fe9 PostmasterMain + 3769
> 15  postgres                            0x000000010652e3b0 help + 0
> 16  libdyld.dylib                       0x00007fff6725fcc9 start + 1
> server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.
> The connection to the server was lost. Attempting reset: 2020-05-11 10:11:47.394 PDT [41091] LOG:  server process
> (PID 41309) was terminated by signal 6: Abort trap: 6
>
>
>
> From commandline, second error:
>
> pgtest % pg_amcheck -i test
> (relname=corrupted,blkno=0,offnum=16,lp_off=7680,lp_flags=1,lp_len=31,attnum=,chunk=)
> tuple xmin = 3289393 is in the future
> (relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
> tuple xmax = 0 precedes relation relminmxid = 1
> (relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
> tuple xmin = 12593 is in the future
> (relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=)
>
> <snip>
>
> (relname=corrupted,blkno=107,offnum=20,lp_off=7392,lp_flags=1,lp_len=34,attnum=,chunk=)
> tuple xmin = 306 precedes relation relfrozenxid = 487
> (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
> tuple xmax = 0 precedes relation relminmxid = 1
> (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
> tuple xmin = 305 precedes relation relfrozenxid = 487
> (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
> t_hoff > lp_len (54 > 34)
> (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=)
> t_hoff not max-aligned (54)
> TRAP: FailedAssertion("TransactionIdIsValid(xmax)", File: "heapam_visibility.c", Line: 1319)
> 0   postgres                            0x0000000105b22977 ExceptionalCondition + 103
> 1   postgres                            0x0000000105636e86 HeapTupleSatisfiesVacuum + 1158
> 2   postgres                            0x0000000105634aa1 heapam_index_build_range_scan + 1089
> 3   amcheck.so                          0x00000001060100f3 bt_index_check_internal + 3811
> 4   amcheck.so                          0x000000010601057c verify_btreeam + 316
> 5   postgres                            0x0000000105796266 ExecMakeTableFunctionResult + 422
> 6   postgres                            0x00000001057a8c35 FunctionNext + 101
> 7   postgres                            0x00000001057bbf3e ExecNestLoop + 478
> 8   postgres                            0x000000010578dc3d standard_ExecutorRun + 301
> 9   postgres                            0x00000001059100c5 PortalRunSelect + 389
> 10  postgres                            0x000000010590fc7f PortalRun + 527
> 11  postgres                            0x000000010590ed59 exec_simple_query + 1641
> 12  postgres                            0x000000010590c99d PostgresMain + 3661
> 13  postgres                            0x0000000105886a8a BackendRun + 410
> 14  postgres                            0x00000001058861c4 ServerLoop + 3044
> 15  postgres                            0x0000000105882fe9 PostmasterMain + 3769
> 16  postgres                            0x00000001057de3b0 help + 0
> 17  libdyld.dylib                       0x00007fff6725fcc9 start + 1
> pg_amcheck: error: query failed: server closed the connection unexpectedly
>         This probably means the server terminated abnormally
>         before or while processing the request.

I have just browsed through the patch and the idea is quite
interesting.  I think we can expand it to check that whether the flags
set in the infomask are sane or not w.r.t other flags and xid status.
Some examples are

- If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED
should not be set in new_infomask2.
- If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we
actually cross verify the transaction status from the CLOG and check
whether is matching the hint bit or not.

While browsing through the code I could not find that we are doing
this kind of check,  ignore if we are already checking this.
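
As a rough sketch, the first of those checks might look something like the following in verify_heapam's per-tuple checking style (confess() and the ctx structure are from the patch under discussion; the message wording here is invented):

    /* A locker-only xmax must not also be flagged as a key update. */
    if ((ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
        (ctx->tuphdr->t_infomask2 & HEAP_KEYS_UPDATED))
        confess(ctx,
                pstrdup("HEAP_XMAX_LOCK_ONLY and HEAP_KEYS_UPDATED are both set"));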

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On May 11, 2020, at 10:21 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> <v5-0001-Adding-verify_heapam-and-pg_amcheck.patch>

Rebased with some whitespace fixes, but otherwise unmodified from v5.



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jun 11, 2020, at 9:14 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have just browsed through the patch and the idea is quite
> interesting.  I think we can expand it to check that whether the flags
> set in the infomask are sane or not w.r.t other flags and xid status.
> Some examples are
>
> - If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED
> should not be set in new_infomask2.
> - If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we
> actually cross verify the transaction status from the CLOG and check
> whether is matching the hint bit or not.
>
> While browsing through the code I could not find that we are doing
> this kind of check,  ignore if we are already checking this.

Thanks for taking a look!

Having both of those bits set simultaneously appears to fall into a different category than what I wrote
verify_heapam.c to detect.  It doesn't violate any assertion in the backend, nor does it cause the code to crash.  (At
least, I don't immediately see how it does either of those things.)  At first glance it appears invalid to have those
bits both set simultaneously, but I'm hesitant to enforce that without good reason.  If it is a good thing to enforce,
should we also change the backend code to Assert?

I integrated your idea into one of the regression tests.  It now sets these two bits in the header of one of the rows
in a table.  The verify_heapam check output (which includes all detected corruptions) does not change, which verifies
your observation that verify_heapam is not checking for this.  I've attached that as a patch to this email.  Note that
this patch should be applied atop the v6 patch recently posted in another email.



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Dilip Kumar
Date:
On Fri, Jun 12, 2020 at 12:40 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
>
>
> > On Jun 11, 2020, at 9:14 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have just browsed through the patch and the idea is quite
> > interesting.  I think we can expand it to check that whether the flags
> > set in the infomask are sane or not w.r.t other flags and xid status.
> > Some examples are
> >
> > - If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED
> > should not be set in new_infomask2.
> > - If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we
> > actually cross verify the transaction status from the CLOG and check
> > whether is matching the hint bit or not.
> >
> > While browsing through the code I could not find that we are doing
> > this kind of check,  ignore if we are already checking this.
>
> Thanks for taking a look!
>
> Having both of those bits set simultaneously appears to fall into a different category than what I wrote
> verify_heapam.c to detect.

Ok

> It doesn't violate any assertion in the backend, nor does it cause
> the code to crash.  (At least, I don't immediately see how it does
> either of those things.)  At first glance it appears invalid to have
> those bits both set simultaneously, but I'm hesitant to enforce that
> without good reason.  If it is a good thing to enforce, should we also
> change the backend code to Assert?

Yeah, it may not hit assert or crash but it could lead to a wrong
result.  But I agree that it could be an assertion in the backend
code.  What about the other check, like hint bit is saying the
transaction is committed but actually as per the clog the status is
something else.  I think in general processing it is hard to check
such things in backend no? because if the hint bit is set saying that
the transaction is committed then we will directly check its
visibility with the snapshot.  I think a corruption checker may be a
good tool for catching such anomalies.

> I integrated your idea into one of the regression tests.  It now sets these two bits in the header of one of the rows
> in a table.  The verify_heapam check output (which includes all detected corruptions) does not change, which verifies
> your observation that verify_heapam is not checking for this.  I've attached that as a patch to this email.  Note
> that this patch should be applied atop the v6 patch recently posted in another email.

Ok.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jun 11, 2020, at 11:35 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Fri, Jun 12, 2020 at 12:40 AM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>>
>>
>>
>>> On Jun 11, 2020, at 9:14 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>>
>>> I have just browsed through the patch and the idea is quite
>>> interesting.  I think we can expand it to check that whether the flags
>>> set in the infomask are sane or not w.r.t other flags and xid status.
>>> Some examples are
>>>
>>> - If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED
>>> should not be set in new_infomask2.
>>> - If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we
>>> actually cross verify the transaction status from the CLOG and check
>>> whether is matching the hint bit or not.
>>>
>>> While browsing through the code I could not find that we are doing
>>> this kind of check,  ignore if we are already checking this.
>>
>> Thanks for taking a look!
>>
>> Having both of those bits set simultaneously appears to fall into a different category than what I wrote
>> verify_heapam.c to detect.
>
> Ok
>
>
>>  It doesn't violate any assertion in the backend, nor does it cause
>> the code to crash.  (At least, I don't immediately see how it does
>> either of those things.)  At first glance it appears invalid to have
>> those bits both set simultaneously, but I'm hesitant to enforce that
>> without good reason.  If it is a good thing to enforce, should we also
>> change the backend code to Assert?
>
> Yeah, it may not hit assert or crash but it could lead to a wrong
> result.  But I agree that it could be an assertion in the backend
> code.

For v7, I've added an assertion for this.  Per heap/README.tuplock, "We currently never set the HEAP_XMAX_COMMITTED
when the HEAP_XMAX_IS_MULTI bit is set."  I added an assertion for that, too.  Both new assertions are in
RelationPutHeapTuple().  I'm not sure if that is the best place to put the assertion, but I am confident that the
assertion needs to only check tuples destined for disk, as in-memory tuples can and do violate the assertion.
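
A rough sketch of the two conditions being asserted (the helper name here is hypothetical; the actual v7 patch places plain Assert() calls in RelationPutHeapTuple(), and its exact wording may differ):

    /* Consistency conditions checked just before a tuple is written out. */
    static inline void
    AssertTupleInfomaskConsistency(HeapTupleHeader tuphdr)
    {
        /* a locker-only xmax must not also be marked as a key update */
        Assert(!((tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
                 (tuphdr->t_infomask2 & HEAP_KEYS_UPDATED)));

        /* HEAP_XMAX_COMMITTED is never set together with HEAP_XMAX_IS_MULTI */
        Assert(!((tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
                 (tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)));
    }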

Also for v7, I've updated contrib/amcheck to report these two conditions as corruption.

> What about the other check, like hint bit is saying the
> transaction is committed but actually as per the clog the status is
> something else.  I think in general processing it is hard to check
> such things in backend no? because if the hint bit is set saying that
> the transaction is committed then we will directly check its
> visibility with the snapshot.  I think a corruption checker may be a
> good tool for catching such anomalies.

I already made some design changes to this patch to avoid taking the CLogTruncationLock too often.  I'm happy to
incorporate this idea, but perhaps you could provide a design on how to do it without all the extra locking?  If not, I
can try to get this into v8 as an optional check, so users can turn it on at their discretion.  Having the check enabled
by default is probably a non-starter.
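
For concreteness, the optional check being discussed might look roughly like this (a sketch only; it deliberately omits the locking needed to keep clog from being truncated away underneath the lookup, which is exactly the expensive part being debated):

    /* Cross-check the xmin-committed hint bit against clog. */
    static void
    check_xmin_committed_hint(HeapCheckContext *ctx)
    {
        TransactionId   xmin = HeapTupleHeaderGetXmin(ctx->tuphdr);

        if ((ctx->tuphdr->t_infomask & HEAP_XMIN_COMMITTED) &&
            TransactionIdIsNormal(xmin) &&
            !TransactionIdDidCommit(xmin))
            confess(ctx,
                    psprintf("tuple xmin %u is hinted committed, but clog disagrees",
                             xmin));
    }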



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module (typos)

From
Erik Rijkers
Date:
On 2020-06-12 23:06, Mark Dilger wrote:

> [v7-0001-Adding-verify_heapam-and-pg_amcheck.patch]
> [v7-0002-Adding-checks-o...ations-of-hint-bit.patch]

I came across these typos in the sgml:

--exclude-scheam   should be
--exclude-schema

<option>table</option>     should be
<option>--table</option>


I found this connection problem (or perhaps it is as designed):

$ env | grep ^PG
PGPORT=6965
PGPASSFILE=/home/aardvark/.pg_aardvark
PGDATABASE=testdb
PGDATA=/home/aardvark/pg_stuff/pg_installations/pgsql.amcheck/data

-- just to show that psql is connecting (via $PGPASSFILE and $PGPORT and 
$PGDATABASE):
-- and showing a  table t  that I made earlier

$ psql
SET
Timing is on.
psql (14devel_amcheck_0612_2f48)
Type "help" for help.

testdb=# \dt+ t
                            List of relations
  Schema | Name | Type  |  Owner   | Persistence |  Size  | Description
--------+------+-------+----------+-------------+--------+-------------
  public | t    | table | aardvark | permanent   | 346 MB |
(1 row)

testdb=# \q

I think this should work:

$ pg_amcheck -i -t t
pg_amcheck: error: no matching tables were found

It seems a bug that I have to add  '-d testdb':

This works OK:
pg_amcheck -i -t t -d testdb

Is that error as expected?


thanks,

Erik Rijkers



Re: new heapcheck contrib module (typos)

From
Mark Dilger
Date:

> On Jun 13, 2020, at 2:13 PM, Erik Rijkers <er@xs4all.nl> wrote:

Thanks for the review!

> On 2020-06-12 23:06, Mark Dilger wrote:
>
>> [v7-0001-Adding-verify_heapam-and-pg_amcheck.patch]
>> [v7-0002-Adding-checks-o...ations-of-hint-bit.patch]
>
> I came across these typos in the sgml:
>
> --exclude-scheam   should be
> --exclude-schema
>
> <option>table</option>     should be
> <option>--table</option>

Yeah, I agree and have made these changes for v8.

> I found this connection problem (or perhaps it is as designed):
>
> $ env | grep ^PG
> PGPORT=6965
> PGPASSFILE=/home/aardvark/.pg_aardvark
> PGDATABASE=testdb
> PGDATA=/home/aardvark/pg_stuff/pg_installations/pgsql.amcheck/data
>
> -- just to show that psql is connecting (via $PGPASSFILE and $PGPORT and $PGDATABASE):
> -- and showing a  table t  that I made earlier
>
> $ psql
> SET
> Timing is on.
> psql (14devel_amcheck_0612_2f48)
> Type "help" for help.
>
> testdb=# \dt+ t
>                           List of relations
> Schema | Name | Type  |  Owner   | Persistence |  Size  | Description
> --------+------+-------+----------+-------------+--------+-------------
> public | t    | table | aardvark | permanent   | 346 MB |
> (1 row)
>
> testdb=# \q
>
> I think this should work:
>
> $ pg_amcheck -i -t t
> pg_amcheck: error: no matching tables were found
>
> It seems a bug that I have to add  '-d testdb':
>
> This works OK:
> pg_amcheck -i -t t -d testdb
>
> Is that error as expected?

It was expected, but looking more broadly at other tools, your expectations seem to be more typical.  I've changed it
in v8.  Thanks again for having a look at this patch!

Note that I've merged the two patches (v7-0001 and v7-0002) back into a single patch, since the separation introduced in
v7 was only for illustration of changes in v7.



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Dilip Kumar
Date:
On Sat, Jun 13, 2020 at 2:36 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
>
>
> > On Jun 11, 2020, at 11:35 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > On Fri, Jun 12, 2020 at 12:40 AM Mark Dilger
> > <mark.dilger@enterprisedb.com> wrote:
> >>
> >>
> >>
> >>> On Jun 11, 2020, at 9:14 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >>>
> >>> I have just browsed through the patch and the idea is quite
> >>> interesting.  I think we can expand it to check that whether the flags
> >>> set in the infomask are sane or not w.r.t other flags and xid status.
> >>> Some examples are
> >>>
> >>> - If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED
> >>> should not be set in new_infomask2.
> >>> - If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we
> >>> actually cross verify the transaction status from the CLOG and check
> >>> whether is matching the hint bit or not.
> >>>
> >>> While browsing through the code I could not find that we are doing
> >>> this kind of check,  ignore if we are already checking this.
> >>
> >> Thanks for taking a look!
> >>
> >> Having both of those bits set simultaneously appears to fall into a different category than what I wrote
> >> verify_heapam.c to detect.
> >
> > Ok
> >
> >
> >>  It doesn't violate any assertion in the backend, nor does it cause
> >> the code to crash.  (At least, I don't immediately see how it does
> >> either of those things.)  At first glance it appears invalid to have
> >> those bits both set simultaneously, but I'm hesitant to enforce that
> >> without good reason.  If it is a good thing to enforce, should we also
> >> change the backend code to Assert?
> >
> > Yeah, it may not hit assert or crash but it could lead to a wrong
> > result.  But I agree that it could be an assertion in the backend
> > code.
>
> For v7, I've added an assertion for this.  Per heap/README.tuplock, "We currently never set the HEAP_XMAX_COMMITTED
> when the HEAP_XMAX_IS_MULTI bit is set."  I added an assertion for that, too.  Both new assertions are in
> RelationPutHeapTuple().  I'm not sure if that is the best place to put the assertion, but I am confident that the
> assertion needs to only check tuples destined for disk, as in-memory tuples can and do violate the assertion.
>
> Also for v7, I've updated contrib/amcheck to report these two conditions as corruption.
>
> > What about the other check, like hint bit is saying the
> > transaction is committed but actually as per the clog the status is
> > something else.  I think in general processing it is hard to check
> > such things in backend no? because if the hint bit is set saying that
> > the transaction is committed then we will directly check its
> > visibility with the snapshot.  I think a corruption checker may be a
> > good tool for catching such anomalies.
>
> I already made some design changes to this patch to avoid taking the CLogTruncationLock too often.  I'm happy to
> incorporate this idea, but perhaps you could provide a design on how to do it without all the extra locking?  If not, I
> can try to get this into v8 as an optional check, so users can turn it on at their discretion.  Having the check enabled
> by default is probably a non-starter.

Okay, even I can't think of a way to do it without extra locking.

I have looked into 0001 patch and I have a few comments.

1.
+
+ /* Skip over unused/dead/redirected line pointers */
+ if (!ItemIdIsUsed(ctx.itemid) ||
+ ItemIdIsDead(ctx.itemid) ||
+ ItemIdIsRedirected(ctx.itemid))
+ continue;

Isn't it a good idea to verify the Redirected Itemtid?  Because we
will still access the redirected item id to find the
actual tuple from the index scan.  Maybe not exactly at this level,
but we can verify that the link itemid store in that
is within the itemid range of the page or not.

2.

+ /* Check for tuple header corruption */
+ if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader)
+ {
+ confess(ctx,
+ psprintf("t_hoff < SizeofHeapTupleHeader (%u < %u)",
+ ctx->tuphdr->t_hoff,
+ (unsigned) SizeofHeapTupleHeader));
+ fatal = true;
+ }

I think we can also check that if there is no NULL attributes (if
(!(t_infomask & HEAP_HASNULL)) then
ctx->tuphdr->t_hoff should be equal to SizeofHeapTupleHeader.


3.
+ ctx->offset = 0;
+ for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++)
+ {
+ if (!check_tuple_attribute(ctx))
+ break;
+ }
+ ctx->offset = -1;
+ ctx->attnum = -1;

So we are first setting ctx->offset to 0, then inside
check_tuple_attribute, we will keep updating the offset as we process
the attributes and after the loop is over we set ctx->offset to -1,  I
did not understand that why we need to reset it to -1, do we ever
check for that.  We don't even initialize the ctx->offset to -1 while
initializing the context for the tuple so I do not understand what is
the meaning of the random value -1.

4.
+ if (!VARATT_IS_EXTENDED(chunk))
+ {
+ chunksize = VARSIZE(chunk) - VARHDRSZ;
+ chunkdata = VARDATA(chunk);
+ }
+ else if (VARATT_IS_SHORT(chunk))
+ {
+ /*
+ * could happen due to heap_form_tuple doing its thing
+ */
+ chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
+ chunkdata = VARDATA_SHORT(chunk);
+ }
+ else
+ {
+ /* should never happen */
+ confess(ctx,
+ pstrdup("toast chunk is neither short nor extended"));
+ return;
+ }

I think the error message "toast chunk is neither short nor extended".
Because ideally, the toast chunk should not be further toasted.
So I think the check is correct, but the error message is not correct.

5.

+ ctx.rel = relation_open(relid, ShareUpdateExclusiveLock);
+ check_relation_relkind_and_relam(ctx.rel);
+
+ /*
+ * Open the toast relation, if any, also protected from concurrent
+ * vacuums.
+ */
+ if (ctx.rel->rd_rel->reltoastrelid)
+ {
+ int offset;
+
+ /* Main relation has associated toast relation */
+ ctx.toastrel = table_open(ctx.rel->rd_rel->reltoastrelid,
+   ShareUpdateExclusiveLock);
+ offset = toast_open_indexes(ctx.toastrel,
....
+ if (TransactionIdIsNormal(ctx.relfrozenxid) &&
+ TransactionIdPrecedes(ctx.relfrozenxid, ctx.oldestValidXid))
+ {
+ confess(&ctx, psprintf("relfrozenxid %u precedes global "
+    "oldest valid xid %u ",
+    ctx.relfrozenxid, ctx.oldestValidXid));
+ PG_RETURN_NULL();
+ }

Don't we need to close the relation/toastrel/toastindexrel in such
return which is without an abort? IIRC, we
will get relcache leak WARNING on commit if we left them open in commit path.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jun 21, 2020, at 2:54 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> I have looked into 0001 patch and I have a few comments.
>
> 1.
> +
> + /* Skip over unused/dead/redirected line pointers */
> + if (!ItemIdIsUsed(ctx.itemid) ||
> + ItemIdIsDead(ctx.itemid) ||
> + ItemIdIsRedirected(ctx.itemid))
> + continue;
>
> Isn't it a good idea to verify the Redirected Itemtid?  Because we
> will still access the redirected item id to find the
> actual tuple from the index scan.  Maybe not exactly at this level,
> but we can verify that the link itemid store in that
> is within the itemid range of the page or not.

Good idea.  I've added checks that the redirection is valid, both in terms of being within bounds and in terms of
alignment.
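
In the same spirit, a purely illustrative sketch of validating a redirect (not the exact code added in v9; the final check that the target is a used, non-redirect item is an extra assumption on top of what is described above):

    /* Return true if a redirect line pointer points at something sane. */
    static bool
    redirect_is_sane(Page page, ItemId itemid)
    {
        OffsetNumber    rdoffnum;
        ItemId          rditem;

        if (!ItemIdIsRedirected(itemid))
            return true;

        /* the redirect target must be within the line pointer array */
        rdoffnum = ItemIdGetRedirect(itemid);
        if (rdoffnum < FirstOffsetNumber ||
            rdoffnum > PageGetMaxOffsetNumber(page))
            return false;

        /* and should itself be a used, non-redirect line pointer */
        rditem = PageGetItemId(page, rdoffnum);
        if (!ItemIdIsUsed(rditem) || ItemIdIsRedirected(rditem))
            return false;

        return true;
    }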

> 2.
>
> + /* Check for tuple header corruption */
> + if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader)
> + {
> + confess(ctx,
> + psprintf("t_hoff < SizeofHeapTupleHeader (%u < %u)",
> + ctx->tuphdr->t_hoff,
> + (unsigned) SizeofHeapTupleHeader));
> + fatal = true;
> + }
>
> I think we can also check that if there is no NULL attributes (if
> (!(t_infomask & HEAP_HASNULL)) then
> ctx->tuphdr->t_hoff should be equal to SizeofHeapTupleHeader.

You have to take alignment padding into account, but otherwise yes, and I've added a check for that.
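
For example (a sketch only; the check actually added in v9 may be phrased differently), inside the tuple-header checks one might write:

    /* With no null bitmap, t_hoff should be exactly the aligned header size. */
    if (!(ctx->tuphdr->t_infomask & HEAP_HASNULL) &&
        ctx->tuphdr->t_hoff != MAXALIGN(SizeofHeapTupleHeader))
        confess(ctx,
                psprintf("tuple without nulls has t_hoff %u, expected %u",
                         (unsigned) ctx->tuphdr->t_hoff,
                         (unsigned) MAXALIGN(SizeofHeapTupleHeader)));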

> 3.
> + ctx->offset = 0;
> + for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++)
> + {
> + if (!check_tuple_attribute(ctx))
> + break;
> + }
> + ctx->offset = -1;
> + ctx->attnum = -1;
>
> So we are first setting ctx->offset to 0, then inside
> check_tuple_attribute, we will keep updating the offset as we process
> the attributes and after the loop is over we set ctx->offset to -1,  I
> did not understand that why we need to reset it to -1, do we ever
> check for that.  We don't even initialize the ctx->offset to -1 while
> initializing the context for the tuple so I do not understand what is
> the meaning of the random value -1.

Ahh, right, those are left over from a previous design of the code.  Thanks for pointing them out.  They are now
removed.

> 4.
> + if (!VARATT_IS_EXTENDED(chunk))
> + {
> + chunksize = VARSIZE(chunk) - VARHDRSZ;
> + chunkdata = VARDATA(chunk);
> + }
> + else if (VARATT_IS_SHORT(chunk))
> + {
> + /*
> + * could happen due to heap_form_tuple doing its thing
> + */
> + chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
> + chunkdata = VARDATA_SHORT(chunk);
> + }
> + else
> + {
> + /* should never happen */
> + confess(ctx,
> + pstrdup("toast chunk is neither short nor extended"));
> + return;
> + }
>
> I think the error message "toast chunk is neither short nor extended".
> Because ideally, the toast chunk should not be further toasted.
> So I think the check is correct, but the error message is not correct.

I agree the error message was wrongly stated, and I've changed it, but you might suggest a better wording than what I
came up with, "corrupt toast chunk va_header".

> 5.
>
> + ctx.rel = relation_open(relid, ShareUpdateExclusiveLock);
> + check_relation_relkind_and_relam(ctx.rel);
> +
> + /*
> + * Open the toast relation, if any, also protected from concurrent
> + * vacuums.
> + */
> + if (ctx.rel->rd_rel->reltoastrelid)
> + {
> + int offset;
> +
> + /* Main relation has associated toast relation */
> + ctx.toastrel = table_open(ctx.rel->rd_rel->reltoastrelid,
> +   ShareUpdateExclusiveLock);
> + offset = toast_open_indexes(ctx.toastrel,
> ....
> + if (TransactionIdIsNormal(ctx.relfrozenxid) &&
> + TransactionIdPrecedes(ctx.relfrozenxid, ctx.oldestValidXid))
> + {
> + confess(&ctx, psprintf("relfrozenxid %u precedes global "
> +    "oldest valid xid %u ",
> +    ctx.relfrozenxid, ctx.oldestValidXid));
> + PG_RETURN_NULL();
> + }
>
> Don't we need to close the relation/toastrel/toastindexrel in such
> return which is without an abort? IIRC, we
> will get relcache leak WARNING on commit if we left them open in commit path.

Ok, I've added logic to close them.

All changes inspired by your review are included in the v9-0001 patch.  The differences since v8 are pulled out into
v9_diffs for easier review.



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Dilip Kumar
Date:
On Mon, Jun 22, 2020 at 5:44 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
>
>
> > On Jun 21, 2020, at 2:54 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > I have looked into 0001 patch and I have a few comments.
> >
> > 1.
> > +
> > + /* Skip over unused/dead/redirected line pointers */
> > + if (!ItemIdIsUsed(ctx.itemid) ||
> > + ItemIdIsDead(ctx.itemid) ||
> > + ItemIdIsRedirected(ctx.itemid))
> > + continue;
> >
> > Isn't it a good idea to verify the Redirected Itemtid?  Because we
> > will still access the redirected item id to find the
> > actual tuple from the index scan.  Maybe not exactly at this level,
> > but we can verify that the link itemid store in that
> > is within the itemid range of the page or not.
>
> Good idea.  I've added checks that the redirection is valid, both in terms of being within bounds and in terms of
alignment.
>
> > 2.
> >
> > + /* Check for tuple header corruption */
> > + if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader)
> > + {
> > + confess(ctx,
> > + psprintf("t_hoff < SizeofHeapTupleHeader (%u < %u)",
> > + ctx->tuphdr->t_hoff,
> > + (unsigned) SizeofHeapTupleHeader));
> > + fatal = true;
> > + }
> >
> > I think we can also check that if there is no NULL attributes (if
> > (!(t_infomask & HEAP_HASNULL)) then
> > ctx->tuphdr->t_hoff should be equal to SizeofHeapTupleHeader.
>
> You have to take alignment padding into account, but otherwise yes, and I've added a check for that.
>
> > 3.
> > + ctx->offset = 0;
> > + for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++)
> > + {
> > + if (!check_tuple_attribute(ctx))
> > + break;
> > + }
> > + ctx->offset = -1;
> > + ctx->attnum = -1;
> >
> > So we are first setting ctx->offset to 0, then inside
> > check_tuple_attribute, we will keep updating the offset as we process
> > the attributes and after the loop is over we set ctx->offset to -1,  I
> > did not understand that why we need to reset it to -1, do we ever
> > check for that.  We don't even initialize the ctx->offset to -1 while
> > initializing the context for the tuple so I do not understand what is
> > the meaning of the random value -1.
>
> Ahh, right, those are left over from a previous design of the code.  Thanks for pointing them out.  They are now
removed.
>
> > 4.
> > + if (!VARATT_IS_EXTENDED(chunk))
> > + {
> > + chunksize = VARSIZE(chunk) - VARHDRSZ;
> > + chunkdata = VARDATA(chunk);
> > + }
> > + else if (VARATT_IS_SHORT(chunk))
> > + {
> > + /*
> > + * could happen due to heap_form_tuple doing its thing
> > + */
> > + chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
> > + chunkdata = VARDATA_SHORT(chunk);
> > + }
> > + else
> > + {
> > + /* should never happen */
> > + confess(ctx,
> > + pstrdup("toast chunk is neither short nor extended"));
> > + return;
> > + }
> >
> > I think the error message "toast chunk is neither short nor extended".
> > Because ideally, the toast chunk should not be further toasted.
> > So I think the check is correct, but the error message is not correct.
>
> I agree the error message was wrongly stated, and I've changed it, but you might suggest a better wording than what I
> came up with, "corrupt toast chunk va_header".
 
>
> > 5.
> >
> > + ctx.rel = relation_open(relid, ShareUpdateExclusiveLock);
> > + check_relation_relkind_and_relam(ctx.rel);
> > +
> > + /*
> > + * Open the toast relation, if any, also protected from concurrent
> > + * vacuums.
> > + */
> > + if (ctx.rel->rd_rel->reltoastrelid)
> > + {
> > + int offset;
> > +
> > + /* Main relation has associated toast relation */
> > + ctx.toastrel = table_open(ctx.rel->rd_rel->reltoastrelid,
> > +   ShareUpdateExclusiveLock);
> > + offset = toast_open_indexes(ctx.toastrel,
> > ....
> > + if (TransactionIdIsNormal(ctx.relfrozenxid) &&
> > + TransactionIdPrecedes(ctx.relfrozenxid, ctx.oldestValidXid))
> > + {
> > + confess(&ctx, psprintf("relfrozenxid %u precedes global "
> > +    "oldest valid xid %u ",
> > +    ctx.relfrozenxid, ctx.oldestValidXid));
> > + PG_RETURN_NULL();
> > + }
> >
> > Don't we need to close the relation/toastrel/toastindexrel in such
> > return which is without an abort? IIRC, we
> > will get relcache leak WARNING on commit if we left them open in commit path.
>
> Ok, I've added logic to close them.
>
> All changes inspired by your review are included in the v9-0001 patch.  The differences since v8 are pulled out into
> v9_diffs for easier review.
 

I have reviewed the changes in v9_diffs and looks fine to me.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: new heapcheck contrib module

From
Dilip Kumar
Date:
On Sun, Jun 28, 2020 at 8:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Jun 22, 2020 at 5:44 AM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
> >
> >
> >
> > > On Jun 21, 2020, at 2:54 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> > >
> > > I have looked into 0001 patch and I have a few comments.
> > >
> > > 1.
> > > +
> > > + /* Skip over unused/dead/redirected line pointers */
> > > + if (!ItemIdIsUsed(ctx.itemid) ||
> > > + ItemIdIsDead(ctx.itemid) ||
> > > + ItemIdIsRedirected(ctx.itemid))
> > > + continue;
> > >
> > > Isn't it a good idea to verify the Redirected Itemtid?  Because we
> > > will still access the redirected item id to find the
> > > actual tuple from the index scan.  Maybe not exactly at this level,
> > > but we can verify that the link itemid store in that
> > > is within the itemid range of the page or not.
> >
> > Good idea.  I've added checks that the redirection is valid, both in terms of being within bounds and in terms of
alignment.
> >
> > > 2.
> > >
> > > + /* Check for tuple header corruption */
> > > + if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader)
> > > + {
> > > + confess(ctx,
> > > + psprintf("t_hoff < SizeofHeapTupleHeader (%u < %u)",
> > > + ctx->tuphdr->t_hoff,
> > > + (unsigned) SizeofHeapTupleHeader));
> > > + fatal = true;
> > > + }
> > >
> > > I think we can also check that if there is no NULL attributes (if
> > > (!(t_infomask & HEAP_HASNULL)) then
> > > ctx->tuphdr->t_hoff should be equal to SizeofHeapTupleHeader.
> >
> > You have to take alignment padding into account, but otherwise yes, and I've added a check for that.
> >
> > > 3.
> > > + ctx->offset = 0;
> > > + for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++)
> > > + {
> > > + if (!check_tuple_attribute(ctx))
> > > + break;
> > > + }
> > > + ctx->offset = -1;
> > > + ctx->attnum = -1;
> > >
> > > So we are first setting ctx->offset to 0, then inside
> > > check_tuple_attribute, we will keep updating the offset as we process
> > > the attributes and after the loop is over we set ctx->offset to -1,  I
> > > did not understand that why we need to reset it to -1, do we ever
> > > check for that.  We don't even initialize the ctx->offset to -1 while
> > > initializing the context for the tuple so I do not understand what is
> > > the meaning of the random value -1.
> >
> > Ahh, right, those are left over from a previous design of the code.  Thanks for pointing them out.  They are now
removed.
> >
> > > 4.
> > > + if (!VARATT_IS_EXTENDED(chunk))
> > > + {
> > > + chunksize = VARSIZE(chunk) - VARHDRSZ;
> > > + chunkdata = VARDATA(chunk);
> > > + }
> > > + else if (VARATT_IS_SHORT(chunk))
> > > + {
> > > + /*
> > > + * could happen due to heap_form_tuple doing its thing
> > > + */
> > > + chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
> > > + chunkdata = VARDATA_SHORT(chunk);
> > > + }
> > > + else
> > > + {
> > > + /* should never happen */
> > > + confess(ctx,
> > > + pstrdup("toast chunk is neither short nor extended"));
> > > + return;
> > > + }
> > >
> > > I think the error message "toast chunk is neither short nor extended".
> > > Because ideally, the toast chunk should not be further toasted.
> > > So I think the check is correct, but the error message is not correct.
> >
> > I agree the error message was wrongly stated, and I've changed it, but you might suggest a better wording than what
> > I came up with, "corrupt toast chunk va_header".
 
> >
> > > 5.
> > >
> > > + ctx.rel = relation_open(relid, ShareUpdateExclusiveLock);
> > > + check_relation_relkind_and_relam(ctx.rel);
> > > +
> > > + /*
> > > + * Open the toast relation, if any, also protected from concurrent
> > > + * vacuums.
> > > + */
> > > + if (ctx.rel->rd_rel->reltoastrelid)
> > > + {
> > > + int offset;
> > > +
> > > + /* Main relation has associated toast relation */
> > > + ctx.toastrel = table_open(ctx.rel->rd_rel->reltoastrelid,
> > > +   ShareUpdateExclusiveLock);
> > > + offset = toast_open_indexes(ctx.toastrel,
> > > ....
> > > + if (TransactionIdIsNormal(ctx.relfrozenxid) &&
> > > + TransactionIdPrecedes(ctx.relfrozenxid, ctx.oldestValidXid))
> > > + {
> > > + confess(&ctx, psprintf("relfrozenxid %u precedes global "
> > > +    "oldest valid xid %u ",
> > > +    ctx.relfrozenxid, ctx.oldestValidXid));
> > > + PG_RETURN_NULL();
> > > + }
> > >
> > > Don't we need to close the relation/toastrel/toastindexrel in such
> > > return which is without an abort? IIRC, we
> > > will get relcache leak WARNING on commit if we left them open in commit path.
> >
> > Ok, I've added logic to close them.
> >
> > All changes inspired by your review are included in the v9-0001 patch.  The differences since v8 are pulled out
> > into v9_diffs for easier review.
 
>
> I have reviewed the changes in v9_diffs and looks fine to me.

Some more comments on v9_0001.
1.
+ LWLockAcquire(XidGenLock, LW_SHARED);
+ nextFullXid = ShmemVariableCache->nextFullXid;
+ ctx.oldestValidXid = ShmemVariableCache->oldestXid;
+ LWLockRelease(XidGenLock);
+ ctx.nextKnownValidXid = XidFromFullTransactionId(nextFullXid);
...
...
+
+ for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++)
+ {
+ int32 mapbits;
+ OffsetNumber maxoff;
+ PageHeader ph;
+
+ /* Optionally skip over all-frozen or all-visible blocks */
+ if (skip_all_frozen || skip_all_visible)
+ {
+ mapbits = (int32) visibilitymap_get_status(ctx.rel, ctx.blkno,
+    &vmbuffer);
+ if (skip_all_visible && (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
+ continue;
+ if (skip_all_frozen && (mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
+ continue;
+ }
+
+ /* Read and lock the next page. */
+ ctx.buffer = ReadBufferExtended(ctx.rel, MAIN_FORKNUM, ctx.blkno,
+ RBM_NORMAL, ctx.bstrategy);
+ LockBuffer(ctx.buffer, BUFFER_LOCK_SHARE);

I might be missing something, but it appears that first we are getting
the nextFullXid and after that, we are scanning the block by block.
So while we are scanning the block if the nextXid is advanced and it
has updated some tuple in the heap pages,  then it seems the current
logic will complain about out of range xid.  I did not test this
behavior so please point me to the logic which is protecting this.

2.
/*
 * Helper function to construct the TupleDesc needed by verify_heapam.
 */
static TupleDesc
verify_heapam_tupdesc(void)

From function name, it appeared that it is verifying tuple descriptor
but this is just creating the tuple descriptor.

3.
+ /* Optionally skip over all-frozen or all-visible blocks */
+ if (skip_all_frozen || skip_all_visible)
+ {
+ mapbits = (int32) visibilitymap_get_status(ctx.rel, ctx.blkno,
+    &vmbuffer);
+ if (skip_all_visible && (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
+ continue;
+ if (skip_all_frozen && (mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
+ continue;
+ }

Here, do we want to test that in VM the all visible bit is set whereas
on the page it is not set?  That can lead to a wrong result in an
index-only scan.

4. One cosmetic comment

+ /* Skip non-varlena values, but update offset first */
..
+
+ /* Ok, we're looking at a varlena attribute. */

Throughout the patch, I have noticed that some of your single-line
comments have "full stop" whereas others don't.  Can we keep them
consistent?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jun 28, 2020, at 9:05 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> Some more comments on v9_0001.
> 1.
> + LWLockAcquire(XidGenLock, LW_SHARED);
> + nextFullXid = ShmemVariableCache->nextFullXid;
> + ctx.oldestValidXid = ShmemVariableCache->oldestXid;
> + LWLockRelease(XidGenLock);
> + ctx.nextKnownValidXid = XidFromFullTransactionId(nextFullXid);
> ...
> ...
> +
> + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++)
> + {
> + int32 mapbits;
> + OffsetNumber maxoff;
> + PageHeader ph;
> +
> + /* Optionally skip over all-frozen or all-visible blocks */
> + if (skip_all_frozen || skip_all_visible)
> + {
> + mapbits = (int32) visibilitymap_get_status(ctx.rel, ctx.blkno,
> +    &vmbuffer);
> + if (skip_all_visible && (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
> + continue;
> + if (skip_all_frozen && (mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
> + continue;
> + }
> +
> + /* Read and lock the next page. */
> + ctx.buffer = ReadBufferExtended(ctx.rel, MAIN_FORKNUM, ctx.blkno,
> + RBM_NORMAL, ctx.bstrategy);
> + LockBuffer(ctx.buffer, BUFFER_LOCK_SHARE);
>
> I might be missing something, but it appears that first we are getting
> the nextFullXid and after that, we are scanning the block by block.
> So while we are scanning the block if the nextXid is advanced and it
> has updated some tuple in the heap pages,  then it seems the current
> logic will complain about out of range xid.  I did not test this
> behavior so please point me to the logic which is protecting this.

We know the oldest valid Xid cannot advance, because we hold a lock that would prevent it from doing so.  We cannot
know that the newest Xid will not advance, but when we see an Xid beyond the end of the known valid range, we check its
validity, and either report it as a corruption or advance our idea of the newest valid Xid, depending on that check.
That logic is in TransactionIdValidInRel.
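
In rough outline, the check amounts to something like this (a simplified sketch, not the exact code from the patch):

    /*
     * Sketch only: xids older than the known-valid minimum are corrupt;
     * xids at or beyond our cached maximum are rechecked against the
     * current nextFullXid before being reported.
     */
    if (TransactionIdPrecedes(xid, ctx->oldestValidXid))
        return false;                       /* precedes relfrozenxid/oldestXid: corrupt */

    if (TransactionIdFollowsOrEquals(xid, ctx->nextKnownValidXid))
    {
        TransactionId newest =
            XidFromFullTransactionId(ReadNextFullTransactionId());

        if (TransactionIdFollowsOrEquals(xid, newest))
            return false;                   /* still in the future: corrupt */
        ctx->nextKnownValidXid = newest;    /* advance the cached maximum */
    }
    return true;                            /* within the known valid range */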

> 2.
> /*
> * Helper function to construct the TupleDesc needed by verify_heapam.
> */
> static TupleDesc
> verify_heapam_tupdesc(void)
>
> From function name, it appeared that it is verifying tuple descriptor
> but this is just creating the tuple descriptor.

In amcheck--1.2--1.3.sql we define a function named verify_heapam which returns a set of records.  This is the tuple
descriptor for that function.  I understand that the name can be parsed as verify_(heapam_tupdesc), but it is meant as
(verify_heapam)_tupdesc.  Do you have a name you would prefer?

> 3.
> + /* Optionally skip over all-frozen or all-visible blocks */
> + if (skip_all_frozen || skip_all_visible)
> + {
> + mapbits = (int32) visibilitymap_get_status(ctx.rel, ctx.blkno,
> +    &vmbuffer);
> + if (skip_all_visible && (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
> + continue;
> + if (skip_all_frozen && (mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
> + continue;
> + }
>
> Here, do we want to test that in VM the all visible bit is set whereas
> on the page it is not set?  That can lead to a wrong result in an
> index-only scan.

If the caller has specified that the corruption check should skip over all-frozen or all-visible data, then we cannot
load the page that the VM claims is all-frozen or all-visible without defeating the purpose of the caller having
specified these options.  Without loading the page, we cannot check the page's header bits.

When not skipping all-visible or all-frozen blocks, we might like to pin both the heap page and the visibility map page
in order to compare the two, being careful not to hold a pin on the one while performing I/O on the other.  See for
example the logic in heap_delete().  But I'm not sure what guarantees the system makes about agreement between these two
bits.  Certainly, the VM should not claim a page is all visible when it isn't, but are we guaranteed that a page that is
all-visible will always have its all-visible bit set?  I don't know if (possibly transient) disagreement between these
two bits constitutes corruption.  Perhaps others following this thread can advise?

> 4. One cosmetic comment
>
> + /* Skip non-varlena values, but update offset first */
> ..
> +
> + /* Ok, we're looking at a varlena attribute. */
>
> Throughout the patch, I have noticed that some of your single-line
> comments have "full stop" whereas other don't.  Can we keep them
> consistent?

I try to use a "full stop" at the end of sentences, but not at the end of sentence fragments.  To me, a "full stop"
means that a sentence has reached its conclusion.  I don't intentionally use one at the end of a fragment, unless the
fragment precedes a full sentence, in which case the "full stop" is needed to separate the two.  Of course, I may have
violated my own rule in a few places, but before I submit a v10 patch with comment punctuation changes, perhaps we can
agree on what the rule is?  (This has probably been discussed before and agreed before.  A link to the appropriate email
thread would be sufficient.)

For example:

    /* red, green, or blue */
    /* set to pink */
    /* set to blue.  We have not closed the file. */
    /* At this point, we have chosen the color. */

The first comment is not a sentence, but the fourth is.  The third comment is a fragment followed by a full sentence,
and a "full stop" separates the two.  As for the second comment, as I recall, verb phrases can be interpreted as a full
sentence, as in "Close the door!", when they are meant as commands to the listener, but not otherwise.  "set to pink" is
not a command to the reader, but rather a description of what the code is doing at that point, so I think of it as a
mere verb phrase and not a full sentence.

Making matters even more complicated, portions of the logic in verify_heapam were taken from sections of code that
would ereport(), elog(), or Assert() on corruption, and when I took such code, I sometimes also took the comments in
unmodified form.  That means that my normal commenting rules don't apply, as I'm not the comment author in such cases.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Alvaro Herrera
Date:
I think there are two very large patches here.  One adds checking of
heapam tables to amcheck, and the other adds a binary that eases calling
amcheck from the command line.  I think these should be two separate
patches.

I don't know what to think of a module contrib/pg_amcheck.  I kinda lean
towards fitting it in src/bin/scripts rather than as a contrib module.
However, it seems a bit weird that it depends on a contrib module.
Maybe amcheck should not be a contrib module at all but rather a new
extension in src/extensions/ that is compiled and installed (in the
filesystem, not in databases) by default.

I strongly agree with hardening backend code so that all the crashes
that Mark has found can be repaired.  (We discussed this topic
before[1]: we'd repair all crashes when run with production code, not
all assertion crashes.)

[1] https://postgr.es/m/20200513221051.GA26592@alvherre.pgsql

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jun 30, 2020, at 11:44 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> I think there are two very large patches here.  One adds checking of
> heapam tables to amcheck, and the other adds a binary that eases calling
> amcheck from the command line.  I think these should be two separate
> patches.

contrib/amcheck has pretty limited regression test coverage.  I wrote pg_amcheck in large part because the
infrastructure I was writing for testing contrib/amcheck was starting to look like a stand-alone tool, so I made it one.
I can split contrib/pg_amcheck into a separate patch, but I would expect reviewers to use it to review contrib/amcheck.
Say the word, and I'll resubmit as two separate patches.

> I don't know what to think of a module contrib/pg_amcheck.  I kinda lean
> towards fitting it in src/bin/scripts rather than as a contrib module.
> However, it seems a bit weird that it depends on a contrib module.

Agreed.

> Maybe amcheck should not be a contrib module at all but rather a new
> extension in src/extensions/ that is compiled and installed (in the
> filesystem, not in databases) by default.

Fine with me, but I'll have to see what others think about that.

> I strongly agree with hardening backend code so that all the crashes
> that Mark has found can be repaired.  (We discussed this topic
> before[1]: we'd repair all crashes when run with production code, not
> all assertion crashes.)

I'm guessing that hardening the backend would be a separate patch?  Or did you want that as part of this one?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Alvaro Herrera
Date:
On 2020-Jun-30, Mark Dilger wrote:

> I'm guessing that hardening the backend would be a separate patch?  Or
> did you want that as part of this one?

Lately, to me the foremost criterion to determine what is a separate
patch and what isn't is the way the commit message is structured.  If it
looks too much like a bullet list of unrelated things, that suggests
that the commit should be split into one commit per bullet point; of
course, there are counterexamples.  But when I have a commit message
that says "I do A, and I also do B because I need it for A", then it
makes more sense to do B first standalone and then A on top.  OTOH if
two things are done because they're heavily intermixed (e.g. commit
850196b610d2, bullet points galore), that suggests that one commit is a
decent approach.

Just my opinion, of course.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: new heapcheck contrib module

From
Dilip Kumar
Date:
On Sun, Jun 28, 2020 at 11:18 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
>
>
> > On Jun 28, 2020, at 9:05 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
> >
> > Some more comments on v9_0001.
> > 1.
> > + LWLockAcquire(XidGenLock, LW_SHARED);
> > + nextFullXid = ShmemVariableCache->nextFullXid;
> > + ctx.oldestValidXid = ShmemVariableCache->oldestXid;
> > + LWLockRelease(XidGenLock);
> > + ctx.nextKnownValidXid = XidFromFullTransactionId(nextFullXid);
> > ...
> > ...
> > +
> > + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++)
> > + {
> > + int32 mapbits;
> > + OffsetNumber maxoff;
> > + PageHeader ph;
> > +
> > + /* Optionally skip over all-frozen or all-visible blocks */
> > + if (skip_all_frozen || skip_all_visible)
> > + {
> > + mapbits = (int32) visibilitymap_get_status(ctx.rel, ctx.blkno,
> > +    &vmbuffer);
> > + if (skip_all_visible && (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
> > + continue;
> > + if (skip_all_frozen && (mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
> > + continue;
> > + }
> > +
> > + /* Read and lock the next page. */
> > + ctx.buffer = ReadBufferExtended(ctx.rel, MAIN_FORKNUM, ctx.blkno,
> > + RBM_NORMAL, ctx.bstrategy);
> > + LockBuffer(ctx.buffer, BUFFER_LOCK_SHARE);
> >
> > I might be missing something, but it appears that first we are getting
> > the nextFullXid and after that, we are scanning the block by block.
> > So while we are scanning the block if the nextXid is advanced and it
> > has updated some tuple in the heap pages,  then it seems the current
> > logic will complain about out of range xid.  I did not test this
> > behavior so please point me to the logic which is protecting this.
>
> We know the oldest valid Xid cannot advance, because we hold a lock that would prevent it from doing so.  We cannot
> know that the newest Xid will not advance, but when we see an Xid beyond the end of the known valid range, we check its
> validity, and either report it as a corruption or advance our idea of the newest valid Xid, depending on that check.
> That logic is in TransactionIdValidInRel.

That makes sense to me.

>
> > 2.
> > /*
> > * Helper function to construct the TupleDesc needed by verify_heapam.
> > */
> > static TupleDesc
> > verify_heapam_tupdesc(void)
> >
> > From function name, it appeared that it is verifying tuple descriptor
> > but this is just creating the tuple descriptor.
>
> In amcheck--1.2--1.3.sql we define a function named verify_heapam which returns a set of records.  This is the tuple
> descriptor for that function.  I understand that the name can be parsed as verify_(heapam_tupdesc), but it is meant as
> (verify_heapam)_tupdesc.  Do you have a name you would prefer?

I'm not very particular, but a name like verify_heapam_get_tupdesc might be clearer.  It's just a suggestion,
though, so it's your choice; if you prefer the current name I have no objection.

>
> > 3.
> > + /* Optionally skip over all-frozen or all-visible blocks */
> > + if (skip_all_frozen || skip_all_visible)
> > + {
> > + mapbits = (int32) visibilitymap_get_status(ctx.rel, ctx.blkno,
> > +    &vmbuffer);
> > + if (skip_all_visible && (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
> > + continue;
> > + if (skip_all_frozen && (mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
> > + continue;
> > + }
> >
> > Here, do we want to test that in VM the all visible bit is set whereas
> > on the page it is not set?  That can lead to a wrong result in an
> > index-only scan.
>
> If the caller has specified that the corruption check should skip over all-frozen or all-visible data, then we cannot
> load the page that the VM claims is all-frozen or all-visible without defeating the purpose of the caller having
> specified these options.  Without loading the page, we cannot check the page's header bits.
>
> When not skipping all-visible or all-frozen blocks, we might like to pin both the heap page and the visibility map
> page in order to compare the two, being careful not to hold a pin on the one while performing I/O on the other.  See for
> example the logic in heap_delete().  But I'm not sure what guarantees the system makes about agreement between these two
> bits.  Certainly, the VM should not claim a page is all visible when it isn't, but are we guaranteed that a page that is
> all-visible will always have its all-visible bit set?  I don't know if (possibly transient) disagreement between these
> two bits constitutes corruption.  Perhaps others following this thread can advise?

Right, the VM should not claim a page is all visible when it actually is not.  But, IIRC, it is not guaranteed that
if the page is all visible then the VM must have the all-visible flag set.

> > 4. One cosmetic comment
> >
> > + /* Skip non-varlena values, but update offset first */
> > ..
> > +
> > + /* Ok, we're looking at a varlena attribute. */
> >
> > Throughout the patch, I have noticed that some of your single-line
> > comments have "full stop" whereas other don't.  Can we keep them
> > consistent?
>
> I try to use a "full stop" at the end of sentences, but not at the end of sentence fragments.  To me, a "full stop"
> means that a sentence has reached its conclusion.  I don't intentionally use one at the end of a fragment, unless the
> fragment precedes a full sentence, in which case the "full stop" is needed to separate the two.  Of course, I may have
> violated my own rule in a few places, but before I submit a v10 patch with comment punctuation changes, perhaps we can
> agree on what the rule is?  (This has probably been discussed before and agreed before.  A link to the appropriate email
> thread would be sufficient.)

I can see that different files follow different rules.  I am fine as long as they are consistent within a file.

> For example:
>
>         /* red, green, or blue */
>         /* set to pink */
>         /* set to blue.  We have not closed the file. */
>         /* At this point, we have chosen the color. */
>
> The first comment is not a sentence, but the fourth is.  The third comment is a fragment followed by a full sentence,
> and a "full stop" separates the two.  As for the second comment, as I recall, verb phrases can be interpreted as a full
> sentence, as in "Close the door!", when they are meant as commands to the listener, but not otherwise.  "set to pink" is
> not a command to the reader, but rather a description of what the code is doing at that point, so I think of it as a
> mere verb phrase and not a full sentence.

> Making matters even more complicated, portions of the logic in verify_heapam were taken from sections of code that
> would ereport(), elog(), or Assert() on corruption, and when I took such code, I sometimes also took the comments in
> unmodified form.  That means that my normal commenting rules don't apply, as I'm not the comment author in such cases.

I agree.

A few more comments.
1.

+ if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+ {
+ confess(ctx,
+ pstrdup("attribute is external but not marked as on disk"));
+ return true;
+ }
+
....
+
+ /*
+ * Must dereference indirect toast pointers before we can check them
+ */
+ if (VARATT_IS_EXTERNAL_INDIRECT(attr))
+ {


So first we are checking that if the varatt is not
VARATT_IS_EXTERNAL_ONDISK then we are returning,  but just a
few statements down we are checking if the varatt is
VARATT_IS_EXTERNAL_INDIRECT, so seems like unreachable code.

2. Another point related to the same code is that toast_save_datum
always set the VARTAG_ONDISK tag.  IIUC, we use
VARTAG_INDIRECT in reorderbuffer for generating temp tuple so ideally
while scanning the heap we should never get
VARATT_IS_EXTERNAL_INDIRECT tuple.  Am I missing something here?

3.
+ if (VARATT_IS_1B_E(tp + ctx->offset))
+ {
+ uint8 va_tag = va_tag = VARTAG_EXTERNAL(tp + ctx->offset);
+
+ if (va_tag != VARTAG_ONDISK)
+ {
+ confess(ctx, psprintf("unexpected TOAST vartag %u for "
+   "attribute #%u at t_hoff = %u, "
+   "offset = %u",
+   va_tag, ctx->attnum,
+   ctx->tuphdr->t_hoff, ctx->offset));
+ return false; /* We can't know where the next attribute
+ * begins */
+ }
+ }

+ /* Skip values that are not external */
+ if (!VARATT_IS_EXTERNAL(attr))
+ return true;
+
+ /* It is external, and we're looking at a page on disk */
+ if (!VARATT_IS_EXTERNAL_ONDISK(attr))
+ {
+ confess(ctx,
+ pstrdup("attribute is external but not marked as on disk"));
+ return true;
+ }

First, we are checking that if VARATT_IS_1B_E and if so we will check
whether its tag is VARTAG_ONDISK or not.  But just after that, we will
get the actual attribute pointer and
Again check the same thing with 2 different checks.  Can you explain
why this is necessary?

4.
+ if ((ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
+ (ctx->tuphdr->t_infomask2 & HEAP_KEYS_UPDATED))
+ {
+ confess(ctx,
+ psprintf("HEAP_XMAX_LOCK_ONLY and HEAP_KEYS_UPDATED both set"));
+ }
+ if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
+ (ctx->tuphdr->t_infomask & HEAP_XMAX_IS_MULTI))
+ {
+ confess(ctx,
+ psprintf("HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI both set"));
+ }

Maybe we can further expand these checks,  like if the tuple is
HEAP_XMAX_LOCK_ONLY then HEAP_UPDATED or HEAP_HOT_UPDATED should not
be set.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jul 4, 2020, at 6:04 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> A few more comments.

Your comments all pertain to function check_tuple_attribute(), which follows the logic of heap_deform_tuple() and
detoast_external_attr().  The idea is that any error that could result in an assertion or crash in those functions
should be checked carefully by check_tuple_attribute(), and checked *before* any such asserts or crashes might be
triggered.

I obviously did not explain this thinking in the function comment.  That is rectified in the v10 patch, attached.

> 1.
>
> + if (!VARATT_IS_EXTERNAL_ONDISK(attr))
> + {
> + confess(ctx,
> + pstrdup("attribute is external but not marked as on disk"));
> + return true;
> + }
> +
> ....
> +
> + /*
> + * Must dereference indirect toast pointers before we can check them
> + */
> + if (VARATT_IS_EXTERNAL_INDIRECT(attr))
> + {
>
>
> So first we are checking that if the varatt is not
> VARATT_IS_EXTERNAL_ONDISK then we are returning,  but just a
> few statements down we are checking if the varatt is
> VARATT_IS_EXTERNAL_INDIRECT, so seems like unreachable code.

True.  I've removed the VARATT_IS_EXTERNAL_INDIRECT check.


> 2. Another point related to the same code is that toast_save_datum
> always set the VARTAG_ONDISK tag.  IIUC, we use
> VARTAG_INDIRECT in reorderbuffer for generating temp tuple so ideally
> while scanning the heap we should never get
> VARATT_IS_EXTERNAL_INDIRECT tuple.  Am I missing something here?

I think you are right that we cannot get a VARATT_IS_EXTERNAL_INDIRECT tuple. That check is removed in v10.

> 3.
> + if (VARATT_IS_1B_E(tp + ctx->offset))
> + {
> + uint8 va_tag = va_tag = VARTAG_EXTERNAL(tp + ctx->offset);
> +
> + if (va_tag != VARTAG_ONDISK)
> + {
> + confess(ctx, psprintf("unexpected TOAST vartag %u for "
> +   "attribute #%u at t_hoff = %u, "
> +   "offset = %u",
> +   va_tag, ctx->attnum,
> +   ctx->tuphdr->t_hoff, ctx->offset));
> + return false; /* We can't know where the next attribute
> + * begins */
> + }
> + }
>
> + /* Skip values that are not external */
> + if (!VARATT_IS_EXTERNAL(attr))
> + return true;
> +
> + /* It is external, and we're looking at a page on disk */
> + if (!VARATT_IS_EXTERNAL_ONDISK(attr))
> + {
> + confess(ctx,
> + pstrdup("attribute is external but not marked as on disk"));
> + return true;
> + }
>
> First, we are checking that if VARATT_IS_1B_E and if so we will check
> whether its tag is VARTAG_ONDISK or not.  But just after that, we will
> get the actual attribute pointer and
> Again check the same thing with 2 different checks.  Can you explain
> why this is necessary?

The code that calls check_tuple_attribute() expects it to check the current attribute, but also to safely advance the
ctx->offset value to the next attribute, as the caller is iterating over all attributes.  The first check verifies that
it is safe to call att_addlength_pointer, as we must not call att_addlength_pointer on a corrupt datum.  The second
check simply returns on non-external attributes; having advanced ctx->offset, there is nothing left to do.  The third
check validates the external attribute, now that we know that it is external.  You are right that the third check
cannot fail, as the first check would already have confess()ed and returned false.  The third check is removed in v10,
attached.
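
For what it's worth, the intended ordering for a varlena attribute is roughly as follows (a simplified sketch;
check_toast_pointer_sketch() is just a placeholder name for the real toast-checking code):

    /* 1. If this is a one-byte-header external datum, the vartag must be one
     *    we understand, or the length macros below cannot be trusted. */
    if (VARATT_IS_1B_E(tp + ctx->offset) &&
        VARTAG_EXTERNAL(tp + ctx->offset) != VARTAG_ONDISK)
        return false;           /* cannot know where the next attribute begins */

    /* 2. Now it is safe to compute the length and advance the offset. */
    attr = (struct varlena *) (tp + ctx->offset);
    ctx->offset = att_addlength_pointer(ctx->offset, thisatt->attlen,
                                        tp + ctx->offset);

    /* 3. Only external values need any toast checking. */
    if (!VARATT_IS_EXTERNAL(attr))
        return true;

    /* 4. External, and step 1 already guarantees VARTAG_ONDISK, so check the
     *    toast pointer itself. */
    check_toast_pointer_sketch(ctx, attr);
    return true;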

> 4.
> + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
> + (ctx->tuphdr->t_infomask2 & HEAP_KEYS_UPDATED))
> + {
> + confess(ctx,
> + psprintf("HEAP_XMAX_LOCK_ONLY and HEAP_KEYS_UPDATED both set"));
> + }
> + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
> + (ctx->tuphdr->t_infomask & HEAP_XMAX_IS_MULTI))
> + {
> + confess(ctx,
> + psprintf("HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI both set"));
> + }
>
> Maybe we can further expand these checks,  like if the tuple is
> HEAP_XMAX_LOCK_ONLY then HEAP_UPDATED or HEAP_HOT_UPDATED should not
> be set.

With Asserts added in src/backend/access/heap/hio.c against those two conditions, the regression tests fail in quite a
lot of places where HEAP_XMAX_LOCK_ONLY and HEAP_UPDATED are both true.  I'm leaving this idea out for v10, since it
doesn't work, but in case you want to tell me what I did wrong, here are the changes I made on top of v10:

diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 00de10b7c9..76d23e141a 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -57,6 +57,10 @@ RelationPutHeapTuple(Relation relation,
                         (tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED)));
        Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_COMMITTED) &&
                         (tuple->t_data->t_infomask & HEAP_XMAX_IS_MULTI)));
+       Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
+                        (tuple->t_data->t_infomask & HEAP_UPDATED)));
+       Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
+                        (tuple->t_data->t_infomask2 & HEAP_HOT_UPDATED)));

diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c
index 49d3d5618a..60e4ad5be0 100644
--- a/contrib/amcheck/verify_heapam.c
+++ b/contrib/amcheck/verify_heapam.c
@@ -969,12 +969,19 @@ check_tuple(HeapCheckContext * ctx)
                                                          ctx->tuphdr->t_hoff));
                fatal = true;
        }
-       if ((ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY) &&
-               (ctx->tuphdr->t_infomask2 & HEAP_KEYS_UPDATED))
+       if (ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY)
        {
-               confess(ctx,
-                       psprintf("HEAP_XMAX_LOCK_ONLY and HEAP_KEYS_UPDATED both set"));
+               if (ctx->tuphdr->t_infomask2 & HEAP_KEYS_UPDATED)
+                       confess(ctx,
+                               psprintf("HEAP_XMAX_LOCK_ONLY and HEAP_KEYS_UPDATED both set"));
+               if (ctx->tuphdr->t_infomask & HEAP_UPDATED)
+                       confess(ctx,
+                               psprintf("HEAP_XMAX_LOCK_ONLY and HEAP_UPDATED both set"));
+               if (ctx->tuphdr->t_infomask2 & HEAP_HOT_UPDATED)
+                       confess(ctx,
+                               psprintf("HEAP_XMAX_LOCK_ONLY and HEAP_HOT_UPDATED both set"));
        }
+
        if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) &&
                (ctx->tuphdr->t_infomask & HEAP_XMAX_IS_MULTI))
        {


The v10 patch without these ideas is here:




—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Robert Haas
Date:
On Mon, Jul 6, 2020 at 2:06 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> The v10 patch without these ideas is here:

Along the lines of what Alvaro was saying before, I think this
definitely needs to be split up into a series of patches. The commit
message for v10 describes it doing three pretty separate things, and I
think that argues for splitting it into a series of three patches. I'd
argue for this ordering:

0001 Refactoring existing amcheck btree checking functions to optionally
return corruption information rather than ereport'ing it.  This is
used by the new pg_amcheck command line tool for reporting back to
the caller.

0002 Adding new function verify_heapam for checking a heap relation and
associated toast relation, if any, to contrib/amcheck.

0003 Adding new contrib module pg_amcheck, which is a command line
interface for running amcheck's verifications against tables and
indexes.

It's too hard to review things like this when it's all mixed together.

+++ b/contrib/amcheck/t/skipping.pl

The name of this file is inconsistent with the tree's usual
convention, which is all stuff like 001_whatever.pl, except for
src/test/modules/brin, which randomly decided to use two digits
instead of three. There's no precedent for a test file with no leading
numeric digits. Also, what does "skipping" even have to do with what
the test is checking? Maybe it's intended to refer to the new error
handling "skipping" the actual error in favor of just reporting it
without stopping, but that's not really what the word "skipping"
normally means. Finally, it seems a bit over-engineered: do we really
need 183 test cases to check that detecting a problem doesn't lead to
an abort? Like, if that's the purpose of the test, I'd expect it to
check one corrupt relation and one non-corrupt relation, each with and
without the no-error behavior. And that's about it. Or maybe it's
talking about skipping pages during the checks, because those pages
are all-visible or all-frozen? It's not very clear to me what's going
on here.

+ TransactionId nextKnownValidXid;
+ TransactionId oldestValidXid;

Please add explanatory comments indicating what these are intended to
mean. For most of the structure members, the brief comments
already present seem sufficient; but here, more explanation looks
necessary and less is provided. The "Values for returning tuples"
could possibly also use some more detail.

+#define HEAPCHECK_RELATION_COLS 8

I think this should really be at the top of the file someplace.
Sometimes people have adopted this style when the #define is only used
within the function that contains it, but that's not the case here.

+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("unrecognized parameter for 'skip': %s", skip),
+ errhint("please choose from 'all visible', 'all frozen', "
+ "or NULL")));

I think it would be better if we had three string values selecting the
different behaviors, and made the parameter NOT NULL but with a
default. It seems like that would be easier to understand. Right now,
I can tell that my options for what to skip are "all visible", "all
frozen", and, uh, some other thing that I don't know what it is. I'm
gonna guess the third option is to skip nothing, but it seems best to
make that explicit. Also, should we maybe consider spelling this
'all-visible' and 'all-frozen' with dashes, instead of using spaces?
Spaces in an option value seems a little icky to me somehow.

+ int64 startblock = -1;
+ int64 endblock = -1;
...
+ if (!PG_ARGISNULL(3))
+ startblock = PG_GETARG_INT64(3);
+ if (!PG_ARGISNULL(4))
+ endblock = PG_GETARG_INT64(4);
...
+ if (startblock < 0)
+ startblock = 0;
+ if (endblock < 0 || endblock > ctx.nblocks)
+ endblock = ctx.nblocks;
+
+ for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++)

So, the user can specify a negative value explicitly and it will be
treated as the default, and an endblock value that's larger than the
relation size will be treated as the relation size. The way pg_prewarm
does the corresponding checks seems superior: null indicates the
default value, and any non-null value must be within range or you get
an error. Also, you seem to be treating endblock as the first block
that should not be checked, whereas pg_prewarm takes what seems to me
to be the more natural interpretation: the end block is the last block
that IS checked. If you do it this way, then someone who specifies the
same start and end block will check no blocks -- silently, I think.

+               if (skip_all_frozen || skip_all_visible)

Since you can't skip all frozen without skipping all visible, this
test could be simplified. Or you could introduce a three-valued enum
and test that skip_pages != SKIP_PAGES_NONE, which might be even
better.

+ /* We must unlock the page from the prior iteration, if any */
+ Assert(ctx.blkno == InvalidBlockNumber || ctx.buffer != InvalidBuffer);

I don't understand this assertion, and I don't understand the comment,
either. I think ctx.blkno can never be equal to InvalidBlockNumber
because we never set it to anything outside the range of 0..(endblock
- 1), and I think ctx.buffer must always be unequal to InvalidBuffer
because we just initialized it by calling ReadBufferExtended(). So I
think this assertion would still pass if we wrote && rather than ||.
But even then, I don't know what that has to do with the comment or
why it even makes sense to have an assertion for that in the first
place.

+       /*
+        * Open the relation.  We use ShareUpdateExclusive to prevent concurrent
+        * vacuums from changing the relfrozenxid, relminmxid, or advancing the
+        * global oldestXid to be newer than those.  This protection saves us from
+        * having to reacquire the locks and recheck those minimums for every
+        * tuple, which would be expensive.
+        */
+       ctx.rel = relation_open(relid, ShareUpdateExclusiveLock);

I don't think we'd need to recheck for every tuple, would we? Just for
cases where there's an apparent violation of the rules. I guess that
could still be expensive if there's a lot of them, but needing
ShareUpdateExclusiveLock rather than only AccessShareLock is a little
unfortunate.

It's also unclear to me why this concerns itself with relfrozenxid and
the cluster-wide oldestXid value but not with datfrozenxid. It seems
like if we're going to sanity-check the relfrozenxid against the
cluster-wide value, we ought to also check it against the
database-wide value. Checking neither would also seem like a plausible
choice. But it seems very strange to only check against the
cluster-wide value.

+               StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber,
+                                                "InvalidOffsetNumber increments to FirstOffsetNumber");

If you are going to rely on this property, I agree that it is good to
check it. But it would be better to NOT rely on this property, and I
suspect the code can be written quite cleanly without relying on it.
And actually, that's what you did, because you first set ctx.offnum =
InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in
the loop initializer. So AFAICS the first initializer, and the static
assert, are pointless.

+                       if (ItemIdIsRedirected(ctx.itemid))
+                       {
+                               uint16 redirect = ItemIdGetRedirect(ctx.itemid);
+                               if (redirect <= SizeOfPageHeaderData || redirect >= ph->pd_lower)
...
+                               if ((redirect - SizeOfPageHeaderData) % sizeof(uint16))

I think that ItemIdGetRedirect() returns an offset, not a byte
position. So the expectation that I would have is that it would be any
integer >= 0 and <= maxoff. Am I confused? BTW, it seems like it might
be good to complain if the item to which it points is LP_UNUSED...
AFAIK that shouldn't happen.

+                                errmsg("\"%s\" is not a heap AM",

I think the correct wording would be just "is not a heap." The "heap
AM" is the thing in pg_am, not a specific table.

+confess(HeapCheckContext * ctx, char *msg)
+TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx)
+check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx)

This is what happens when you pgindent without adding all the right
things to typedefs.list first ... or when you don't pgindent and have
odd ideas about how to indent things.


+       /*
+        * In principle, there is nothing to prevent a scan over a large, highly
+        * corrupted table from using workmem worth of memory building up the
+        * tuplestore.  Don't leak the msg argument memory.
+        */
+       pfree(msg);

Maybe change the second sentence to something like: "That should be
OK, else the user can lower work_mem, but we'd better not leak any
additional memory."

+/*
+ * check_tuphdr_xids
+ *
+ *     Determine whether tuples are visible for verification.  Similar to
+ *  HeapTupleSatisfiesVacuum, but with critical differences.
+ *
+ *  1) Does not touch hint bits.  It seems imprudent to write hint bits
+ *     to a table during a corruption check.
+ *  2) Only makes a boolean determination of whether verification should
+ *     see the tuple, rather than doing extra work for vacuum-related
+ *     categorization.
+ *
+ *  The caller should already have checked that xmin and xmax are not out of
+ *  bounds for the relation.
+ */

First, check_tuphdr_xids() doesn't seem like a very good name. If you
have a function with that name and, like this one, it returns Boolean,
what does true mean? What does false mean? Kinda hard to tell. And
also, check the tuple header XIDs *for what*? If you called it, say,
tuple_is_visible(), that would be self-evident.

Second, consider that we hold at least AccessShareLock on the relation
- actually, ATM we hold ShareUpdateExclusiveLock. Either way, there
cannot be a concurrent modification to the tuple descriptor in
progress. Therefore, I think that only a HEAPTUPLE_DEAD tuple is
potentially using a non-current schema. If the tuple is
HEAPTUPLE_INSERT_IN_PROGRESS, there's either no ADD COLUMN in the
inserting transaction, or that transaction committed before we got our
lock. Similarly if it's HEAPTUPLE_DELETE_IN_PROGRESS or
HEAPTUPLE_RECENTLY_DEAD, the original inserter must've committed
before we got our lock. Or if it's both inserted and deleted in the
same transaction, say, then that transaction committed before we got
our lock or else contains no relevant DDL. IOW, I think you can check
everything but dead tuples here.

Capitalization and punctuation for messages complaining about problems
need to be consistent. verify_heapam() has "Invalid redirect line
pointer offset %u out of bounds" which starts with a capital letter,
but check_tuphdr_xids() has "heap tuple with XMAX_IS_MULTI is neither
LOCKED_ONLY nor has a valid xmax" which does not. I vote for lower
case, but in any event it should be the same. Also,
check_tuphdr_xids() has "tuple xvac = %u invalid" which is either a
debugging leftover or a very unclear complaint. I think some real work
needs to be put into the phrasing of these messages so that it's more
clear exactly what is going on and why it's bad. For example the first
example in this paragraph is clearly a problem of some kind, but it's
not very clear exactly what is happening: is %u the offset of the
invalid line redirect or the value to which it points? I don't think
the phrasing is very grammatical, which makes it hard to tell which is
meant, and I actually think it would be a good idea to include both
things.

Project policy is generally against splitting a string across multiple
lines to fit within 80 characters. We like to fit within 80
characters, but we like to be able to grep for strings more, and
breaking them up like this makes that harder.

+               confess(ctx,
+                               pstrdup("corrupt toast chunk va_header"));

This is another message that I don't think is very clear. There's two
elements to that. One is that the phrasing is not very good, and the
other is that there are no % escapes. What's somebody going to do when
they see this message? First, they're probably going to have to look
at the code to figure out in which circumstances it gets generated;
that's a sign that the message isn't phrased clearly enough. That will
tell them that an unexpected bit pattern has been found, but not what
that unexpected bit pattern actually was. So then, they're going to
have to try to find the relevant va_header by some other means and
fish out the relevant bit so that they can see what actually went
wrong.

+ *   Checks the current attribute as tracked in ctx for corruption.  Records
+ *   any corruption found in ctx->corruption.
+ *
+ *

Extra blank line.

+       Form_pg_attribute thisatt = TupleDescAttr(RelationGetDescr(ctx->rel),
+                                                          ctx->attnum);

Maybe you could avoid the line wrap by declaring this without
initializing it, and then initializing it as a separate statement.

+               confess(ctx, psprintf("t_hoff + offset > lp_len (%u + %u > %u)",
+                                                         ctx->tuphdr->t_hoff, ctx->offset,
+                                                         ctx->lp_len));

Uggh! This isn't even remotely an English sentence. I don't think
formulas are the way to go here, but I like the idea of formulas in
some places and written-out messages in others even less. I guess the
complaint here in English is something like "tuple attribute %d should
start at offset %u, but tuple length is only %u" or something of that
sort. Also, it seems like this complaint really ought to have been
reported on the *preceding* loop iteration, either complaining that
(1) the fixed length attribute is more than the number of remaining
bytes in the tuple or (2) the varlena header for the tuple specifies
an excessively high length. It seems like you're blaming the wrong
attribute for the problem.

BTW, the header comments for this function (check_tuple_attribute)
neglect to document the meaning of the return value.

+                       confess(ctx, psprintf("tuple xmax = %u precedes relation "
+                                                                 "relfrozenxid = %u",

This is another example of these messages needing  work. The
corresponding message from heap_prepare_freeze_tuple() is "found
update xid %u from before relfrozenxid %u". That's better, because we
don't normally include equals signs in our messages like this, and
also because "relation relfrozenxid" is redundant. I think this should
say something like "tuple xmax %u precedes relfrozenxid %u".

+                       confess(ctx, psprintf("tuple xmax = %u is in the future",
+                                                                 xmax));

And then this could be something like "tuple xmax %u follows
last-assigned xid %u". That would be more symmetric and more
informative.

+               if (SizeofHeapTupleHeader + BITMAPLEN(ctx->natts) > ctx->tuphdr->t_hoff)

I think we should be able to predict the exact value of t_hoff and
complain if it isn't precisely equal to the expected value. Or is that
not possible for some reason?
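
Something along these lines is what I have in mind (a rough sketch only, ignoring the pre-v12 HEAP_HASOID case):

    int     expected_hoff = SizeofHeapTupleHeader;

    if (ctx->tuphdr->t_infomask & HEAP_HASNULL)
        expected_hoff += BITMAPLEN(HeapTupleHeaderGetNatts(ctx->tuphdr));
    expected_hoff = MAXALIGN(expected_hoff);

    if (ctx->tuphdr->t_hoff != expected_hoff)
        confess(ctx, psprintf("t_hoff %d differs from expected header length %d",
                              (int) ctx->tuphdr->t_hoff, expected_hoff));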

Is there some place that's checking that lp_len >=
SizeOfHeapTupleHeader before check_tuple() goes and starts poking into
the header? If not, there should be.

+$node->command_ok(

+       [
+               'pg_amcheck', '-p', $port, 'postgres'
+       ],
+       'pg_amcheck all schemas and tables implicitly');
+
+$node->command_ok(
+       [
+               'pg_amcheck', '-i', '-p', $port, 'postgres'
+       ],
+       'pg_amcheck all schemas, tables and indexes');

I haven't really looked through the btree-checking and pg_amcheck
parts of this much yet, but this caught my eye. Why would the default
be to check tables but not indexes? I think the default ought to be to
check everything we know how to check.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Michael Paquier
Date:
On Thu, May 14, 2020 at 03:50:52PM -0400, Tom Lane wrote:
> I think there's definitely value in corrupting data in some predictable
> (reproducible) way and verifying that the check code catches it and
> responds as expected.  Sure, this will not be 100% coverage, but it'll be
> a lot better than 0% coverage.

Skimming quickly through the patch, that's what is done in a way
similar to pg_checksums's 002_actions.pl.  So it seems fine to me to
use something like that for some basic coverage.  We may want to
refactor the test APIs to unify all that though.
--
Michael

Attachment

Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jul 16, 2020, at 12:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jul 6, 2020 at 2:06 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>> The v10 patch without these ideas is here:
>
> Along the lines of what Alvaro was saying before, I think this
> definitely needs to be split up into a series of patches. The commit
> message for v10 describes it doing three pretty separate things, and I
> think that argues for splitting it into a series of three patches. I'd
> argue for this ordering:
>
> 0001 Refactoring existing amcheck btree checking functions to optionally
> return corruption information rather than ereport'ing it.  This is
> used by the new pg_amcheck command line tool for reporting back to
> the caller.
>
> 0002 Adding new function verify_heapam for checking a heap relation and
> associated toast relation, if any, to contrib/amcheck.
>
> 0003 Adding new contrib module pg_amcheck, which is a command line
> interface for running amcheck's verifications against tables and
> indexes.
>
> It's too hard to review things like this when it's all mixed together.

The v11 patch series is broken up as you suggest.

> +++ b/contrib/amcheck/t/skipping.pl
>
> The name of this file is inconsistent with the tree's usual
> convention, which is all stuff like 001_whatever.pl, except for
> src/test/modules/brin, which randomly decided to use two digits
> instead of three. There's no precedent for a test file with no leading
> numeric digits. Also, what does "skipping" even have to do with what
> the test is checking? Maybe it's intended to refer to the new error
> handling "skipping" the actual error in favor of just reporting it
> without stopping, but that's not really what the word "skipping"
> normally means. Finally, it seems a bit over-engineered: do we really
> need 183 test cases to check that detecting a problem doesn't lead to
> an abort? Like, if that's the purpose of the test, I'd expect it to
> check one corrupt relation and one non-corrupt relation, each with and
> without the no-error behavior. And that's about it. Or maybe it's
> talking about skipping pages during the checks, because those pages
> are all-visible or all-frozen? It's not very clear to me what's going
> on here.

The "skipping" did originally refer to testing verify_heapam()'s option to skip all-visible or all-frozen blocks.  I
have renamed it 001_verify_heapam.pl, since it tests that function.

>
> + TransactionId nextKnownValidXid;
> + TransactionId oldestValidXid;
>
> Please add explanatory comments indicating what these are intended to
> mean.

Done.

> For most of the structure members, the brief comments
> already present seem sufficient; but here, more explanation looks
> necessary and less is provided. The "Values for returning tuples"
> could possibly also use some more detail.

Ok, I've expanded the comments for these.

> +#define HEAPCHECK_RELATION_COLS 8
>
> I think this should really be at the top of the file someplace.
> Sometimes people have adopted this style when the #define is only used
> within the function that contains it, but that's not the case here.

Done.

>
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("unrecognized parameter for 'skip': %s", skip),
> + errhint("please choose from 'all visible', 'all frozen', "
> + "or NULL")));
>
> I think it would be better if we had three string values selecting the
> different behaviors, and made the parameter NOT NULL but with a
> default. It seems like that would be easier to understand. Right now,
> I can tell that my options for what to skip are "all visible", "all
> frozen", and, uh, some other thing that I don't know what it is. I'm
> gonna guess the third option is to skip nothing, but it seems best to
> make that explicit. Also, should we maybe consider spelling this
> 'all-visible' and 'all-frozen' with dashes, instead of using spaces?
> Spaces in an option value seems a little icky to me somehow.

I've made the options 'all-visible', 'all-frozen', and 'none'.  It defaults to 'none'.  I did not mark the function as
strict, as I think NULL is a reasonable value (and the default) for startblock and endblock.

> + int64 startblock = -1;
> + int64 endblock = -1;
> ...
> + if (!PG_ARGISNULL(3))
> + startblock = PG_GETARG_INT64(3);
> + if (!PG_ARGISNULL(4))
> + endblock = PG_GETARG_INT64(4);
> ...
> + if (startblock < 0)
> + startblock = 0;
> + if (endblock < 0 || endblock > ctx.nblocks)
> + endblock = ctx.nblocks;
> +
> + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++)
>
> So, the user can specify a negative value explicitly and it will be
> treated as the default, and an endblock value that's larger than the
> relation size will be treated as the relation size. The way pg_prewarm
> does the corresponding checks seems superior: null indicates the
> default value, and any non-null value must be within range or you get
> an error. Also, you seem to be treating endblock as the first block
> that should not be checked, whereas pg_prewarm takes what seems to me
> to be the more natural interpretation: the end block is the last block
> that IS checked. If you do it this way, then someone who specifies the
> same start and end block will check no blocks -- silently, I think.

Under that regime, for relations with one block of data, (startblock=0, endblock=0) means "check the zero'th block",
and for relations with no blocks of data, specifying any non-null (startblock,endblock) pair raises an exception.  I
don't like that too much, but I'm happy to defer to precedent.  Since you say pg_prewarm works this way (I did not
check), I have changed verify_heapam to do likewise.
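
The validation now looks roughly like this (a sketch only; the actual messages differ, and the zero-block case is
handled before this point):

    BlockNumber first_block = 0;
    BlockNumber last_block = ctx.nblocks - 1;   /* assumes ctx.nblocks > 0 */

    if (!PG_ARGISNULL(3))
    {
        int64       startblock = PG_GETARG_INT64(3);

        if (startblock < 0 || startblock >= (int64) ctx.nblocks)
            ereport(ERROR,
                    (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                     errmsg("starting block number must be between 0 and %u",
                            ctx.nblocks - 1)));
        first_block = (BlockNumber) startblock;
    }
    /* ... the same pattern for endblock, which is the last block checked ... */

    for (ctx.blkno = first_block; ctx.blkno <= last_block; ctx.blkno++)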

> +               if (skip_all_frozen || skip_all_visible)
>
> Since you can't skip all frozen without skipping all visible, this
> test could be simplified. Or you could introduce a three-valued enum
> and test that skip_pages != SKIP_PAGES_NONE, which might be even
> better.

It works now with a three-valued enum.
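
Roughly like this (a sketch; the names in the actual patch may differ):

    typedef enum SkipPages
    {
        SKIP_PAGES_NONE,
        SKIP_PAGES_ALL_VISIBLE,
        SKIP_PAGES_ALL_FROZEN
    } SkipPages;

    ...

    if (skip_option != SKIP_PAGES_NONE)
    {
        mapbits = (int32) visibilitymap_get_status(ctx.rel, ctx.blkno,
                                                   &vmbuffer);
        if (skip_option == SKIP_PAGES_ALL_VISIBLE &&
            (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
            continue;
        if (skip_option == SKIP_PAGES_ALL_FROZEN &&
            (mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
            continue;
    }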

> + /* We must unlock the page from the prior iteration, if any */
> + Assert(ctx.blkno == InvalidBlockNumber || ctx.buffer != InvalidBuffer);
>
> I don't understand this assertion, and I don't understand the comment,
> either. I think ctx.blkno can never be equal to InvalidBlockNumber
> because we never set it to anything outside the range of 0..(endblock
> - 1), and I think ctx.buffer must always be unequal to InvalidBuffer
> because we just initialized it by calling ReadBufferExtended(). So I
> think this assertion would still pass if we wrote && rather than ||.
> But even then, I don't know what that has to do with the comment or
> why it even makes sense to have an assertion for that in the first
> place.

Yes, it is vestigial.  Removed.

> +       /*
> +        * Open the relation.  We use ShareUpdateExclusive to prevent concurrent
> +        * vacuums from changing the relfrozenxid, relminmxid, or advancing the
> +        * global oldestXid to be newer than those.  This protection saves us from
> +        * having to reacquire the locks and recheck those minimums for every
> +        * tuple, which would be expensive.
> +        */
> +       ctx.rel = relation_open(relid, ShareUpdateExclusiveLock);
>
> I don't think we'd need to recheck for every tuple, would we? Just for
> cases where there's an apparent violation of the rules.

It's a bit fuzzy what an "apparent violation" might be if both ends of the range of valid xids may be moving, and
arbitrarily much.  It's also not clear how often to recheck, since you'd be dealing with a race condition no matter how
often you check.  Perhaps the comments shouldn't mention how often you'd have to recheck, since there is no really
defensible choice for that.  I removed the offending sentence.

> I guess that
> could still be expensive if there's a lot of them, but needing
> ShareUpdateExclusiveLock rather than only AccessShareLock is a little
> unfortunate.

I welcome strategies that would allow for taking a lesser lock.

> It's also unclear to me why this concerns itself with relfrozenxid and
> the cluster-wide oldestXid value but not with datfrozenxid. It seems
> like if we're going to sanity-check the relfrozenxid against the
> cluster-wide value, we ought to also check it against the
> database-wide value. Checking neither would also seem like a plausible
> choice. But it seems very strange to only check against the
> cluster-wide value.

If the relation has a normal relfrozenxid, then the oldest valid xid we can encounter in the table is relfrozenxid.
Otherwise, each row needs to be compared against some other minimum xid value.

Logically, that other minimum xid value should be the oldest valid xid for the database, which must logically be at
least as old as any valid row in the table and no older than the oldest valid xid for the cluster.

Unfortunately, if the comments in commands/vacuum.c circa line 1572 can be believed, and if I am reading them
correctly, the stored value for the oldest valid xid in the database has been known to be corrupted by bugs in
pg_upgrade.  This is awful.  If I compare the xid of a row in a table against the oldest xid value for the database, and
the xid of the row is older, what can I do?  I don't have a principled basis for determining which one of them is wrong.
 

The logic in verify_heapam is conservative; it makes no guarantees about finding and reporting all corruption, but if
it does report a row as corrupt, you can bank on that, bugs in verify_heapam itself notwithstanding.  I think this is a
good choice; a tool with only false negatives is much more useful than one with both false positives and false
negatives.

I have added a comment about my reasoning to verify_heapam.c.  I'm happy to be convinced of a better strategy for
handling this situation.

>
> +               StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber,
> +                                                "InvalidOffsetNumber increments to FirstOffsetNumber");
>
> If you are going to rely on this property, I agree that it is good to
> check it. But it would be better to NOT rely on this property, and I
> suspect the code can be written quite cleanly without relying on it.
> And actually, that's what you did, because you first set ctx.offnum =
> InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in
> the loop initializer. So AFAICS the first initializer, and the static
> assert, are pointless.

Ah, right you are.  Removed.

>
> +                       if (ItemIdIsRedirected(ctx.itemid))
> +                       {
> +                               uint16 redirect = ItemIdGetRedirect(ctx.itemid);
> +                               if (redirect <= SizeOfPageHeaderData || redirect >= ph->pd_lower)
> ...
> +                               if ((redirect - SizeOfPageHeaderData) % sizeof(uint16))
>
> I think that ItemIdGetRedirect() returns an offset, not a byte
> position. So the expectation that I would have is that it would be any
> integer >= 0 and <= maxoff. Am I confused?

I think you are right about it returning an offset, which should be between FirstOffsetNumber and maxoff, inclusive.  I
have updated the checks.
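
The updated check looks roughly like this (a simplified sketch; variable names and how the page is obtained differ a
bit in the actual patch):

    Page        page = BufferGetPage(ctx.buffer);

    if (ItemIdIsRedirected(ctx.itemid))
    {
        OffsetNumber rdoffnum = ItemIdGetRedirect(ctx.itemid);

        if (rdoffnum < FirstOffsetNumber || rdoffnum > maxoff)
            confess(&ctx,
                    psprintf("line pointer redirection to item at offset number %u is outside valid bounds %u .. %u",
                             (unsigned) rdoffnum, (unsigned) FirstOffsetNumber,
                             (unsigned) maxoff));
        else if (!ItemIdIsUsed(PageGetItemId(page, rdoffnum)))
            confess(&ctx,
                    psprintf("line pointer redirection to unused item at offset number %u",
                             (unsigned) rdoffnum));
        continue;           /* nothing more to check for a redirect */
    }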

> BTW, it seems like it might
> be good to complain if the item to which it points is LP_UNUSED...
> AFAIK that shouldn't happen.

Thanks for mentioning that.  It now checks for that.

> +                                errmsg("\"%s\" is not a heap AM",
>
> I think the correct wording would be just "is not a heap." The "heap
> AM" is the thing in pg_am, not a specific table.

Fixed.

> +confess(HeapCheckContext * ctx, char *msg)
> +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx)
> +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
>
> This is what happens when you pgindent without adding all the right
> things to typedefs.list first ... or when you don't pgindent and have
> odd ideas about how to indent things.

Hmm.  I don't see the three lines of code you are quoting.  Which patch is that from?

>
> +       /*
> +        * In principle, there is nothing to prevent a scan over a large, highly
> +        * corrupted table from using workmem worth of memory building up the
> +        * tuplestore.  Don't leak the msg argument memory.
> +        */
> +       pfree(msg);
>
> Maybe change the second sentence to something like: "That should be
> OK, else the user can lower work_mem, but we'd better not leak any
> additional memory."

It may be a little wordy, but I went with

    /*
     * In principle, there is nothing to prevent a scan over a large, highly
     * corrupted table from using workmem worth of memory building up the
     * tuplestore.  That's ok, but if we also leak the msg argument memory
     * until the end of the query, we could exceed workmem by more than a
     * trivial amount.  Therefore, free the msg argument each time we are
     * called rather than waiting for our current memory context to be freed.
     */

> +/*
> + * check_tuphdr_xids
> + *
> + *     Determine whether tuples are visible for verification.  Similar to
> + *  HeapTupleSatisfiesVacuum, but with critical differences.
> + *
> + *  1) Does not touch hint bits.  It seems imprudent to write hint bits
> + *     to a table during a corruption check.
> + *  2) Only makes a boolean determination of whether verification should
> + *     see the tuple, rather than doing extra work for vacuum-related
> + *     categorization.
> + *
> + *  The caller should already have checked that xmin and xmax are not out of
> + *  bounds for the relation.
> + */
>
> First, check_tuphdr_xids() doesn't seem like a very good name. If you
> have a function with that name and, like this one, it returns Boolean,
> what does true mean? What does false mean? Kinda hard to tell. And
> also, check the tuple header XIDs *for what*? If you called it, say,
> tuple_is_visible(), that would be self-evident.

Changed.

> Second, consider that we hold at least AccessShareLock on the relation
> - actually, ATM we hold ShareUpdateExclusiveLock. Either way, there
> cannot be a concurrent modification to the tuple descriptor in
> progress. Therefore, I think that only a HEAPTUPLE_DEAD tuple is
> potentially using a non-current schema. If the tuple is
> HEAPTUPLE_INSERT_IN_PROGRESS, there's either no ADD COLUMN in the
> inserting transaction, or that transaction committed before we got our
> lock. Similarly if it's HEAPTUPLE_DELETE_IN_PROGRESS or
> HEAPTUPLE_RECENTLY_DEAD, the original inserter must've committed
> before we got our lock. Or if it's both inserted and deleted in the
> same transaction, say, then that transaction committed before we got
> our lock or else contains no relevant DDL. IOW, I think you can check
> everything but dead tuples here.

Ok, I have changed tuple_is_visible to return true rather than false for those other cases.

> Capitalization and punctuation for messages complaining about problems
> need to be consistent. verify_heapam() has "Invalid redirect line
> pointer offset %u out of bounds" which starts with a capital letter,
> but check_tuphdr_xids() has "heap tuple with XMAX_IS_MULTI is neither
> LOCKED_ONLY nor has a valid xmax" which does not. I vote for lower
> case, but in any event it should be the same.

I standardized on all lowercase text, though I left embedded symbols and constants such as LOCKED_ONLY alone.

> Also,
> check_tuphdr_xids() has "tuple xvac = %u invalid" which is either a
> debugging leftover or a very unclear complaint.

Right.  That has been changed to "old-style VACUUM FULL transaction ID %u is invalid in this relation".
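
As a sketch of where that message would be raised (the surrounding condition and the TransactionIdValidInRel()
helper follow the discussion above; treat the exact code as an assumption):

    xvac = HeapTupleHeaderGetXvac(tuphdr);
    if ((tuphdr->t_infomask & HEAP_MOVED) && !TransactionIdValidInRel(xvac, ctx))
        confess(ctx, psprintf("old-style VACUUM FULL transaction ID %u is invalid in this relation",
                              xvac));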

> I think some real work
> needs to be put into the phrasing of these messages so that it's more
> clear exactly what is going on and why it's bad. For example the first
> example in this paragraph is clearly a problem of some kind, but it's
> not very clear exactly what is happening: is %u the offset of the
> invalid line redirect or the value to which it points? I don't think
> the phrasing is very grammatical, which makes it hard to tell which is
> meant, and I actually think it would be a good idea to include both
> things.

Beware that every row returned from amcheck has more fields than just the error message.

    blkno OUT bigint,
    offnum OUT integer,
    lp_off OUT smallint,
    lp_flags OUT smallint,
    lp_len OUT smallint,
    attnum OUT integer,
    chunk OUT integer,
    msg OUT text

Rather than including blkno, offnum, lp_off, lp_flags, lp_len, attnum, or chunk in the message, it would be better to
remove these things from messages that include them.  For the specific message under consideration, I've converted the
text to "line pointer redirection to item at offset number %u is outside valid bounds %u .. %u".  That avoids
duplicating the offset information of the referring item, while reporting the offset of the referred item.
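
In context, the reworded check looks roughly like this (maxoff stands for the page's PageGetMaxOffsetNumber() result;
this is a sketch, not the literal patch hunk):

    if (ItemIdIsRedirected(ctx.itemid))
    {
        OffsetNumber rdoffnum = ItemIdGetRedirect(ctx.itemid);

        if (rdoffnum < FirstOffsetNumber || rdoffnum > maxoff)
            confess(&ctx,
                    psprintf("line pointer redirection to item at offset number %u is outside valid bounds %u .. %u",
                             (unsigned) rdoffnum,
                             (unsigned) FirstOffsetNumber,
                             (unsigned) maxoff));
    }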

> Project policy is generally against splitting a string across multiple
> lines to fit within 80 characters. We like to fit within 80
> characters, but we like to be able to grep for strings more, and
> breaking them up like this makes that harder.

Thanks for clarifying the project policy.  I joined these message strings back together.

> +               confess(ctx,
> +                               pstrdup("corrupt toast chunk va_header"));
>
> This is another message that I don't think is very clear. There's two
> elements to that. One is that the phrasing is not very good, and the
> other is that there are no % escapes

Changed to "corrupt extended toast chunk with sequence number %d has invalid varlena header %0x".  I think all the
otherinformation about where the corruption was found is already present in the other returned columns. 

> What's somebody going to do when
> they see this message? First, they're probably going to have to look
> at the code to figure out in which circumstances it gets generated;
> that's a sign that the message isn't phrased clearly enough. That will
> tell them that an unexpected bit pattern has been found, but not what
> that unexpected bit pattern actually was. So then, they're going to
> have to try to find the relevant va_header by some other means and
> fish out the relevant bit so that they can see what actually went
> wrong.

Right.

>
> + *   Checks the current attribute as tracked in ctx for corruption.  Records
> + *   any corruption found in ctx->corruption.
> + *
> + *
>
> Extra blank line.

Fixed.

> +       Form_pg_attribute thisatt = TupleDescAttr(RelationGetDescr(ctx->rel),
> +
>                   ctx->attnum);
>
> Maybe you could avoid the line wrap by declaring this without
> initializing it, and then initializing it as a separate statement.

Yes, I like that better.  I did not need to do the same with infomask, but it looks better to me to break the
declaration and initialization for both, so I did that.

>
> +               confess(ctx, psprintf("t_hoff + offset > lp_len (%u + %u > %u)",
> +
> ctx->tuphdr->t_hoff, ctx->offset,
> +                                                         ctx->lp_len));
>
> Uggh! This isn't even remotely an English sentence. I don't think
> formulas are the way to go here, but I like the idea of formulas in
> some places and written-out messages in others even less. I guess the
> complaint here in English is something like "tuple attribute %d should
> start at offset %u, but tuple length is only %u" or something of that
> sort. Also, it seems like this complaint really ought to have been
> reported on the *preceding* loop iteration, either complaining that
> (1) the fixed length attribute is more than the number of remaining
> bytes in the tuple or (2) the varlena header for the tuple specifies
> an excessively high length. It seems like you're blaming the wrong
> attribute for the problem.

Yeah, and it wouldn't complain if the final attribute of a tuple was overlong, as there wouldn't be a next attribute to
blame it on.  I've changed it to report as you suggest, although it also still complains if the first attribute starts
outside the bounds of the tuple.  The two error messages now read as "tuple attribute should start at offset %u, but
tuple length is only %u" and "tuple attribute of length %u ends at offset %u, but tuple length is only %u".
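
A sketch of how those two complaints might be raised in check_tuple_attribute() (field names such as ctx->offset and
the computed attlen are assumptions for the example):

    /* the attribute must start within the tuple */
    if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
    {
        confess(ctx, psprintf("tuple attribute should start at offset %u, but tuple length is only %u",
                              ctx->tuphdr->t_hoff + ctx->offset, ctx->lp_len));
        return false;
    }

    /* ... after computing the attribute's length, attlen ... */
    if (ctx->tuphdr->t_hoff + ctx->offset + attlen > ctx->lp_len)
    {
        confess(ctx, psprintf("tuple attribute of length %u ends at offset %u, but tuple length is only %u",
                              (unsigned) attlen,
                              ctx->tuphdr->t_hoff + ctx->offset + attlen,
                              ctx->lp_len));
        return false;
    }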

> BTW, the header comments for this function (check_tuple_attribute)
> neglect to document the meaning of the return value.

Fixed.

> +                       confess(ctx, psprintf("tuple xmax = %u
> precedes relation "
> +
> "relfrozenxid = %u",
>
> This is another example of these messages needing  work. The
> corresponding message from heap_prepare_freeze_tuple() is "found
> update xid %u from before relfrozenxid %u". That's better, because we
> don't normally include equals signs in our messages like this, and
> also because "relation relfrozenxid" is redundant. I think this should
> say something like "tuple xmax %u precedes relfrozenxid %u".
>
> +                       confess(ctx, psprintf("tuple xmax = %u is in
> the future",
> +                                                                 xmax));
>
> And then this could be something like "tuple xmax %u follows
> last-assigned xid %u". That would be more symmetric and more
> informative.

Both of these have been changed.
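
For reference, the revised checks might read along these lines (ctx->relfrozenxid and ctx->nextKnownValidXid are
assumed context fields; the comparisons are illustrative rather than quoted from the patch):

    if (TransactionIdIsNormal(xmax) && TransactionIdPrecedes(xmax, ctx->relfrozenxid))
        confess(ctx, psprintf("tuple xmax %u precedes relfrozenxid %u",
                              xmax, ctx->relfrozenxid));
    else if (TransactionIdFollowsOrEquals(xmax, ctx->nextKnownValidXid))
        confess(ctx, psprintf("tuple xmax %u follows last-assigned xid %u",
                              xmax, ctx->nextKnownValidXid));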

> +               if (SizeofHeapTupleHeader + BITMAPLEN(ctx->natts) >
> ctx->tuphdr->t_hoff)
>
> I think we should be able to predict the exact value of t_hoff and
> complain if it isn't precisely equal to the expected value. Or is that
> not possible for some reason?

That is possible, and I've updated the error message to match.  There are cases where you can't know if the
HEAP_HASNULL bit is wrong or if the t_hoff value is wrong, but I've changed the code to just compute the length based on
the HEAP_HASNULL setting and use that as the expected value, and complain when the actual value does not match the
expected value.  That sidesteps the problem of not knowing exactly which value to blame.
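
The expected-value computation is small; a sketch (the message wording here is hypothetical):

    uint8       expected_hoff;

    if (ctx->tuphdr->t_infomask & HEAP_HASNULL)
        expected_hoff = MAXALIGN(SizeofHeapTupleHeader + BITMAPLEN(ctx->natts));
    else
        expected_hoff = MAXALIGN(SizeofHeapTupleHeader);

    if (ctx->tuphdr->t_hoff != expected_hoff)
        confess(ctx, psprintf("tuple data begins at offset %u, but expected offset is %u",
                              ctx->tuphdr->t_hoff, expected_hoff));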

> Is there some place that's checking that lp_len >=
> SizeOfHeapTupleHeader before check_tuple() goes and starts poking into
> the header? If not, there should be.

Good catch.  check_tuple() now does that before reading the header.
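
For example, a minimal guard near the top of check_tuple() could look like this (the message text is illustrative):

    if (ctx->lp_len < SizeofHeapTupleHeader)
    {
        confess(ctx, psprintf("line pointer length %u is less than the minimum tuple header size %u",
                              ctx->lp_len,
                              (unsigned) SizeofHeapTupleHeader));
        return;                 /* not safe to examine the header */
    }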

> +$node->command_ok(
>
> +       [
> +               'pg_amcheck', '-p', $port, 'postgres'
> +       ],
> +       'pg_amcheck all schemas and tables implicitly');
> +
> +$node->command_ok(
> +       [
> +               'pg_amcheck', '-i', '-p', $port, 'postgres'
> +       ],
> +       'pg_amcheck all schemas, tables and indexes');
>
> I haven't really looked through the btree-checking and pg_amcheck
> parts of this much yet, but this caught my eye. Why would the default
> be to check tables but not indexes? I think the default ought to be to
> check everything we know how to check.

I have changed the default to match your expectations.



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Amul Sul
Date:
Hi Mark,

I think new structures should be listed in src/tools/pgindent/typedefs.list,
otherwise, pgindent might disturb its indentation.

Regards,
Amul




Re: new heapcheck contrib module

From
Amul Sul
Date:
On Tue, Jul 21, 2020 at 10:58 AM Amul Sul <sulamul@gmail.com> wrote:
>
> Hi Mark,
>
> I think new structures should be listed in src/tools/pgindent/typedefs.list,
> otherwise, pgindent might disturb its indentation.
>
> Regards,
> Amul
>
>
> On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
> >
> >
> >
> > > On Jul 16, 2020, at 12:38 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > >
> > > On Mon, Jul 6, 2020 at 2:06 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> > >> The v10 patch without these ideas is here:
> > >
> > > Along the lines of what Alvaro was saying before, I think this
> > > definitely needs to be split up into a series of patches. The commit
> > > message for v10 describes it doing three pretty separate things, and I
> > > think that argues for splitting it into a series of three patches. I'd
> > > argue for this ordering:
> > >
> > > 0001 Refactoring existing amcheck btree checking functions to optionally
> > > return corruption information rather than ereport'ing it.  This is
> > > used by the new pg_amcheck command line tool for reporting back to
> > > the caller.
> > >
> > > 0002 Adding new function verify_heapam for checking a heap relation and
> > > associated toast relation, if any, to contrib/amcheck.
> > >
> > > 0003 Adding new contrib module pg_amcheck, which is a command line
> > > interface for running amcheck's verifications against tables and
> > > indexes.
> > >
> > > It's too hard to review things like this when it's all mixed together.
> >
> > The v11 patch series is broken up as you suggest.
> >
> > > +++ b/contrib/amcheck/t/skipping.pl
> > >
> > > The name of this file is inconsistent with the tree's usual
> > > convention, which is all stuff like 001_whatever.pl, except for
> > > src/test/modules/brin, which randomly decided to use two digits
> > > instead of three. There's no precedent for a test file with no leading
> > > numeric digits. Also, what does "skipping" even have to do with what
> > > the test is checking? Maybe it's intended to refer to the new error
> > > handling "skipping" the actual error in favor of just reporting it
> > > without stopping, but that's not really what the word "skipping"
> > > normally means. Finally, it seems a bit over-engineered: do we really
> > > need 183 test cases to check that detecting a problem doesn't lead to
> > > an abort? Like, if that's the purpose of the test, I'd expect it to
> > > check one corrupt relation and one non-corrupt relation, each with and
> > > without the no-error behavior. And that's about it. Or maybe it's
> > > talking about skipping pages during the checks, because those pages
> > > are all-visible or all-frozen? It's not very clear to me what's going
> > > on here.
> >
> > The "skipping" did originally refer to testing verify_heapam()'s option to skip all-visible or all-frozen blocks.
Ihave renamed it 001_verify_heapam.pl, since it tests that function. 
> >
> > >
> > > + TransactionId nextKnownValidXid;
> > > + TransactionId oldestValidXid;
> > >
> > > Please add explanatory comments indicating what these are intended to
> > > mean.
> >
> > Done.
> >
> > > For most of the the structure members, the brief comments
> > > already present seem sufficient; but here, more explanation looks
> > > necessary and less is provided. The "Values for returning tuples"
> > > could possibly also use some more detail.
> >
> > Ok, I've expanded the comments for these.
> >
> > > +#define HEAPCHECK_RELATION_COLS 8
> > >
> > > I think this should really be at the top of the file someplace.
> > > Sometimes people have adopted this style when the #define is only used
> > > within the function that contains it, but that's not the case here.
> >
> > Done.
> >
> > >
> > > + ereport(ERROR,
> > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> > > + errmsg("unrecognized parameter for 'skip': %s", skip),
> > > + errhint("please choose from 'all visible', 'all frozen', "
> > > + "or NULL")));
> > >
> > > I think it would be better if we had three string values selecting the
> > > different behaviors, and made the parameter NOT NULL but with a
> > > default. It seems like that would be easier to understand. Right now,
> > > I can tell that my options for what to skip are "all visible", "all
> > > frozen", and, uh, some other thing that I don't know what it is. I'm
> > > gonna guess the third option is to skip nothing, but it seems best to
> > > make that explicit. Also, should we maybe consider spelling this
> > > 'all-visible' and 'all-frozen' with dashes, instead of using spaces?
> > > Spaces in an option value seems a little icky to me somehow.
> >
> > I've made the options 'all-visible', 'all-frozen', and 'none'.  It defaults to 'none'.  I did not mark the function
asstrict, as I think NULL is a reasonable value (and the default) for startblock and endblock. 
> >
> > > + int64 startblock = -1;
> > > + int64 endblock = -1;
> > > ...
> > > + if (!PG_ARGISNULL(3))
> > > + startblock = PG_GETARG_INT64(3);
> > > + if (!PG_ARGISNULL(4))
> > > + endblock = PG_GETARG_INT64(4);
> > > ...
> > > + if (startblock < 0)
> > > + startblock = 0;
> > > + if (endblock < 0 || endblock > ctx.nblocks)
> > > + endblock = ctx.nblocks;
> > > +
> > > + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++)
> > >
> > > So, the user can specify a negative value explicitly and it will be
> > > treated as the default, and an endblock value that's larger than the
> > > relation size will be treated as the relation size. The way pg_prewarm
> > > does the corresponding checks seems superior: null indicates the
> > > default value, and any non-null value must be within range or you get
> > > an error. Also, you seem to be treating endblock as the first block
> > > that should not be checked, whereas pg_prewarm takes what seems to me
> > > to be the more natural interpretation: the end block is the last block
> > > that IS checked. If you do it this way, then someone who specifies the
> > > same start and end block will check no blocks -- silently, I think.
> >
> > Under that regime, for relations with one block of data, (startblock=0, endblock=0) means "check the zero'th
block",and for relations with no blocks of data, specifying any non-null (startblock,endblock) pair raises an
exception. I don't like that too much, but I'm happy to defer to precedent.  Since you say pg_prewarm works this way (I
didnot check), I have changed verify_heapam to do likewise. 
> >
> > > +               if (skip_all_frozen || skip_all_visible)
> > >
> > > Since you can't skip all frozen without skipping all visible, this
> > > test could be simplified. Or you could introduce a three-valued enum
> > > and test that skip_pages != SKIP_PAGES_NONE, which might be even
> > > better.
> >
> > It works now with a three-valued enum.
> >
> > > + /* We must unlock the page from the prior iteration, if any */
> > > + Assert(ctx.blkno == InvalidBlockNumber || ctx.buffer != InvalidBuffer);
> > >
> > > I don't understand this assertion, and I don't understand the comment,
> > > either. I think ctx.blkno can never be equal to InvalidBlockNumber
> > > because we never set it to anything outside the range of 0..(endblock
> > > - 1), and I think ctx.buffer must always be unequal to InvalidBuffer
> > > because we just initialized it by calling ReadBufferExtended(). So I
> > > think this assertion would still pass if we wrote && rather than ||.
> > > But even then, I don't know what that has to do with the comment or
> > > why it even makes sense to have an assertion for that in the first
> > > place.
> >
> > Yes, it is vestigial.  Removed.
> >
> > > +       /*
> > > +        * Open the relation.  We use ShareUpdateExclusive to prevent concurrent
> > > +        * vacuums from changing the relfrozenxid, relminmxid, or advancing the
> > > +        * global oldestXid to be newer than those.  This protection
> > > saves us from
> > > +        * having to reacquire the locks and recheck those minimums for every
> > > +        * tuple, which would be expensive.
> > > +        */
> > > +       ctx.rel = relation_open(relid, ShareUpdateExclusiveLock);
> > >
> > > I don't think we'd need to recheck for every tuple, would we? Just for
> > > cases where there's an apparent violation of the rules.
> >
> > It's a bit fuzzy what an "apparent violation" might be if both ends of the range of valid xids may be moving, and
arbitrarilymuch.  It's also not clear how often to recheck, since you'd be dealing with a race condition no matter how
oftenyou check.  Perhaps the comments shouldn't mention how often you'd have to recheck, since there is no really
defensiblechoice for that.  I removed the offending sentence. 
> >
> > > I guess that
> > > could still be expensive if there's a lot of them, but needing
> > > ShareUpdateExclusiveLock rather than only AccessShareLock is a little
> > > unfortunate.
> >
> > I welcome strategies that would allow for taking a lesser lock.
> >
> > > It's also unclear to me why this concerns itself with relfrozenxid and
> > > the cluster-wide oldestXid value but not with datfrozenxid. It seems
> > > like if we're going to sanity-check the relfrozenxid against the
> > > cluster-wide value, we ought to also check it against the
> > > database-wide value. Checking neither would also seem like a plausible
> > > choice. But it seems very strange to only check against the
> > > cluster-wide value.
> >
> > If the relation has a normal relfrozenxid, then the oldest valid xid we can encounter in the table is relfrozenxid.
Otherwise, each row needs to be compared against some other minimum xid value. 
> >
> > Logically, that other minimum xid value should be the oldest valid xid for the database, which must logically be at
leastas old as any valid row in the table and no older than the oldest valid xid for the cluster. 
> >
> > Unfortunately, if the comments in commands/vacuum.c circa line 1572 can be believed, and if I am reading them
correctly,the stored value for the oldest valid xid in the database has been known to be corrupted by bugs in
pg_upgrade. This is awful.  If I compare the xid of a row in a table against the oldest xid value for the database, and
thexid of the row is older, what can I do?  I don't have a principled basis for determining which one of them is wrong. 
> >
> > The logic in verify_heapam is conservative; it makes no guarantees about finding and reporting all corruption, but
ifit does report a row as corrupt, you can bank on that, bugs in verify_heapam itself not withstanding.  I think this
isa good choice; a tool with only false negatives is much more useful than one with both false positives and false
negatives.
> >
> > I have added a comment about my reasoning to verify_heapam.c.  I'm happy to be convinced of a better strategy for
handlingthis situation. 
> >
> > >
> > > +               StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber,
> > > +                                                "InvalidOffsetNumber
> > > increments to FirstOffsetNumber");
> > >
> > > If you are going to rely on this property, I agree that it is good to
> > > check it. But it would be better to NOT rely on this property, and I
> > > suspect the code can be written quite cleanly without relying on it.
> > > And actually, that's what you did, because you first set ctx.offnum =
> > > InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in
> > > the loop initializer. So AFAICS the first initializer, and the static
> > > assert, are pointless.
> >
> > Ah, right you are.  Removed.
> >
> > >
> > > +                       if (ItemIdIsRedirected(ctx.itemid))
> > > +                       {
> > > +                               uint16 redirect = ItemIdGetRedirect(ctx.itemid);
> > > +                               if (redirect <= SizeOfPageHeaderData
> > > || redirect >= ph->pd_lower)
> > > ...
> > > +                               if ((redirect - SizeOfPageHeaderData)
> > > % sizeof(uint16))
> > >
> > > I think that ItemIdGetRedirect() returns an offset, not a byte
> > > position. So the expectation that I would have is that it would be any
> > > integer >= 0 and <= maxoff. Am I confused?
> >
> > I think you are right about it returning an offset, which should be between FirstOffsetNumber and maxoff,
inclusive. I have updated the checks. 
> >
> > > BTW, it seems like it might
> > > be good to complain if the item to which it points is LP_UNUSED...
> > > AFAIK that shouldn't happen.
> >
> > Thanks for mentioning that.  It now checks for that.
> >
> > > +                                errmsg("\"%s\" is not a heap AM",
> > >
> > > I think the correct wording would be just "is not a heap." The "heap
> > > AM" is the thing in pg_am, not a specific table.
> >
> > Fixed.
> >
> > > +confess(HeapCheckContext * ctx, char *msg)
> > > +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx)
> > > +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
> > >
> > > This is what happens when you pgindent without adding all the right
> > > things to typedefs.list first ... or when you don't pgindent and have
> > > odd ideas about how to indent things.
> >
> > Hmm.  I don't see the three lines of code you are quoting.  Which patch is that from?
> >
> > >
> > > +       /*
> > > +        * In principle, there is nothing to prevent a scan over a large, highly
> > > +        * corrupted table from using workmem worth of memory building up the
> > > +        * tuplestore.  Don't leak the msg argument memory.
> > > +        */
> > > +       pfree(msg);
> > >
> > > Maybe change the second sentence to something like: "That should be
> > > OK, else the user can lower work_mem, but we'd better not leak any
> > > additional memory."
> >
> > It may be a little wordy, but I went with
> >
> >     /*
> >      * In principle, there is nothing to prevent a scan over a large, highly
> >      * corrupted table from using workmem worth of memory building up the
> >      * tuplestore.  That's ok, but if we also leak the msg argument memory
> >      * until the end of the query, we could exceed workmem by more than a
> >      * trivial amount.  Therefore, free the msg argument each time we are
> >      * called rather than waiting for our current memory context to be freed.
> >      */
> >
> > > +/*
> > > + * check_tuphdr_xids
> > > + *
> > > + *     Determine whether tuples are visible for verification.  Similar to
> > > + *  HeapTupleSatisfiesVacuum, but with critical differences.
> > > + *
> > > + *  1) Does not touch hint bits.  It seems imprudent to write hint bits
> > > + *     to a table during a corruption check.
> > > + *  2) Only makes a boolean determination of whether verification should
> > > + *     see the tuple, rather than doing extra work for vacuum-related
> > > + *     categorization.
> > > + *
> > > + *  The caller should already have checked that xmin and xmax are not out of
> > > + *  bounds for the relation.
> > > + */
> > >
> > > First, check_tuphdr_xids() doesn't seem like a very good name. If you
> > > have a function with that name and, like this one, it returns Boolean,
> > > what does true mean? What does false mean? Kinda hard to tell. And
> > > also, check the tuple header XIDs *for what*? If you called it, say,
> > > tuple_is_visible(), that would be self-evident.
> >
> > Changed.
> >
> > > Second, consider that we hold at least AccessShareLock on the relation
> > > - actually, ATM we hold ShareUpdateExclusiveLock. Either way, there
> > > cannot be a concurrent modification to the tuple descriptor in
> > > progress. Therefore, I think that only a HEAPTUPLE_DEAD tuple is
> > > potentially using a non-current schema. If the tuple is
> > > HEAPTUPLE_INSERT_IN_PROGRESS, there's either no ADD COLUMN in the
> > > inserting transaction, or that transaction committed before we got our
> > > lock. Similarly if it's HEAPTUPLE_DELETE_IN_PROGRESS or
> > > HEAPTUPLE_RECENTLY_DEAD, the original inserter must've committed
> > > before we got our lock. Or if it's both inserted and deleted in the
> > > same transaction, say, then that transaction committed before we got
> > > our lock or else contains no relevant DDL. IOW, I think you can check
> > > everything but dead tuples here.
> >
> > Ok, I have changed tuple_is_visible to return true rather than false for those other cases.
> >
> > > Capitalization and punctuation for messages complaining about problems
> > > need to be consistent. verify_heapam() has "Invalid redirect line
> > > pointer offset %u out of bounds" which starts with a capital letter,
> > > but check_tuphdr_xids() has "heap tuple with XMAX_IS_MULTI is neither
> > > LOCKED_ONLY nor has a valid xmax" which does not. I vote for lower
> > > case, but in any event it should be the same.
> >
> > I standardized on all lowercase text, though I left embedded symbols and constants such as LOCKED_ONLY alone.
> >
> > > Also,
> > > check_tuphdr_xids() has "tuple xvac = %u invalid" which is either a
> > > debugging leftover or a very unclear complaint.
> >
> > Right.  That has been changed to "old-style VACUUM FULL transaction ID %u is invalid in this relation".
> >
> > > I think some real work
> > > needs to be put into the phrasing of these messages so that it's more
> > > clear exactly what is going on and why it's bad. For example the first
> > > example in this paragraph is clearly a problem of some kind, but it's
> > > not very clear exactly what is happening: is %u the offset of the
> > > invalid line redirect or the value to which it points? I don't think
> > > the phrasing is very grammatical, which makes it hard to tell which is
> > > meant, and I actually think it would be a good idea to include both
> > > things.
> >
> > Beware that every row returned from amcheck has more fields than just the error message.
> >
> >     blkno OUT bigint,
> >     offnum OUT integer,
> >     lp_off OUT smallint,
> >     lp_flags OUT smallint,
> >     lp_len OUT smallint,
> >     attnum OUT integer,
> >     chunk OUT integer,
> >     msg OUT text
> >
> > Rather than including blkno, offnum, lp_off, lp_flags, lp_len, attnum, or chunk in the message, it would be better
toremove these things from messages that include them.  For the specific message under consideration, I've converted
thetext to "line pointer redirection to item at offset number %u is outside valid bounds %u .. %u".  That avoids
duplicatingthe offset information of the referring item, while reporting to offset of the referred item. 
> >
> > > Project policy is generally against splitting a string across multiple
> > > lines to fit within 80 characters. We like to fit within 80
> > > characters, but we like to be able to grep for strings more, and
> > > breaking them up like this makes that harder.
> >
> > Thanks for clarifying the project policy.  I joined these message strings back together.

In v11-0001 and v11-0002 patches, there are still a few more errmsg that need to
be joined.

e.g:

+ /* check to see if caller supports us returning a tuplestore */
+ if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+ ereport(ERROR,
+ (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("set-valued function called in context that cannot "
+ "accept a set")));
+ if (!(rsinfo->allowedModes & SFRM_Materialize))
+ ereport(ERROR,
+ (errcode(ERRCODE_SYNTAX_ERROR),
+ errmsg("materialize mode required, but it is not allowed "
+ "in this context")));



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jul 20, 2020, at 11:50 PM, Amul Sul <sulamul@gmail.com> wrote:
>
> On Tue, Jul 21, 2020 at 10:58 AM Amul Sul <sulamul@gmail.com> wrote:
>>
>> Hi Mark,
>>
>> I think new structures should be listed in src/tools/pgindent/typedefs.list,
>> otherwise, pgindent might disturb its indentation.
>>

<snip>

>
> In v11-0001 and v11-0002 patches, there are still a few more errmsg that need to
> be joined.
>
> e.g:
>
> + /* check to see if caller supports us returning a tuplestore */
> + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
> + ereport(ERROR,
> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> + errmsg("set-valued function called in context that cannot "
> + "accept a set")));
> + if (!(rsinfo->allowedModes & SFRM_Materialize))
> + ereport(ERROR,
> + (errcode(ERRCODE_SYNTAX_ERROR),
> + errmsg("materialize mode required, but it is not allowed "
> + "in this context")));

Thanks for the review!

I believe these v12 patches resolve the two issues you raised.



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Amul Sul
Date:
On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> [....]
> >
> > +               StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber,
> > +                                                "InvalidOffsetNumber
> > increments to FirstOffsetNumber");
> >
> > If you are going to rely on this property, I agree that it is good to
> > check it. But it would be better to NOT rely on this property, and I
> > suspect the code can be written quite cleanly without relying on it.
> > And actually, that's what you did, because you first set ctx.offnum =
> > InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in
> > the loop initializer. So AFAICS the first initializer, and the static
> > assert, are pointless.
>
> Ah, right you are.  Removed.
>

I can see the same assert and the unnecessary assignment in v12-0002,  is that
the same thing that is supposed to be removed, or am I missing something?

> [....]
> > +confess(HeapCheckContext * ctx, char *msg)
> > +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx)
> > +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
> >
> > This is what happens when you pgindent without adding all the right
> > things to typedefs.list first ... or when you don't pgindent and have
> > odd ideas about how to indent things.
>
> Hmm.  I don't see the three lines of code you are quoting.  Which patch is that from?
>

I think it was the same thing related to my previous suggestion to list new
structures to typedefs.list.  V12 has listed new structures but I think there
are still some more adjustments needed in the code e.g. see space between
HeapCheckContext and * (asterisk) that need to be fixed. I am not sure if the
pgindent will do that or not.

Here are a few more minor comments for the v12-0002 patch & some of them
apply to other patches as well:

 #include "utils/snapmgr.h"
-
+#include "amcheck.h"

Doesn't seem to be at the correct place -- need to be in sorted order.


+ if (!PG_ARGISNULL(3))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("starting block " INT64_FORMAT
+ " is out of bounds for relation with no blocks",
+ PG_GETARG_INT64(3))));
+ if (!PG_ARGISNULL(4))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("ending block " INT64_FORMAT
+ " is out of bounds for relation with no blocks",
+ PG_GETARG_INT64(4))));

I think these errmsg() strings also should be in one line.


+ if (fatal)
+ {
+ if (ctx.toast_indexes)
+ toast_close_indexes(ctx.toast_indexes, ctx.num_toast_indexes,
+ ShareUpdateExclusiveLock);
+ if (ctx.toastrel)
+ table_close(ctx.toastrel, ShareUpdateExclusiveLock);

Toast index and rel closing block style is not the same as at the ending of
verify_heapam().


+ /* If we get this far, we know the relation has at least one block */
+ startblock = PG_ARGISNULL(3) ? 0 : PG_GETARG_INT64(3);
+ endblock = PG_ARGISNULL(4) ? ((int64) ctx.nblocks) - 1 : PG_GETARG_INT64(4);
+ if (startblock < 0 || endblock >= ctx.nblocks || startblock > endblock)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("block range " INT64_FORMAT " .. " INT64_FORMAT
+ " is out of bounds for relation with block count %u",
+ startblock, endblock, ctx.nblocks)));
+
...
...
+ if (startblock < 0)
+ startblock = 0;
+ if (endblock < 0 || endblock > ctx.nblocks)
+ endblock = ctx.nblocks;

Other than endblock < 0 case, do we really need that?  I think due to the above
error check the rest of the cases will not reach this place.


+ confess(ctx, psprintf(
+   "tuple xmax %u follows last assigned xid %u",
+   xmax, ctx->nextKnownValidXid));
+ fatal = true;
+ }
+ }
+
+ /* Check for tuple header corruption */
+ if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader)
+ {
+ confess(ctx,
+ psprintf("tuple's header size is %u bytes which is less than the %u
byte minimum valid header size",
+ ctx->tuphdr->t_hoff,
+ (unsigned) SizeofHeapTupleHeader));

confess() call has two different code styles, first one where psprintf()'s only
argument got its own line and second style where psprintf has its own line with
the argument. I think the 2nd style is what we do follow & correct, not the
former.


+ if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
+ ereport(ERROR,
+ (errcode(ERRCODE_WRONG_OBJECT_TYPE),
+ errmsg("\"%s\" is not a heap",
+ RelationGetRelationName(rel))));

Like elsewhere,  can we have errmsg as "only heap AM is supported" and error
code is ERRCODE_FEATURE_NOT_SUPPORTED ?


That all, for now, apologize for multiple review emails.

Regards,
Amul



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jul 26, 2020, at 9:27 PM, Amul Sul <sulamul@gmail.com> wrote:
>
> On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> [....]
>>>
>>> +               StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber,
>>> +                                                "InvalidOffsetNumber
>>> increments to FirstOffsetNumber");
>>>
>>> If you are going to rely on this property, I agree that it is good to
>>> check it. But it would be better to NOT rely on this property, and I
>>> suspect the code can be written quite cleanly without relying on it.
>>> And actually, that's what you did, because you first set ctx.offnum =
>>> InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in
>>> the loop initializer. So AFAICS the first initializer, and the static
>>> assert, are pointless.
>>
>> Ah, right you are.  Removed.
>>
>
> I can see the same assert and the unnecessary assignment in v12-0002,  is that
> the same thing that is supposed to be removed, or am I missing something?

That's the same thing.  I removed it, but obviously I somehow removed the removal prior to making the patch.  My best
guess is that I reverted some set of changes that unintentionally included this one.

>
>> [....]
>>> +confess(HeapCheckContext * ctx, char *msg)
>>> +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx)
>>> +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx)
>>>
>>> This is what happens when you pgindent without adding all the right
>>> things to typedefs.list first ... or when you don't pgindent and have
>>> odd ideas about how to indent things.
>>
>> Hmm.  I don't see the three lines of code you are quoting.  Which patch is that from?
>>
>
> I think it was the same thing related to my previous suggestion to list new
> structures to typedefs.list.  V12 has listed new structures but I think there
> are still some more adjustments needed in the code e.g. see space between
> HeapCheckContext and * (asterisk) that need to be fixed. I am not sure if the
> pgindent will do that or not.

Hmm.  I'm not seeing an example of HeapCheckContext with wrong spacing.  Can you provide a file and line number?  There
was a problem with enum SkipPages.  I've added that to the typedefs.list and rerun pgindent.

While looking at that, I noticed that the function and variable naming conventions in this patch were irregular, with
names like TransactionIdValidInRel (init-caps) and tuple_is_visible (underscores), so I spent some time cleaning that up
for v13.

> Here are a few more minor comments for the v12-0002 patch & some of them
> apply to other patches as well:
>
> #include "utils/snapmgr.h"
> -
> +#include "amcheck.h"
>
> Doesn't seem to be at the correct place -- need to be in sorted order.

Fixed.

> + if (!PG_ARGISNULL(3))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("starting block " INT64_FORMAT
> + " is out of bounds for relation with no blocks",
> + PG_GETARG_INT64(3))));
> + if (!PG_ARGISNULL(4))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("ending block " INT64_FORMAT
> + " is out of bounds for relation with no blocks",
> + PG_GETARG_INT64(4))));
>
> I think these errmsg() strings also should be in one line.

I chose not to do so, because the INT64_FORMAT bit breaks up the text even if placed all on one line.  I don't feel
strongly about that, though, so I'll join them for v13.
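
For reference, the joined form still reads fine in the source because INT64_FORMAT is just another string literal that the compiler concatenates, e.g.:

    if (!PG_ARGISNULL(3))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("starting block " INT64_FORMAT " is out of bounds for relation with no blocks",
                        PG_GETARG_INT64(3))));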

> + if (fatal)
> + {
> + if (ctx.toast_indexes)
> + toast_close_indexes(ctx.toast_indexes, ctx.num_toast_indexes,
> + ShareUpdateExclusiveLock);
> + if (ctx.toastrel)
> + table_close(ctx.toastrel, ShareUpdateExclusiveLock);
>
> Toast index and rel closing block style is not the same as at the ending of
> verify_heapam().

I've harmonized the two.  Thanks for noticing.

> + /* If we get this far, we know the relation has at least one block */
> + startblock = PG_ARGISNULL(3) ? 0 : PG_GETARG_INT64(3);
> + endblock = PG_ARGISNULL(4) ? ((int64) ctx.nblocks) - 1 : PG_GETARG_INT64(4);
> + if (startblock < 0 || endblock >= ctx.nblocks || startblock > endblock)
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("block range " INT64_FORMAT " .. " INT64_FORMAT
> + " is out of bounds for relation with block count %u",
> + startblock, endblock, ctx.nblocks)));
> +
> ...
> ...
> + if (startblock < 0)
> + startblock = 0;
> + if (endblock < 0 || endblock > ctx.nblocks)
> + endblock = ctx.nblocks;
>
> Other than endblock < 0 case

This case does not need special checking, either.  The combination of checking that startblock >= 0 and that startblock
<= endblock already handles it.

> , do we really need that?  I think due to the above
> error check the rest of the cases will not reach this place.

We don't need any of that.  Removed in v13.

> + confess(ctx, psprintf(
> +   "tuple xmax %u follows last assigned xid %u",
> +   xmax, ctx->nextKnownValidXid));
> + fatal = true;
> + }
> + }
> +
> + /* Check for tuple header corruption */
> + if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader)
> + {
> + confess(ctx,
> + psprintf("tuple's header size is %u bytes which is less than the %u
> byte minimum valid header size",
> + ctx->tuphdr->t_hoff,
> + (unsigned) SizeofHeapTupleHeader));
>
> confess() call has two different code styles, first one where psprintf()'s only
> argument got its own line and second style where psprintf has its own line with
> the argument. I think the 2nd style is what we do follow & correct, not the
> former.

Ok, standardized in v13.

> + if (rel->rd_rel->relam != HEAP_TABLE_AM_OID)
> + ereport(ERROR,
> + (errcode(ERRCODE_WRONG_OBJECT_TYPE),
> + errmsg("\"%s\" is not a heap",
> + RelationGetRelationName(rel))));
>
> Like elsewhere,  can we have errmsg as "only heap AM is supported" and error
> code is ERRCODE_FEATURE_NOT_SUPPORTED ?

I'm indifferent about that change.  Done for v13.

> That all, for now, apologize for multiple review emails.

Not at all!  I appreciate all the reviews.



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Robert Haas
Date:
On Mon, Jul 20, 2020 at 5:02 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> I've made the options 'all-visible', 'all-frozen', and 'none'.  It defaults to 'none'.

That looks nice.

> > I guess that
> > could still be expensive if there's a lot of them, but needing
> > ShareUpdateExclusiveLock rather than only AccessShareLock is a little
> > unfortunate.
>
> I welcome strategies that would allow for taking a lesser lock.

I guess I'm not seeing why you need any particular strategy here. Say
that at the beginning you note the starting relfrozenxid of the table
-- I think I would lean toward just ignoring datfrozenxid and the
cluster-wide value completely. You also note the current value of the
transaction ID counter. Those are the two ends of the acceptable
range.

Let's first consider the oldest acceptable XID, bounded by
relfrozenxid. If you see a value that is older than the relfrozenxid
value that you noted at the start, it is definitely invalid. If you
see a newer value, it could still be older than the table's current
relfrozenxid, but that doesn't seem very worrisome. If the user
vacuumed the table while they were running this tool, they can always
run the tool again afterward if they wish. Forcing the vacuum to wait
by taking ShareUpdateExclusiveLock doesn't actually solve anything
anyway: you STILL won't notice any problems the vacuum introduces, and
in fact you are now GUARANTEED not to notice them, plus now the vacuum
happens later.

Now let's consider the newest acceptable XID, bounded by the value of
the transaction ID counter. Any time you see a newer XID than the last
value of the transaction ID counter that you observed, you go observe
it again. If the value from the table still looks invalid, then you
complain about it. Either way, you remember the new observation and
check future tuples against that value. I think the patch is already
doing this anyway; if it weren't, you'd need an even stronger lock,
one sufficient to prevent any insert/update/delete activity on the
table altogether.
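
In rough pseudocode (the field names follow the patch's HeapCheckContext, but this is only a sketch, ignoring special XIDs like FrozenTransactionId):

    static bool
    xid_in_valid_range(HeapCheckContext *ctx, TransactionId xid)
    {
        /* Older than the relfrozenxid we noted at the start: invalid. */
        if (TransactionIdPrecedes(xid, ctx->relfrozenxid))
            return false;

        /* Within the range we have already observed: fine. */
        if (TransactionIdPrecedes(xid, ctx->next_valid_xid))
            return true;

        /*
         * Apparently in the future: re-read the transaction ID counter
         * before concluding that the value is corrupt.
         */
        ctx->next_valid_xid =
            XidFromFullTransactionId(ReadNextFullTransactionId());
        return TransactionIdPrecedes(xid, ctx->next_valid_xid);
    }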

Maybe I'm just being dense here -- exactly what problem are you worried about?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Mon, Jul 27, 2020 at 1:02 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Not at all!  I appreciate all the reviews.

Reviewing 0002, reading through verify_heapam.c:

+typedef enum SkipPages
+{
+ SKIP_ALL_FROZEN_PAGES,
+ SKIP_ALL_VISIBLE_PAGES,
+ SKIP_PAGES_NONE
+} SkipPages;

This looks inconsistent. Maybe just start them all with SKIP_PAGES_.
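
Something like:

    typedef enum SkipPages
    {
        SKIP_PAGES_ALL_FROZEN,
        SKIP_PAGES_ALL_VISIBLE,
        SKIP_PAGES_NONE
    } SkipPages;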

+ if (PG_ARGISNULL(0))
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("missing required parameter for 'rel'")));

This doesn't look much like other error messages in the code. Do
something like git grep -A4 PG_ARGISNULL | grep -A3 ereport and study
the comparables.

+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("unrecognized parameter for 'skip': %s", skip),
+ errhint("please choose from 'all-visible', 'all-frozen', or 'none'")));

Same problem. Check pg_prewarm's handling of the prewarm type, or
EXPLAIN's handling of the FORMAT option, or similar examples. Read the
message style guidelines concerning punctuation of hint and detail
messages.
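
For instance, pg_prewarm's comparable checks look roughly like this (from memory, so treat the wording as approximate):

    if (PG_ARGISNULL(0))
        ereport(ERROR,
                (errcode(ERRCODE_NULL_VALUE_NOT_ALLOWED),
                 errmsg("relation cannot be null")));

    /* ... and for a bad option value ... */
    ereport(ERROR,
            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
             errmsg("invalid prewarm type"),
             errhint("Valid prewarm types are \"prefetch\", \"read\", and \"buffer\".")));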

+ * Bugs in pg_upgrade are reported (see commands/vacuum.c circa line 1572)
+ * to have sometimes rendered the oldest xid value for a database invalid.
+ * It seems unwise to report rows as corrupt for failing to be newer than
+ * a value which itself may be corrupt.  We instead use the oldest xid for
+ * the entire cluster, which must be at least as old as the oldest xid for
+ * our database.

This kind of reference to another comment will not age well; line
numbers and files change a lot. But I think the right thing to do here
is just rely on relfrozenxid and relminmxid. If the table is
inconsistent with those, then something needs fixing. datfrozenxid and
the cluster-wide value can look out for themselves. The corruption
detector shouldn't be trying to work around any bugs in setting
relfrozenxid itself; such problems are arguably precisely what we're
here to find.

+/*
+ * confess
+ *
+ *   Return a message about corruption, including information
+ *   about where in the relation the corruption was found.
+ *
+ *   The msg argument is pfree'd by this function.
+ */
+static void
+confess(HeapCheckContext *ctx, char *msg)

Contrary to what the comments say, the function doesn't return a
message about corruption or anything else. It returns void.

I don't really like the name, either. I get that it's probably
inspired by Perl, but I think it should be given a less-clever name
like report_corruption() or something.

+ * corrupted table from using workmem worth of memory building up the

This kind of thing destroys grep-ability. If you're going to refer to
work_mem, you gotta spell it the same way we do everywhere else.

+ * Helper function to construct the TupleDesc needed by verify_heapam.

Instead of saying it's the TupleDesc somebody needs, how about saying
that it's the TupleDesc that we'll use to report problems that we find
while scanning the heap, or something like that?

+ * Given a TransactionId, attempt to interpret it as a valid
+ * FullTransactionId, neither in the future nor overlong in
+ * the past.  Stores the inferred FullTransactionId in *fxid.

It really doesn't, because there's no such thing as 'fxid' referenced
anywhere here. You should really make the effort to proofread your
patches before posting, and adjust comments and so on as you go.
Otherwise reviewing takes longer, and if you keep introducing new
stuff like this as you fix other stuff, you can fail to ever produce a
committable patch.

+ * Determine whether tuples are visible for verification.  Similar to
+ *  HeapTupleSatisfiesVacuum, but with critical differences.

Not accurate, because it also reports problems, which is not mentioned
anywhere in the function header comment that purports to be a detailed
description of what the function does.

+ else if (TransactionIdIsCurrentTransactionId(raw_xmin))
+ return true; /* insert or delete in progress */
+ else if (TransactionIdIsInProgress(raw_xmin))
+ return true; /* HEAPTUPLE_INSERT_IN_PROGRESS */
+ else if (!TransactionIdDidCommit(raw_xmin))
+ {
+ return false; /* HEAPTUPLE_DEAD */
+ }

One of these cases is not punctuated like the others.

+ pstrdup("heap tuple with XMAX_IS_MULTI is neither LOCKED_ONLY nor
has a valid xmax"));

1. I don't think that's very grammatical.

2. Why abbreviate HEAP_XMAX_IS_MULTI to XMAX_IS_MULTI and
HEAP_XMAX_IS_LOCKED_ONLY to LOCKED_ONLY? I don't even think you should
be referencing C constant names here at all, and if you are I don't
think you should abbreviate, and if you do abbreviate I don't think
you should omit different numbers of words depending on which constant
it is.

I wonder what the intended division of responsibility is here,
exactly. It seems like you've ended up with some sanity checks in
check_tuple() before tuple_is_visible() is called, and others in
tuple_is_visible() proper. As far as I can see the comments don't
really discuss the logic behind the split, but there's clearly a close
relationship between the two sets of checks, even to the point where
you have "heap tuple with XMAX_IS_MULTI is neither LOCKED_ONLY nor has
a valid xmax" in tuple_is_visible() and "tuple xmax marked
incompatibly as keys updated and locked only" in check_tuple(). Now,
those are not the same check, but they seem like closely related
things, so it's not ideal that they happen in different functions with
differently-formatted messages to report problems and no explanation
of why it's different.

I think it might make sense here to see whether you could either move
more stuff out of tuple_is_visible(), so that it really just checks
whether the tuple is visible, or move more stuff into it, so that it
has the job not only of checking whether we should continue with
checks on the tuple contents but also complaining about any other
visibility problems. Or if neither of those make sense then there
should be a stronger attempt to rationalize in the comments what
checks are going where and for what reason, and also a stronger
attempt to rationalize the message wording.

+ curchunk = DatumGetInt32(fastgetattr(toasttup, 2,
+ ctx->toast_rel->rd_att, &isnull));

Should we be worrying about the possibility of fastgetattr crapping
out if the TOAST tuple is corrupted?

+ if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
+ {
+ confess(ctx,
+ psprintf("tuple attribute should start at offset %u, but tuple
length is only %u",
+ ctx->tuphdr->t_hoff + ctx->offset, ctx->lp_len));
+ return false;
+ }
+
+ /* Skip null values */
+ if (infomask & HEAP_HASNULL && att_isnull(ctx->attnum, ctx->tuphdr->t_bits))
+ return true;
+
+ /* Skip non-varlena values, but update offset first */
+ if (thisatt->attlen != -1)
+ {
+ ctx->offset = att_align_nominal(ctx->offset, thisatt->attalign);
+ ctx->offset = att_addlength_pointer(ctx->offset, thisatt->attlen,
+ tp + ctx->offset);
+ return true;
+ }

This looks like it's not going to complain about a fixed-length
attribute that overruns the tuple length. There's code further down
that handles that case for a varlena attribute, but there's nothing
comparable for the fixed-length case.
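
What I have in mind is roughly this (a sketch, covering only attlen > 0; a cstring attribute with attlen == -2 would need a bounded strlen as well):

    if (thisatt->attlen > 0)
    {
        ctx->offset = att_align_nominal(ctx->offset, thisatt->attalign);

        /* Complain if the fixed-length value would run past the tuple. */
        if (ctx->tuphdr->t_hoff + ctx->offset + thisatt->attlen > ctx->lp_len)
        {
            confess(ctx,
                    psprintf("tuple attribute of length %u ends at offset %u, but tuple length is only %u",
                             (unsigned) thisatt->attlen,
                             (unsigned) (ctx->tuphdr->t_hoff + ctx->offset + thisatt->attlen),
                             ctx->lp_len));
            return false;
        }
        ctx->offset += thisatt->attlen;
        return true;
    }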

+ confess(ctx,
+ psprintf("%s toast at offset %u is unexpected",
+ va_tag == VARTAG_INDIRECT ? "indirect" :
+ va_tag == VARTAG_EXPANDED_RO ? "expanded" :
+ va_tag == VARTAG_EXPANDED_RW ? "expanded" :
+ "unexpected",
+ ctx->tuphdr->t_hoff + ctx->offset));

I suggest "unexpected TOAST tag %d", without trying to convert to a
string. Such a conversion will likely fail in the case of genuine
corruption, and isn't meaningful even if it works.
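
In other words, something like (sketch):

    confess(ctx,
            psprintf("unexpected TOAST tag %d", va_tag));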

Again, let's try to standardize terminology here: most of the messages
in this function are now of the form "tuple attribute %d has some
problem" or "attribute %d has some problem", but some have neither.
Since we're separately returning attnum I don't see why it should be
in the message, and if we weren't separately returning attnum then it
ought to be in the message the same way all the time, rather than
sometimes writing "attribute" and other times "tuple attribute".

+ /* Check relminmxid against mxid, if any */
+ xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr);
+ if (infomask & HEAP_XMAX_IS_MULTI &&
+ MultiXactIdPrecedes(xmax, ctx->relminmxid))
+ {
+ confess(ctx,
+ psprintf("tuple xmax %u precedes relminmxid %u",
+ xmax, ctx->relminmxid));
+ fatal = true;
+ }

There are checks that an XID is neither too old nor too new, and
presumably something similar could be done for MultiXactIds, but here
you only check one end of the range. Seems like you should check both.
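
Something along these lines, perhaps (sketch only; I'm assuming ReadNextMultiXactId() gives the upper bound here, analogous to the XID counter):

    if (infomask & HEAP_XMAX_IS_MULTI)
    {
        MultiXactId next_mxid = ReadNextMultiXactId();

        if (MultiXactIdPrecedes(xmax, ctx->relminmxid))
            confess(ctx,
                    psprintf("tuple xmax %u precedes relminmxid %u",
                             xmax, ctx->relminmxid));
        else if (!MultiXactIdPrecedes(xmax, next_mxid))
            confess(ctx,
                    psprintf("tuple xmax %u follows next multixact %u",
                             xmax, next_mxid));
    }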

+ /* Check xmin against relfrozenxid */
+ xmin = HeapTupleHeaderGetXmin(ctx->tuphdr);
+ if (TransactionIdIsNormal(ctx->relfrozenxid) &&
+ TransactionIdIsNormal(xmin))
+ {
+ if (TransactionIdPrecedes(xmin, ctx->relfrozenxid))
+ {
+ confess(ctx,
+ psprintf("tuple xmin %u precedes relfrozenxid %u",
+ xmin, ctx->relfrozenxid));
+ fatal = true;
+ }
+ else if (!xid_valid_in_rel(xmin, ctx))
+ {
+ confess(ctx,
+ psprintf("tuple xmin %u follows last assigned xid %u",
+ xmin, ctx->next_valid_xid));
+ fatal = true;
+ }
+ }

Here you do check both ends of the range, but the comment claims
otherwise. Again, please proof-read for this kind of stuff.

+ /* Check xmax against relfrozenxid */

Ditto here.

+ psprintf("tuple's header size is %u bytes which is less than the %u
byte minimum valid header size",

I suggest: tuple data begins at byte %u, but the tuple header must be
at least %u bytes

+ psprintf("tuple's %u byte header size exceeds the %u byte length of
the entire tuple",

I suggest: tuple data begins at byte %u, but the entire tuple length
is only %u bytes

+ psprintf("tuple's user data offset %u not maximally aligned to %u",

I suggest: tuple data begins at byte %u, but that is not maximally aligned
Or: tuple data begins at byte %u, which is not a multiple of %u

That makes the messages look much more similar to each other
grammatically and is more consistent about calling things by the same
names.

+ psprintf("tuple with null values has user data offset %u rather than
the expected offset %u",
+ psprintf("tuple without null values has user data offset %u rather
than the expected offset %u",

I suggest merging these: tuple data offset %u, but expected offset %u
(%u attributes, %s)
where %s is either "has nulls" or "no nulls"

In fact, aren't several of the above checks redundant with this one?
Like, why check for a value less than SizeofHeapTupleHeader or that's
not properly aligned first? Just check this straightaway and call it
good.
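
That is, compute the one expected value and compare against it (a sketch, reusing the patch's context fields):

    uint32      expected_hoff;

    if (infomask & HEAP_HASNULL)
        expected_hoff = MAXALIGN(SizeofHeapTupleHeader + BITMAPLEN(ctx->natts));
    else
        expected_hoff = MAXALIGN(SizeofHeapTupleHeader);

    if (ctx->tuphdr->t_hoff != expected_hoff)
        confess(ctx,
                psprintf("tuple data offset %u, but expected offset %u (%u attributes, %s)",
                         ctx->tuphdr->t_hoff, expected_hoff, ctx->natts,
                         (infomask & HEAP_HASNULL) ? "has nulls" : "no nulls"));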

+ * If we get this far, the tuple is visible to us, so it must not be
+ * incompatible with our relDesc.  The natts field could be legitimately
+ * shorter than rel's natts, but it cannot be longer than rel's natts.

This is yet another case where you didn't update the comments.
tuple_is_visible() now checks whether the tuple is visible to anyone,
not whether it's visible to us, but the comment doesn't agree. In some
sense I think this comment is redundant with the previous one anyway,
because that one already talks about the tuple being visible. Maybe
just write: The tuple is visible, so it must be compatible with the
current version of the relation descriptor. It might have fewer
columns than are present in the relation descriptor, but it cannot
have more.

+ psprintf("tuple has %u attributes in relation with only %u attributes",
+ ctx->natts,
+ RelationGetDescr(ctx->rel)->natts));

I suggest: tuple has %u attributes, but relation has only %u attributes

+ /*
+ * Iterate over the attributes looking for broken toast values. This
+ * roughly follows the logic of heap_deform_tuple, except that it doesn't
+ * bother building up isnull[] and values[] arrays, since nobody wants
+ * them, and it unrolls anything that might trip over an Assert when
+ * processing corrupt data.
+ */
+ ctx->offset = 0;
+ for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++)
+ {
+ if (!check_tuple_attribute(ctx))
+ break;
+ }

I think this comment is too wordy. This text belongs in the header
comment of check_tuple_attribute(), not at the place where it gets
called. Otherwise, as you update what check_tuple_attribute() does,
you have to remember to come find this comment and fix it to match,
and you might forget to do that. In fact... looks like that already
happened, because check_tuple_attribute() now checks more than broken
TOAST attributes. Seems like you could just simplify this down to
something like "Now check each attribute." Also, you could lose the
extra braces.

- bt_index_check |             relname             | relpages
+ bt_index_check |             relname             | relpages

Don't include unrelated changes in the patch.

I'm not really sure that the list of fields you're displaying for each
reported problem really makes sense. I think the theory here should be
that we want to report the information that the user needs to localize
the problem but not everything that they could find out from
inspecting the page, and not things that are too specific to
particular classes of errors. So I would vote for keeping blkno,
offnum, and attnum, but I would lose lp_flags, lp_len, and chunk.
lp_off feels like it's a more arguable case: technically, it's a
locator for the problem, because it gives you the byte offset within
the page, but normally we reference tuples by TID, i.e. (blkno,
offset), not byte offset. On balance I'd be inclined to omit it.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jul 29, 2020, at 12:52 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jul 20, 2020 at 5:02 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> I've made the options 'all-visible', 'all-frozen', and 'none'.  It defaults to 'none'.
>
> That looks nice.
>
>>> I guess that
>>> could still be expensive if there's a lot of them, but needing
>>> ShareUpdateExclusiveLock rather than only AccessShareLock is a little
>>> unfortunate.
>>
>> I welcome strategies that would allow for taking a lesser lock.
>
> I guess I'm not seeing why you need any particular strategy here. Say
> that at the beginning you note the starting relfrozenxid of the table
> -- I think I would lean toward just ignoring datfrozenxid and the
> cluster-wide value completely. You also note the current value of the
> transaction ID counter. Those are the two ends of the acceptable
> range.
>
> Let's first consider the oldest acceptable XID, bounded by
> relfrozenxid. If you see a value that is older than the relfrozenxid
> value that you noted at the start, it is definitely invalid. If you
> see a newer value, it could still be older than the table's current
> relfrozenxid, but that doesn't seem very worrisome. If the user
> vacuumed the table while they were running this tool, they can always
> run the tool again afterward if they wish. Forcing the vacuum to wait
> by taking ShareUpdateExclusiveLock doesn't actually solve anything
> anyway: you STILL won't notice any problems the vacuum introduces, and
> in fact you are now GUARANTEED not to notice them, plus now the vacuum
> happens later.
>
> Now let's consider the newest acceptable XID, bounded by the value of
> the transaction ID counter. Any time you see a newer XID than the last
> value of the transaction ID counter that you observed, you go observe
> it again. If the value from the table still looks invalid, then you
> complain about it. Either way, you remember the new observation and
> check future tuples against that value. I think the patch is already
> doing this anyway; if it weren't, you'd need an even stronger lock,
> one sufficient to prevent any insert/update/delete activity on the
> table altogether.
>
> Maybe I'm just being dense here -- exactly what problem are you worried about?

Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax committed via TransactionIdDidCommit.  I am
worried about concurrent truncation of clog entries causing I/O errors on SLRU lookup when performing that check.  The
three strategies I had for dealing with that were taking the XactTruncationLock (formerly known as CLogTruncationLock,
for those reading this thread from the beginning), locking out vacuum, and the idea upthread from Andres about setting
PROC_IN_VACUUM and such.  Maybe I'm being dense and don't need to worry about this.  But I haven't convinced myself of
that, yet.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Andres Freund
Date:
Hi,

On 2020-07-30 13:18:01 -0700, Mark Dilger wrote:
> Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax committed via TransactionIdDidCommit.  I am
worried about concurrent truncation of clog entries causing I/O errors on SLRU lookup when performing that check.  The
three strategies I had for dealing with that were taking the XactTruncationLock (formerly known as CLogTruncationLock,
for those reading this thread from the beginning), locking out vacuum, and the idea upthread from Andres about setting
PROC_IN_VACUUM and such.  Maybe I'm being dense and don't need to worry about this.  But I haven't convinced myself of
that, yet.
 

I think it's not at all ok to look in the procarray or clog for xids
that are older than what you're announcing you may read. IOW I don't
think it's OK to just ignore the problem, or try to work around it by
holding XactTruncationLock.

Greetings,

Andres Freund



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Thu, Jul 30, 2020 at 4:18 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> > Maybe I'm just being dense here -- exactly what problem are you worried about?
>
> Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax committed via TransactionIdDidCommit.  I am
worried about concurrent truncation of clog entries causing I/O errors on SLRU lookup when performing that check.  The
three strategies I had for dealing with that were taking the XactTruncationLock (formerly known as CLogTruncationLock,
for those reading this thread from the beginning), locking out vacuum, and the idea upthread from Andres about setting
PROC_IN_VACUUM and such.  Maybe I'm being dense and don't need to worry about this.  But I haven't convinced myself of
that, yet.

I don't get it. If you've already checked that the XIDs are >=
relfrozenxid and <= ReadNewFullTransactionId(), then this shouldn't be
a problem. It could be, if CLOG is hosed, which is possible, because
if the table is corrupted, why shouldn't CLOG also be corrupted? But
I'm not sure that's what your concern is here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jul 30, 2020, at 2:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jul 30, 2020 at 4:18 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>>> Maybe I'm just being dense here -- exactly what problem are you worried about?
>>
>> Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax committed via TransactionIdDidCommit.  I
am worried about concurrent truncation of clog entries causing I/O errors on SLRU lookup when performing that check.
The three strategies I had for dealing with that were taking the XactTruncationLock (formerly known as
CLogTruncationLock, for those reading this thread from the beginning), locking out vacuum, and the idea upthread from
Andres about setting PROC_IN_VACUUM and such.  Maybe I'm being dense and don't need to worry about this.  But I haven't
convinced myself of that, yet.
>
> I don't get it. If you've already checked that the XIDs are >=
> relfrozenxid and <= ReadNewFullTransactionId(), then this shouldn't be
> a problem. It could be, if CLOG is hosed, which is possible, because
> if the table is corrupted, why shouldn't CLOG also be corrupted? But
> I'm not sure that's what your concern is here.

No, that wasn't my concern.  I was thinking about CLOG entries disappearing during the scan as a consequence of
concurrent vacuums, and the effect that would have on the validity of the cached [relfrozenxid..next_valid_xid] range.
In the absence of corruption, I don't immediately see how this would cause any problems.  But for a corrupt table, I'm
less certain how it would play out.

The kind of scenario I'm worried about may not be possible in practice.  I think it would depend on how vacuum behaves
when scanning a table that is corrupt in some way that vacuum doesn't notice, and whether vacuum could finish
scanning the table with the false belief that it has frozen all tuples with xids less than some cutoff.

I thought it would be safer if that kind of thing were not happening during verify_heapam's scan of the table.  Even if
a careful analysis proved it was not an issue with the current coding of vacuum, I don't think there is any coding
convention requiring future versions of vacuum to be hardened against corruption, so I don't see how I can rely on
vacuum not causing such problems.

I don't think this is necessarily a too-rare-to-care-about type concern, either.  If corruption across multiple tables
prevents autovacuum from succeeding, and the DBA doesn't get involved in scanning tables for corruption until the lack
of successful vacuums impacts the production system, I imagine you could end up with vacuums repeatedly happening (or
trying to happen) around the time the DBA is trying to fix tables, or perhaps drop them, or whatever, using
verify_heapam for guidance on which tables are corrupted.

Anyway, that's what I was thinking.  I was imagining that calling TransactionIdDidCommit might keep crashing the
backend while the DBA is trying to find and fix corruption, and that could get really annoying.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jul 30, 2020, at 1:47 PM, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2020-07-30 13:18:01 -0700, Mark Dilger wrote:
>> Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax committed via TransactionIdDidCommit.  I
am worried about concurrent truncation of clog entries causing I/O errors on SLRU lookup when performing that check.
The three strategies I had for dealing with that were taking the XactTruncationLock (formerly known as
CLogTruncationLock, for those reading this thread from the beginning), locking out vacuum, and the idea upthread from
Andres about setting PROC_IN_VACUUM and such.  Maybe I'm being dense and don't need to worry about this.  But I haven't
convinced myself of that, yet.
>
> I think it's not at all ok to look in the procarray or clog for xids
> that are older than what you're announcing you may read. IOW I don't
> think it's OK to just ignore the problem, or try to work around it by
> holding XactTruncationLock.

The current state of the patch is that concurrent vacuums are kept out of the table being checked by means of taking a
ShareUpdateExclusive lock on the table being checked.  In response to Robert's review, I was contemplating whether that
was necessary, but you raise the interesting question of whether it is even sufficient.  The logic in verify_heapam is
currently relying on the ShareUpdateExclusive lock to prevent any of the xids in the range relfrozenxid..nextFullXid
from being invalid arguments to TransactionIdDidCommit.  Ignoring whether that is a good choice vis-a-vis performance,
is that even a valid strategy?  It sounds like you are saying it is not.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Robert Haas
Date:
On Thu, Jul 30, 2020 at 6:10 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> No, that wasn't my concern.  I was thinking about CLOG entries disappearing during the scan as a consequence of
concurrent vacuums, and the effect that would have on the validity of the cached [relfrozenxid..next_valid_xid] range.
In the absence of corruption, I don't immediately see how this would cause any problems.  But for a corrupt table, I'm
less certain how it would play out.

Oh, hmm. I wasn't thinking about that problem. I think the only way
this can happen is if we read a page and then, before we try to look
up the XID, vacuum zooms past, finishes the whole table, and truncates
clog. But if that's possible, then it seems like it would be an issue
for SELECT as well, and it apparently isn't, or we would've done
something about it by now. I think the reason it's not possible is
because of the locking rules described in
src/backend/storage/buffer/README, which require that you hold a
buffer lock until you've determined that the tuple is visible. Since
you hold a share lock on the buffer, a VACUUM that hasn't already
processed that buffer can't freeze the tuples in that buffer; it would need an
exclusive lock on the buffer to do that. Therefore it can't finish and
truncate clog either.

Now, you raise the question of whether this is still true if the table
is corrupt, but I don't really see why that makes any difference.
VACUUM is supposed to freeze each page it encounters, to the extent
that such freezing is necessary, and with Andres's changes, it's
supposed to ERROR out if things are messed up. We can postulate a bug
in that logic, but inserting a VACUUM-blocking lock into this tool to
guard against a hypothetical vacuum bug seems strange to me. Why would
the right solution not be to fix such a bug if and when we find that
there is one?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jul 30, 2020, at 5:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jul 30, 2020 at 6:10 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> No, that wasn't my concern.  I was thinking about CLOG entries disappearing during the scan as a consequence of
concurrent vacuums, and the effect that would have on the validity of the cached [relfrozenxid..next_valid_xid] range.
In the absence of corruption, I don't immediately see how this would cause any problems.  But for a corrupt table, I'm
less certain how it would play out.
>
> Oh, hmm. I wasn't thinking about that problem. I think the only way
> this can happen is if we read a page and then, before we try to look
> up the XID, vacuum zooms past, finishes the whole table, and truncates
> clog. But if that's possible, then it seems like it would be an issue
> for SELECT as well, and it apparently isn't, or we would've done
> something about it by now. I think the reason it's not possible is
> because of the locking rules described in
> src/backend/storage/buffer/README, which require that you hold a
> buffer lock until you've determined that the tuple is visible. Since
> you hold a share lock on the buffer, a VACUUM that hasn't already
> processed that buffer can't freeze the tuples in that buffer; it would need an
> exclusive lock on the buffer to do that. Therefore it can't finish and
> truncate clog either.
>
> Now, you raise the question of whether this is still true if the table
> is corrupt, but I don't really see why that makes any difference.
> VACUUM is supposed to freeze each page it encounters, to the extent
> that such freezing is necessary, and with Andres's changes, it's
> supposed to ERROR out if things are messed up. We can postulate a bug
> in that logic, but inserting a VACUUM-blocking lock into this tool to
> guard against a hypothetical vacuum bug seems strange to me. Why would
> the right solution not be to fix such a bug if and when we find that
> there is one?

Since I can't think of a plausible concrete example of corruption which would elicit the problem I was worrying about,
I'll withdraw the argument.  But that leaves me wondering about a comment that Andres made upthread:

> On Apr 20, 2020, at 12:42 PM, Andres Freund <andres@anarazel.de> wrote:

> I don't think random interspersed uses of CLogTruncationLock are a good
> idea. If you move to only checking visibility after tuple fits into
> [relfrozenxid, nextXid), then you don't need to take any locks here, as
> long as a lock against vacuum is taken (which I think this should do
> anyway).

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Robert Haas
Date:
On Thu, Jul 30, 2020 at 9:38 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> > On Jul 30, 2020, at 5:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Jul 30, 2020 at 6:10 PM Mark Dilger
> Since I can't think of a plausible concrete example of corruption which would elicit the problem I was worrying
about, I'll withdraw the argument.  But that leaves me wondering about a comment that Andres made upthread:
 
>
> > On Apr 20, 2020, at 12:42 PM, Andres Freund <andres@anarazel.de> wrote:
>
> > I don't think random interspersed uses of CLogTruncationLock are a good
> > idea. If you move to only checking visibility after tuple fits into
> > [relfrozenxid, nextXid), then you don't need to take any locks here, as
> > long as a lock against vacuum is taken (which I think this should do
> > anyway).

The version of the patch I'm looking at doesn't seem to mention
CLogTruncationLock at all, so I'm confused about the comment. But what
it is that you are wondering about exactly?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jul 31, 2020, at 5:02 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jul 30, 2020 at 9:38 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>>> On Jul 30, 2020, at 5:53 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Thu, Jul 30, 2020 at 6:10 PM Mark Dilger
>> Since I can't think of a plausible concrete example of corruption which would elicit the problem I was worrying
about, I'll withdraw the argument.  But that leaves me wondering about a comment that Andres made upthread:
>>
>>> On Apr 20, 2020, at 12:42 PM, Andres Freund <andres@anarazel.de> wrote:
>>
>>> I don't think random interspersed uses of CLogTruncationLock are a good
>>> idea. If you move to only checking visibility after tuple fits into
>>> [relfrozenxid, nextXid), then you don't need to take any locks here, as
>>> long as a lock against vacuum is taken (which I think this should do
>>> anyway).
>
> The version of the patch I'm looking at doesn't seem to mention
> CLogTruncationLock at all, so I'm confused about the comment. But what
> it is that you are wondering about exactly?

In earlier versions of the patch, I was guarding (perhaps unnecessarily) against clog truncation, (perhaps incorrectly)
by taking the CLogTruncationLock (aka XactTruncationLock).  I thought Andres was arguing that such locks were not
necessary "as long as a lock against vacuum is taken".  That's what motivated me to remove the clog locking business and
put in the ShareUpdateExclusive lock.  I don't want to remove the ShareUpdateExclusive lock from the patch without
perhaps a clarification from Andres on the subject.  His recent reply upthread seems to still support the idea that some
kind of protection is required:

> I think it's not at all ok to look in the procarray or clog for xids
> that are older than what you're announcing you may read. IOW I don't
> think it's OK to just ignore the problem, or try to work around it by
> holding XactTruncationLock.

I don't understand that paragraph fully, in particular the part about "than what you're announcing you may read", since
the cached value of relfrozenxid is not announced; we're just assuming that as long as vacuum cannot advance it during
our scan, we should be safe checking whether xids newer than that value (and not in the future) were committed.

Andres?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Andres Freund
Date:
Hi,

On 2020-07-31 08:51:50 -0700, Mark Dilger wrote:
> In earlier versions of the patch, I was guarding (perhaps
> unnecessarily) against clog truncation, (perhaps incorrectly) by
> taking the CLogTruncationLock (aka XactTruncationLock.) .  I thought
> Andres was arguing that such locks were not necessary "as long as a
> lock against vacuum is taken".  That's what motivated me to remove the
> clog locking business and put in the ShareUpdateExclusive lock.  I
> don't want to remove the ShareUpdateExclusive lock from the patch
> without perhaps a clarification from Andres on the subject.  His
> recent reply upthread seems to still support the idea that some kind
> of protection is required:

I'm not sure what I was thinking "back then", but right now I'd argue
that the best lock against vacuum isn't a SUE, but announcing the
correct ->xmin, so you can be sure that clog entries won't be yanked out
from under you. Potentially with the right flag sets to avoid old enough
tuples being pruned.


> > I think it's not at all ok to look in the procarray or clog for xids
> > that are older than what you're announcing you may read. IOW I don't
> > think it's OK to just ignore the problem, or try to work around it by
> > holding XactTruncationLock.
> 
> I don't understand that paragraph fully, in particular the part about
> "than what you're announcing you may read", since the cached value of
> relfrozenxid is not announced; we're just assuming that as long as
> vacuum cannot advance it during our scan, that we should be safe
> checking whether xids newer than that value (and not in the future)
> were committed.

With 'announcing' I mean using the normal mechanism for avoiding the
clog being truncated for values one might look up. Which is announcing
the oldest xid one may look up in PGXACT->xmin.

Greetings,

Andres Freund



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Fri, Jul 31, 2020 at 12:33 PM Andres Freund <andres@anarazel.de> wrote:
> I'm not sure what I was thinking "back then", but right now I'd argue
> that the best lock against vacuum isn't a SUE, but announcing the
> correct ->xmin, so you can be sure that clog entries won't be yanked out
> from under you. Potentially with the right flag sets to avoid old enough
> tuples eing pruned.

Suppose we don't even do anything special in terms of advertising
xmin. What can go wrong? To have a problem, we've got to be running
concurrently with a vacuum that truncates clog. The clog truncation
must happen before our XID lookups, but vacuum has to remove the XIDs
from the heap before it can truncate. So we have to observe the XIDs
before vacuum removes them, but then vacuum has to truncate before we
look them up. But since we observe them and look them up while holding
a ShareLock on the buffer, this seems impossible. What's the flaw in
this argument?

If we do need to do something special in terms of advertising xmin,
how would you do it? Normally it happens by registering a snapshot,
but here all we would have is an XID; specifically, the value of
relfrozenxid that we observed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Andres Freund
Date:
Hi,

On 2020-07-31 12:42:51 -0400, Robert Haas wrote:
> On Fri, Jul 31, 2020 at 12:33 PM Andres Freund <andres@anarazel.de> wrote:
> > I'm not sure what I was thinking "back then", but right now I'd argue
> > that the best lock against vacuum isn't a SUE, but announcing the
> > correct ->xmin, so you can be sure that clog entries won't be yanked out
> > from under you. Potentially with the right flag sets to avoid old enough
> > tuples eing pruned.
> 
> Suppose we don't even do anything special in terms of advertising
> xmin. What can go wrong? To have a problem, we've got to be running
> concurrently with a vacuum that truncates clog. The clog truncation
> must happen before our XID lookups, but vacuum has to remove the XIDs
> from the heap before it can truncate. So we have to observe the XIDs
> before vacuum removes them, but then vacuum has to truncate before we
> look them up. But since we observe them and look them up while holding
> a ShareLock on the buffer, this seems impossible. What's the flaw in
> this argument?

The page could have been wrongly marked all-frozen. There could be
interactions between heap and toast table that are checked. Other bugs
could apply, like a broken hot chain or such.


> If we do need to do something special in terms of advertising xmin,
> how would you do it? Normally it happens by registering a snapshot,
> but here all we would have is an XID; specifically, the value of
> relfrozenxid that we observed.

An appropriate procarray or snapmgr function would probably suffice?

Greetings,

Andres Freund



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Fri, Jul 31, 2020 at 3:05 PM Andres Freund <andres@anarazel.de> wrote:
> The page could have been wrongly marked all-frozen. There could be
> interactions between heap and toast table that are checked. Other bugs
> could apply, like a broken hot chain or such.

OK, at least the first two of these do sound like problems. Not sure
about the third one.

> > If we do need to do something special in terms of advertising xmin,
> > how would you do it? Normally it happens by registering a snapshot,
> > but here all we would have is an XID; specifically, the value of
> > relfrozenxid that we observed.
>
> An appropriate procarray or snapmgr function would probably suffice?

Not sure; I guess that'll need some investigation.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jul 30, 2020, at 10:59 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> + curchunk = DatumGetInt32(fastgetattr(toasttup, 2,
> + ctx->toast_rel->rd_att, &isnull));
>
> Should we be worrying about the possibility of fastgetattr crapping
> out if the TOAST tuple is corrupted?

I think we should, but I'm not sure we should be worrying about it at this location.  If the toast index is corrupt,
systable_getnext_ordered could trip over the index corruption in the process of retrieving the toast tuple, so checking
the toast tuple only helps if the toast index does not cause a crash first.  I think the toast index should be checked
before this point, a la verify_nbtree, so that we don't need to worry about that here.  It might also make more sense to
verify the toast table a la verify_heapam prior to here, so we don't have to worry about that here either.  But that
raises questions about whose responsibility this all is.  If verify_heapam checks the toast table and toast index before
the main table, that takes care of it, but makes a mess of the idea of verify_heapam taking a start and end block, since
verifying the toast index is an all or nothing proposition, not something to be done in incremental pieces.  If we leave
verify_heapam as it is, then it is up to the caller to check the toast before the main relation, which is more flexible,
but is more complicated and requires the user to remember to do it.  We could split the difference by having
verify_heapam do nothing about toast, leaving it up to the caller, but make pg_amcheck handle it by default, making it
easier for users to not think about the issue.  Users who want to do incremental checking could still keep track of the
chunks that have already been checked, not just for the main relation, but for the toast relation, too, and give start
and end block arguments to verify_heapam for the toast table check and then again for the main table check.  That
doesn't fix the question of incrementally checking the index, though.

Looking at it a slightly different way, I think what is being checked at the point in the code you mention is the
logical structure of the toasted value related to the current main table tuple, not the lower level tuple structure of
the toast table.  We already have a function for checking a heap, namely verify_heapam, and we (or the caller, really)
should be using that.  The clean way to do things is

    verify_heapam(toast_rel)
    verify_btreeam(toast_idx)
    verify_heapam(main_rel)

and then depending on how fast and loose you want to be, you can use the start and end block arguments, which are
inherently a bit half-baked, given the lack of any way to be sure you check precisely the right range of blocks, and
also you can be fast and loose about skipping the index check or not, as you see fit.

Thoughts?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Mon, Jul 27, 2020 at 10:02 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> I'm indifferent about that change.  Done for v13.

Moving on with verification of the same index in the event of B-Tree
index corruption is a categorical mistake. verify_nbtree.c was simply
not designed to work that way.

You were determined to avoid allowing any behavior that can result in
a backend crash in the event of corruption, but this design will
defeat various measures I took to avoid crashing with corrupt data
(e.g. in commit a9ce839a313).

What's the point in not just giving up on the index (though not
necessarily the table or other indexes) at the first sign of trouble,
anyway? It makes sense for the heap structure, but not for indexes.

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Thu, Jul 30, 2020 at 10:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't really like the name, either. I get that it's probably
> inspired by Perl, but I think it should be given a less-clever name
> like report_corruption() or something.

+1 -- confess() is an awful name for this.

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Aug 2, 2020, at 8:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>
> What's the point in not just giving up on the index (though not
> necessarily the table or other indexes) at the first sign of trouble,
> anyway? It makes sense for the heap structure, but not for indexes.

The case that came to mind was an index broken by a glibc update with breaking changes to the collation sort order underlying the index.  If the breaking change has already been live in production for quite some time before a DBA notices, they might want to quantify how broken the index has been for the last however many days, not just drop and recreate the index.  I'm happy to drop that from the patch, though.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Aug 2, 2020, at 9:13 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Jul 30, 2020 at 10:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> I don't really like the name, either. I get that it's probably
>> inspired by Perl, but I think it should be given a less-clever name
>> like report_corruption() or something.
>
> +1 -- confess() is an awful name for this.

I was trying to limit unnecessary whitespace changes.  s/ereport/econfess/ leaves the function name nearly the same length, such that the following lines of indented error text don't usually get moved by pgindent.  Given the unpopularity of the name, it's not worth it, so I'll go with Robert's report_corruption, instead.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Robert Haas
Date:
On Mon, Aug 3, 2020 at 12:00 AM Peter Geoghegan <pg@bowt.ie> wrote:
> Moving on with verification of the same index in the event of B-Tree
> index corruption is a categorical mistake. verify_nbtree.c was simply
> not designed to work that way.
>
> You were determined to avoid allowing any behavior that can result in
> a backend crash in the event of corruption, but this design will
> defeat various measures I took to avoid crashing with corrupt data
> (e.g. in commit a9ce839a313).
>
> What's the point in not just giving up on the index (though not
> necessarily the table or other indexes) at the first sign of trouble,
> anyway? It makes sense for the heap structure, but not for indexes.

I agree that there's a serious design problem with Mark's patch in
this regard, but I disagree that the effort is pointless on its own
terms. You're basically postulating that users don't care how corrupt
their index is: whether there's one problem or one million problems,
it's all the same. If the user presents an index with one million
problems and we tell them about one of them, we've done our job and
can go home.

This doesn't match my experience. When an EDB customer reports
corruption, typically one of the first things I want to understand is
how widespread the problem is. This same issue came up on the thread
about relfrozenxid/relminmxid corruption. If you've got a table with
one or two rows where tuple.xmin < relfrozenxid, that's a different
kind of problem than if 50% of the tuples in the table have tuple.xmin
< relfrozenxid; the latter might well indicate that the relfrozenxid value
itself is garbage, while the former indicates that a few tuples
slipped through the cracks somehow. If you're contemplating a recovery
strategy like "nuke the affected tuples from orbit," you really need
to understand which of those cases you've got.

Granted, this is a bit less important with indexes, because in most
cases you're just going to REINDEX. But, even there, the question is
not entirely academic. For instance, consider the case of a user whose
database crashes and then fails to restart because WAL replay fails.
Typically, there is little option here but to run pg_resetwal. At this
point, you know that there is some damage, but you don't know how bad
it is. If there was little system activity at the time of the crash,
there may be only a handful of problems with the database. If there
was a heavy OLTP workload running at the time of the crash, with a
long checkpoint interval, the problems may be widespread. If the user
has done this repeatedly before bothering to contact support, which is
more common than you might suppose, the damage may be extremely
widespread.

Now, you could argue (and not unreasonably) that in any case after
something like this happens even once, the user ought to dump and
restore to get back to a known good state. However, when the cluster
is 10TB in size and there's a $100,000 financial loss for every hour
of downtime, the question naturally arises of how urgent that dump and
restore is. Can we wait until our next maintenance window? Can we at
least wait until off hours? Being able to tell the user whether
they've got a tiny bit of corruption or a whole truckload of
corruption can enable them to make better decisions in such cases, or
at least more educated ones.

Now, again, just replacing ereport(ERROR, ...) with something else
that does not abort the rest of the checks is clearly not OK. I don't
endorse that approach, or anything like it. But neither do I accept
the argument that it would be useless to report all the errors even if
we could do so safely.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Mon, Aug 3, 2020 at 11:02 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> I was trying to limit unnecessary whitespace changes.  s/ereport/econfess/ leaves the function name nearly the same
> length, such that the following lines of indented error text don't usually get moved by pgindent.  Given the unpopularity
> of the name, it's not worth it, so I'll go with Robert's report_corruption, instead.

Yeah, that's not really a good reason for something like that. I think
what you should do is drop the nbtree portion of this for now; the
length of the name then doesn't even matter at all, because all the
code in which this is used will be new code. Even if we were churning
existing code, mechanical stuff like this isn't really a huge problem
most of the time, but there's no need for that here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Mon, Aug 3, 2020 at 8:09 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I agree that there's a serious design problem with Mark's patch in
> this regard, but I disagree that the effort is pointless on its own
> terms. You're basically postulating that users don't care how corrupt
> their index is: whether there's one problem or one million problems,
> it's all the same. If the user presents an index with one million
> problems and we tell them about one of them, we've done our job and
> can go home.

It's not so much that I think that users won't care about whether any
given index is a bit corrupt or very corrupt. It's more like I don't
think that it's worth the eye-watering complexity, especially without
a real concrete goal in mind. "Counting all the errors, not just the
first" sounds like a tractable goal for the heap/table structure, but
it's just not like that with indexes. If you really wanted to do this,
you'd have to describe a practical scenario under which it made sense
to soldier on, where we'd definitely be able to count the number of
problems in a meaningful way, without much risk of either massively
overcounting or undercounting inconsistencies.

Consider how the search in verify_nbtree.c actually works at a high
level. If you thoroughly corrupted one B-Tree leaf page (let's say you
replaced it with an all-zero page image), all pages to the right of
the page would be fundamentally inaccessible to the left-to-right
level search that is coordinated within
bt_check_level_from_leftmost(). And yet, most real index scans can
still be expected to work. How do you know to skip past that one
corrupt leaf page (by going back to the parent to get the next sibling
leaf page) during index verification? That's what it would take to do
this in the general case, I guess. More fundamentally, I wonder how
many inconsistencies one should imagine that this index has, before we
even get into talking about the implementation.

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Mon, Aug 3, 2020 at 1:16 PM Peter Geoghegan <pg@bowt.ie> wrote:
> If you really wanted to do this,
> you'd have to describe a practical scenario under which it made sense
> to soldier on, where we'd definitely be able to count the number of
> problems in a meaningful way, without much risk of either massively
> overcounting or undercounting inconsistencies.

I completely agree. You have to have a careful plan to make this sort
of thing work - you want to skip checking the things that are
dependent on the part already determined to be bad, without skipping
everything. You need a strategy for where and how to restart checking,
first bypassing whatever needs to be skipped.

> Consider how the search in verify_nbtree.c actually works at a high
> level. If you thoroughly corrupted one B-Tree leaf page (let's say you
> replaced it with an all-zero page image), all pages to the right of
> the page would be fundamentally inaccessible to the left-to-right
> level search that is coordinated within
> bt_check_level_from_leftmost(). And yet, most real index scans can
> still be expected to work. How do you know to skip past that one
> corrupt leaf page (by going back to the parent to get the next sibling
> leaf page) during index verification? That's what it would take to do
> this in the general case, I guess.

In that particular example, you would want the function that verifies
that page to return some indicator. If it finds that two keys in the
page are out-of-order, it tells the caller that it can still follow
the right-link. But if it finds that the whole page is garbage, then
it tells the caller that it doesn't have a valid right-link and the
caller's got to do something else, like give up on the rest of the
checks or (better) try to recover a pointer to the next page from the
parent.
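
Just as a sketch of the shape I mean (none of these names exist in the patch today; report_corruption() is the reporting hook discussed elsewhere in this thread, and leaf_items_in_order() stands in for the real per-page checks):

typedef enum PageCheckResult
{
    PAGE_CHECK_OK,              /* no problems; right-link is fine to follow */
    PAGE_CHECK_DAMAGED,         /* problems reported, but right-link usable */
    PAGE_CHECK_UNUSABLE         /* page is garbage; caller must find the next
                                 * page some other way, e.g. via the parent */
} PageCheckResult;

static PageCheckResult
check_one_leaf_page(BtreeCheckState *state, Page page, BlockNumber blkno)
{
    /* Without a sane special area, the right-link cannot be trusted. */
    if (PageIsNew(page) ||
        PageGetSpecialSize(page) != MAXALIGN(sizeof(BTPageOpaqueData)))
    {
        report_corruption(state, blkno, "page is not a valid nbtree page");
        return PAGE_CHECK_UNUSABLE;
    }

    /* Localized problems get reported but need not stop the level scan. */
    if (!leaf_items_in_order(state, page))
    {
        report_corruption(state, blkno, "items on leaf page are out of order");
        return PAGE_CHECK_DAMAGED;
    }

    return PAGE_CHECK_OK;
}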

> More fundamentally, I wonder how
> many inconsistencies one should imagine that this index has, before we
> even get into talking about the implementation.

I think we should try not to imagine anything in particular. Just to
be clear, I am not trying to knock what you have; I know it was a lot
of work to create and it's a huge improvement over having nothing. But
in my mind, a perfect tool would do just what a human being would do
if investigating manually: assume initially that you know nothing -
the index might be totally fine, mildly corrupted in a very localized
way, completely hosed, or anything in between. And it would
systematically try to track that down by traversing the usable
pointers that it has until it runs out of things to do. It does not
seem impossible to build a tool that would allow us to take a big
index and overwrite a random subset of pages with garbage data and
have the tool tell us about all the bad pages that are still reachable
from the root by any path. If you really wanted to go crazy with it,
you could even try to find the bad pages that are not reachable from
the root, by doing a pass after the fact over all the pages that you
didn't otherwise reach. It would be a lot of work to build something
like that and maybe not the best use of time, but if I got to wave
tools into existence using my magic wand, I think that would be the
gold standard.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Tue, Aug 4, 2020 at 7:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think we should try not to imagine anything in particular. Just to
> be clear, I am not trying to knock what you have; I know it was a lot
> of work to create and it's a huge improvement over having nothing. But
> in my mind, a perfect tool would do just what a human being would do
> if investigating manually: assume initially that you know nothing -
> the index might be totally fine, mildly corrupted in a very localized
> way, completely hosed, or anything in between. And it would
> systematically try to track that down by traversing the usable
> pointers that it has until it runs out of things to do. It does not
> seem impossible to build a tool that would allow us to take a big
> index and overwrite a random subset of pages with garbage data and
> have the tool tell us about all the bad pages that are still reachable
> from the root by any path. If you really wanted to go crazy with it,
> you could even try to find the bad pages that are not reachable from
> the root, by doing a pass after the fact over all the pages that you
> didn't otherwise reach. It would be a lot of work to build something
> like that and maybe not the best use of time, but if I got to wave
> tools into existence using my magic wand, I think that would be the
> gold standard.

I guess that might be true.

With indexes you tend to have redundancy in how relationships among
pages are described. So you have siblings whose pointers must be in
agreement (left points to right, right points to left), and it's not
clear which one you should trust when they don't agree. It's not like
simple heuristics get you all that far. I really can't think of a good
one, and detecting corruption should mean detecting truly exceptional
cases. I guess you could build a model based on Bayesian methods, or
something like that. But that is very complicated, and only used when
you actually have corruption -- which is presumably extremely rare in
reality. That's very unappealing as a project.

I have always believed that the big problem is not "known unknowns".
Rather, I think that the problem is "unknown unknowns". I accept that
you have a point, especially when it comes to heap checking, but even
there the most important consideration should be to make corruption
detection thorough and cheap. The vast vast majority of databases do
not have any corruption at any given time. You're not searching for a
needle in a haystack; you're searching for a needle in many many
haystacks within a field filled with haystacks, which taken together
probably contain no needles at all. (OTOH, once you find one needle
all bets are off, and you could very well go on to find a huge number
of them.)

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Fri, Jul 31, 2020 at 12:33 PM Andres Freund <andres@anarazel.de> wrote:
> I'm not sure what I was thinking "back then", but right now I'd argue
> that the best lock against vacuum isn't a SUE, but announcing the
> correct ->xmin, so you can be sure that clog entries won't be yanked out
> from under you. Potentially with the right flag sets to avoid old enough
> tuples being pruned.

I was just thinking about this some more (and talking it over with
Mark) and I think this might actually be a really bad idea. One
problem with it is that it means that the oldest-xmin value can go
backward, which is something that I think has caused us some problems
before. There are some other cases where it can happen, and I'm not
sure that there's any necessarily fatal problem with doing it in this
case, but it would definitely be a shame if this contrib module broke
something for core in a way that was hard to fix. But let's leave that
aside and suppose that there is no fatal problem there. Essentially
what we're talking about here is advertising the table's relfrozenxid
as our xmin. How old is that likely to be? Maybe pretty old. The
default value of vacuum_freeze_table_age is 150 million transactions,
and that's just the trigger to start vacuuming; the actual value of
age(relfrozenxid) could easily be higher than that. But even if it's
only a fraction of that, it's still pretty bad. Advertising an xmin
half that old (75 million transactions) is equivalent to keeping a
snapshot open for an amount of time equal to however long it takes you
to burn through 75 million XIDs. For instance, if you burn 10 million
XIDs/hour, that's the equivalent of keeping a snapshot open for 7.5
hours. In other words, it's quite likely that doing this is going to
make VACUUM (and HOT pruning) drastically less effective throughout
the entire database cluster. To me, this seems a lot worse than just
taking ShareUpdateExclusiveLock on the table. After all,
ShareUpdateExclusiveLock will prevent VACUUM from running on that
table, but it only affects that one table rather than the whole
cluster, and it "only" stops VACUUM from running, which is still
better than having it do lots of I/O but not clean anything up.

I think I see another problem with this approach, too: it's racey. If
some other process has entered vac_update_datfrozenxid() and has
gotten past the calls to GetOldestXmin() and GetOldestMultiXactId(),
and we then advertise an older xmin (and I guess also oldestMXact) it
can still go on to update datfrozenxid/datminmxid and then truncate
the SLRUs. Even holding XactTruncationLock is insufficient to protect
against this race condition, and there doesn't seem to be any other
obvious approach, either.

So I would like to back up a minute and lay out the possible solutions
as I understand them. The specific problem here I'm talking about here
is: how do we keep from looking up an XID or MXID whose information
might have been truncated away from the relevant SLRU?

1. Take a ShareUpdateExclusiveLock on the table. This prevents VACUUM
from running concurrently on this table (which sucks), but that for
sure guarantees that the table's relfrozenxid and relminmxid can't
advance, which precludes a concurrent CLOG truncation.

2. Advertise an older xmin and minimum MXID. See above.

3. Acquire XactTruncationLock for each lookup, like pg_xact_status().
One downside here is a lot of extra lock acquisitions, but we can
mitigate that to some degree by caching the results of lookups, and by
not doing it for XIDs that are newer than our advertised xmin (which
must be OK) or at least as old as the newest XID we previously
discovered to be unsafe to look up (because those must not be OK
either). The problem case is a table with lots of different XIDs that
are all new enough to look up but older than our xmin, e.g. a table
populated using many single-row inserts. But even if we hit this case,
how bad is it really? I don't think XactTruncationLock is particularly
hot, so maybe it just doesn't matter very much. We could contend
against other sessions checking other tables, or against widespread
use of pg_xact_status(), but I think that's about it. Another downside
of this approach is that I'm not sure it does anything to help us with
the MXID case; fixing that might require building some new
infrastructure similar to XactTruncationLock but for MXIDs.

4. Provide entrypoints for looking up XIDs that fail gently instead of
throwing errors. I've got my doubts about how practical this is; if
it's easy, why didn't we do that instead of inventing
XactTruncationLock?

Maybe there are other options here, too? At the moment, I'm thinking
that (2) and (4) are just bad and so we ought to either do (3) if it
doesn't suck too much for performance (which I don't quite see why it
should, but it might) or else fall back on (1).  (1) doesn't feel
clever enough but it might be better to be not clever enough than to
be too clever.
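
For what it's worth, a rough sketch of the caching idea in (3), for the XID half of the problem only.  This is simplified and not from the patch; a real version would also have to decide where the cache lives and what to do about MXIDs.

#include "access/transam.h"
#include "storage/lwlock.h"

static TransactionId newest_unsafe_xid = InvalidTransactionId;

/*
 * Look up the commit status of an XID, returning false if the relevant
 * clog may already have been truncated away.  XIDs no older than our
 * advertised xmin are always safe; anything no newer than an XID
 * previously found to be unsafe is rejected without taking the lock again.
 */
static bool
try_xid_commit_status(TransactionId xid, TransactionId my_xmin, bool *committed)
{
    if (TransactionIdFollowsOrEquals(xid, my_xmin))
    {
        *committed = TransactionIdDidCommit(xid);
        return true;
    }

    if (TransactionIdIsValid(newest_unsafe_xid) &&
        TransactionIdPrecedesOrEquals(xid, newest_unsafe_xid))
        return false;

    LWLockAcquire(XactTruncationLock, LW_SHARED);
    if (TransactionIdPrecedes(xid, ShmemVariableCache->oldestClogXid))
    {
        /* clog for this xid may already be gone */
        LWLockRelease(XactTruncationLock);
        newest_unsafe_xid = xid;
        return false;
    }
    *committed = TransactionIdDidCommit(xid);   /* safe while lock is held */
    LWLockRelease(XactTruncationLock);
    return true;
}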

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Tue, Aug 4, 2020 at 12:00 PM Peter Geoghegan <pg@bowt.ie> wrote:
> With indexes you tend to have redundancy in how relationships among
> pages are described. So you have siblings whose pointers must be in
> agreement (left points to right, right points to left), and it's not
> clear which one you should trust when they don't agree. It's not like
> simple heuristics get you all that far. I really can't think of a good
> one, and detecting corruption should mean detecting truly exceptional
> cases. I guess you could build a model based on Bayesian methods, or
> something like that. But that is very complicated, and only used when
> you actually have corruption -- which is presumably extremely rare in
> reality. That's very unappealing as a project.

I think it might be possible to distinguish between different types of
corruption and to separate, at least to some degree, the checking
associated with each type. I think one can imagine something that
checks the structure of a btree without regard to the contents. That
is, it cares that left and right links are consistent with each other
and with downlinks from the parent level. So it checks things like the
left link of the page to which my right link points is pointing back
to me, and that's also the page to which my parent's next downlink
points. It could also verify that there's a proper tree structure,
where every page has a well-defined tree level. So you assign the root
page level 1, and each time you traverse a downlink you assign that
page a level one larger. If you ever try to assign to a page a level
unequal to the level previously assigned to it, you report that as a
problem. You can check, too, that if a page does not have a left or
right link, it's actually the last page at that level according what
you saw at the parent, grandparent, etc. levels. Finally, you can
check that all of the max-level pages you can find are leaf pages, and
the others are all internal pages. All of this structural stuff can be
verified without caring a whit about what keys you've got or what they
mean or whether there's even a heap associated with this index.
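
A sketch of the level-assignment bookkeeping for that structural pass (purely illustrative; it ignores locking and the code that actually reads the downlinks):

#include "postgres.h"

#include "storage/block.h"
#include "utils/hsearch.h"

typedef struct LevelEntry
{
    BlockNumber blkno;          /* hash key */
    uint32      level;          /* level first assigned to this block */
} LevelEntry;

/*
 * Call with (root, 1), and then with (child, parent_level + 1) for every
 * downlink traversed.  Complain if some other path reaches the same block
 * at a different level.
 */
static void
assign_level(HTAB *levels, BlockNumber blkno, uint32 level)
{
    bool        found;
    LevelEntry *entry;

    entry = (LevelEntry *) hash_search(levels, &blkno, HASH_ENTER, &found);
    if (!found)
        entry->level = level;
    else if (entry->level != level)
        ereport(WARNING,
                (errmsg("block %u assigned level %u, but was previously assigned level %u",
                        blkno, level, entry->level)));
}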

Now a second type of checking, which can also be done without regard
to keys, is checking that the TIDs in the index point to TIDs that are
on heap pages that actually exist, and that the corresponding items
are not unused, nor are they tuples which are not the root of a HOT
chain. Passing a check of this type doesn't prove that the index and
heap are consistent, but failing it proves that they are inconsistent.
This kind of check can be done on every leaf index page you can find
by any means even if it fails the structural checks described above.
Failure of these checks on one page does not preclude checking the
same invariants for other pages. Let's call this kind of thing "basic
index-heap sanity checking."
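
A sketch of what one such per-TID check might look like, with locking and error handling pared down to the bare minimum; the bufmgr/bufpage primitives here are real, but the function itself is made up:

#include "access/htup_details.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"
#include "utils/rel.h"

/*
 * Does this index TID point at a heap page and line pointer that could
 * plausibly be the root of a (possibly HOT) update chain?  Failure proves
 * the index and heap are inconsistent; success proves nothing.
 */
static bool
index_tid_looks_sane(Relation heaprel, ItemPointer tid)
{
    BlockNumber blkno = ItemPointerGetBlockNumber(tid);
    OffsetNumber offnum = ItemPointerGetOffsetNumber(tid);
    Buffer      buf;
    Page        page;
    bool        ok = true;

    if (blkno >= RelationGetNumberOfBlocks(heaprel))
        return false;           /* points past the end of the heap */

    buf = ReadBuffer(heaprel, blkno);
    LockBuffer(buf, BUFFER_LOCK_SHARE);
    page = BufferGetPage(buf);

    if (offnum < FirstOffsetNumber || offnum > PageGetMaxOffsetNumber(page))
        ok = false;             /* no such line pointer on that page */
    else
    {
        ItemId      itemid = PageGetItemId(page, offnum);

        if (!ItemIdIsUsed(itemid))
            ok = false;         /* index points at an unused slot */
        else if (ItemIdIsNormal(itemid))
        {
            HeapTupleHeader htup = (HeapTupleHeader) PageGetItem(page, itemid);

            if (HeapTupleHeaderIsHeapOnly(htup))
                ok = false;     /* must point at the root of its HOT chain */
        }
    }

    UnlockReleaseBuffer(buf);
    return ok;
}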

A third type of checking is to verify the relationship between the
index keys within and across the index pages: are the keys actually in
order within a page, and are they in order across pages? The first
part of this can be checked individually for each page pretty much no
matter what other problems we may have; we only have to abandon this
checking for a particular page if it's total garbage and we cannot
identify any index items on the page at all. The second part, though,
has the problem you mention. I think the solution is to skip the
second part of the check for any pages that failed related structural
checks. For example, if my right sibling thinks that I am not its left
sibling, or my right sibling and I agree that we are siblings but do
not agree on who our parent is, or if that parent does not agree that
we have the same sibling relationship that we think we have, then we
should report that problem and forget about issuing any complaints
about the relationship between my key space and that sibling's key
space. The internal consistency of each page with respect to key
ordering can still be verified, though, and it's possible that my key
space can be validly compared to the key space of my other sibling, if
the structural checks pass on that side.

A fourth type of checking is to verify the index key against the keys
in the heap tuples to which they point, but only for index tuples that
passed the basic index-heap sanity checking and where the tuples have
not been pruned. This can be sensibly done even if the structural
checks or index-ordering checks have failed.

I don't mean to suggest that one would implement all of these things
as separate phases; that would be crazy expensive, and what if things
changed by the time you visit the page? Rather, the checks likely
ought to be interleaved, just keeping track internally of which things
need to be skipped because prerequisite checks have already failed.

Aside from providing a way to usefully continue after errors, this
would also be useful in certain scenarios where you want to know what
kind of corruption you have. For example, suppose that I start getting
wrong answers from index lookups on a particular index. Upon
investigation, it turns out that my last glibc update changed my OS
collation definitions for the collation I'm using, and therefore it is
to be expected that some of my keys may appear to be out of order with
respect to the new definitions. Now what I really want to know before
running REINDEX is that this is the only problem I have. It would be
amazing if I could run the tool and have it give me a list of problems
so that I could confirm that I have only index-ordering problems, not
any other kind, and even more amazing if it could tell me the specific
keys that were affected so that I could understand exactly how the
sorting behavior changed. If I were to discover that my index also has
structural problems or inconsistencies with the heap, then I'd know
that it couldn't be right to blame it only on the collation update;
something else has gone wrong.

I'm speaking here with fairly limited knowledge of the details of how
all this actually works and, again, I'm not trying to suggest that you
or anyone is obligated to do any work on this, or that it would be
easy to accomplish or worth the time it took. I'm just trying to
sketch out what I see as maybe being theoretically possible, and why I
think it would be useful if it did.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Tue, Aug 4, 2020 at 9:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I think it might be possible to distinguish between different types of
> corruption and to separate, at least to some degree, the checking
> associated with each type. I think one can imagine something that
> checks the structure of a btree without regard to the contents. That
> is, it cares that left and right links are consistent with each other
> and with downlinks from the parent level. So it checks things like the
> left link of the page to which my right link points is pointing back
> to me, and that's also the page to which my parent's next downlink
> points.

I think that this kind of phased approach to B-Tree verification is
possible, more or less, but hard to justify. And it seems impossible
to do with only an AccessShareLock.

It's not clear that what you describe is much better than just
checking a bunch of indexes and seeing what patterns emerge. For
example, the involvement of collated text might be a common factor
across indexes. That kind of pattern is the first thing that I look
for, and often the only thing. It also serves to give me an idea of
how messed up things are. There are not that many meaningful degrees
of messed-up with indexes in my experience. The first error really
does tell you most of what you need to know about any given corrupt
index. Kind of like how you can bucket the number of cockroaches in
your home into perhaps three meaningful buckets: 0 cockroaches, at
least 1 cockroach, and lots of cockroaches. (Even there, if you really
care about the distinction between the second and third bucket,
something has gone terribly wrong -- so even three buckets seems like
a lot to me.)

FWIW, current DEBUG1 + DEBUG2 output for amcheck shows you quite a lot
of details about the tree structure. It's a handy way of getting a
sense of what's going on at a high level. For example, if index
corruption is found very early on, that strongly suggests that it's
pretty pervasive.

> Now a second type of checking, which can also be done without regard
> to keys, is checking that the TIDs in the index point to TIDs that are
> on heap pages that actually exist, and that the corresponding items
> are not unused, nor are they tuples which are not the root of a HOT
> chain. Passing a check of this type doesn't prove that the index and
> heap are consistent, but failing it proves that they are inconsistent.
> This kind of check can be done on every leaf index page you can find
> by any means even if it fails the structural checks described above.
> Failure of these checks on one page does not preclude checking the
> same invariants for other pages. Let's call this kind of thing "basic
> index-heap sanity checking."

One real weakness in the current code is our inability to detect index
tuples that are in the correct order and so on, but point to the wrong
thing -- we can detect that if it manifests itself as the absence of
an index tuple that should be in the index (when you use
heapallindexed verification), but we cannot *reliably* detect the
presence of an index tuple that shouldn't be in the index at all
(though in practice it probably mostly gets caught).

The checks on the tree structure itself are excellent with
bt_index_parent_check() following Alexander's commit d114cc53 (which I
thought was really excellent work). But we still have that one
remaining blind spot in verify_nbtree.c, even when you opt in to every
possible type of verification (i.e. bt_index_parent_check() with all
options). I'd much rather fix that, or help with the new heap checker
stuff.

> A fourth type of checking is to verify the index key against the keys
> in the heap tuples to which they point, but only for index tuples that
> passed the basic index-heap sanity checking and where the tuples have
> not been pruned. This can be sensibly done even if the structural
> checks or index-ordering checks have failed.

That's going to require the equivalent of a merge join, which is
terribly expensive relative to such a small benefit.

> Aside from providing a way to usefully continue after errors, this
> would also be useful in certain scenarios where you want to know what
> kind of corruption you have. For example, suppose that I start getting
> wrong answers from index lookups on a particular index. Upon
> investigation, it turns out that my last glibc update changed my OS
> collation definitions for the collation I'm using, and therefore it is
> to be expected that some of my keys may appear to be out of order with
> respect to the new definitions. Now what I really want to know before
> running REINDEX is that this is the only problem I have. It would be
> amazing if I could run the tool and have it give me a list of problems
> so that I could confirm that I have only index-ordering problems, not
> any other kind, and even more amazing if it could tell me the specific
> keys that were affected so that I could understand exactly how the
> sorting behavior changed.

This detail seems really hard. There are probably lots of cases where
the sorting behavior changed but it just didn't affect you, given the
data you had -- it just so happened that you didn't have exactly the
wrong kind of diacritic mark or whatever. After all, revisions to how
strings in a given natural language are supposed to sort are likely to
be relatively rare and relatively obscure (even among people that
speak the language in question). Also, the task of figuring out if the
tuple to the left or the right is in the wrong order seems kind of
daunting.

Meanwhile, a simple smoke test covering many indexes probably gives
you a fairly meaningful idea of the extent of the damage, without
requiring that we do any hard engineering work.

> I'm speaking here with fairly limited knowledge of the details of how
> all this actually works and, again, I'm not trying to suggest that you
> or anyone is obligated to do any work on this, or that it would be
> easy to accomplish or worth the time it took. I'm just trying to
> sketch out what I see as maybe being theoretically possible, and why I
> think it would be useful if it did.

I don't think that your relatively limited knowledge of the B-Tree
code is an issue here -- your intuitions seem pretty reasonable. I
appreciate your perspective here. Corruption detection presents us
with some odd qualitative questions of the kind that are just awkward
to discuss. Discouraging perspectives that don't quite match my own
would be quite counterproductive.

That having been said, I suspect that this is a huge task for a small
benefit. It's exceptionally hard to test because you have lots of
non-trivial code that only gets used in circumstances that by
definition should never happen. If users really needed to recover the
data in the index then maybe it would happen -- but they don't.

The biggest problem that amcheck currently has is that it isn't used
enough, because it isn't positioned as a general purpose tool at all.
I'm hoping that the work from Mark helps with that.

--
Peter Geoghegan



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Tue, Aug 4, 2020 at 9:06 PM Peter Geoghegan <pg@bowt.ie> wrote:
> of messed-up with indexes in my experience. The first error really
> does tell you most of what you need to know about any given corrupt
> index. Kind of like how you can bucket the number of cockroaches in
> your home into perhaps three meaningful buckets: 0 cockroaches, at
> least 1 cockroach, and lots of cockroaches. (Even there, if you really
> care about the distinction between the second and third bucket,
> something has gone terribly wrong -- so even three buckets seems like
> a lot to me.)

Not sure I agree with this. As a homeowner, the distinction between 0
and 1 is less significant to me than the distinction between a few
(preferably in places where I'll never see them) and whole lot. I
agree with you to an extent though: all I really care about is whether
I have too few to worry about, enough that I'd better try to take care
of it somehow, or so many that I need a professional exterminator. If,
however, I were a professional exterminator, I would be unhappy with
just knowing that there are few problems or many. I suspect I would
want to know something about where the problems were, and get a more
nuanced indication of just how bad things are in each location.

FWIW, pg_catcheck is an example of an existing tool (designed by me
and written partially by me) that uses the kind of model I'm talking
about. It does a single SELECT * FROM pg_<whatever> on each catalog
table - so that it doesn't get confused if your system catalog indexes
are messed up - and then performs a bunch of cross-checks on the
tuples it gets back and tells you about all the messed up stuff. If it
can't get data from all your catalog tables it performs whichever
checks are valid given what data it was able to get. As a professional
exterminator of catalog corruption, I find it quite helpful. If
someone sends me the output from a database cluster, I can tell right
away whether they are just fine, in a little bit of trouble, or in a
whole lot of trouble; I can speculate pretty well about what kind of
thing might've happened to cause the problem; and I can recommend
steps to straighten things out.

> FWIW, current DEBUG1 + DEBUG2 output for amcheck shows you quite a lot
> of details about the tree structure. It's a handy way of getting a
> sense of what's going on at a high level. For example, if index
> corruption is found very early on, that strongly suggests that it's
> pretty pervasive.

Interesting.

> > A fourth type of checking is to verify the index key against the keys
> > in the heap tuples to which they point, but only for index tuples that
> > passed the basic index-heap sanity checking and where the tuples have
> > not been pruned. This can be sensibly done even if the structural
> > checks or index-ordering checks have failed.
>
> That's going to require the equivalent of a merge join, which is
> terribly expensive relative to such a small benefit.

I think it depends on how big your data is. If you've got a 2TB table
and 512GB of RAM, it's pretty impractical no matter the algorithm. But
for small tables even a naive nested loop will suffice.

> Meanwhile, a simple smoke test covering many indexes probably gives
> you a fairly meaningful idea of the extent of the damage, without
> requiring that we do any hard engineering work.

In my experience, when EDB customers complain about corruption-related
problems, the two most common patterns are: (1) my whole system is
messed up and (2) I have one or a few specific objects which are
messed up and everything else is fine. The first category is often
something like inability to start the database, or scary messages in
the log file complaining about, say, checkpoints failing. The second
category is the one I'm worried about here. The people who are in this
category generally already know which things are broken; they've
figured that out through trial and error. Sometimes they miss some
problems, but more frequently, in my experience, their understanding
of what problems they have is accurate. Now that category of users can
be further decomposed into two groups: the people who don't care what
happened and just want to barrel through it, and the people who do
care what happened and want to know what happened, why it happened,
whether it's a bug, etc. The first group are unproblematic: tell them
to REINDEX (or restore from backup, or whatever) and you're done.

The second group is a lot harder. It is in general difficult to
speculate about how something that is now wrong got that way given
knowledge only of the present state of affairs. But good tooling makes
it easier to speculate intelligently. To take a classic example,
there's a great difference between a checksum failure caused by the
checksum being incorrect on an otherwise-valid page; a checksum
failure on a page the first half of which appears valid and the second
half of which looks like it might be some other database page; and a
checksum failure on a page whose contents appear to be taken from a
Microsoft Word document. I'm not saying we ever want a tool which
tries to figure that sort of thing out in an automated way; there's no
substitute for human intelligence (yet, anyway). But, the more the
tools we do have localize the problems to particular pages or tuples
and describe them accurately, the easier it is to do manual
investigation as follow-up, when it's necessary.

> That having been said, I suspect that this is a huge task for a small
> benefit. It's exceptionally hard to test because you have lots of
> non-trivial code that only gets used in circumstances that by
> definition should never happen. If users really needed to recover the
> data in the index then maybe it would happen -- but they don't.

Yep, that's a very key difference as compared to the heap.

> The biggest problem that amcheck currently has is that it isn't used
> enough, because it isn't positioned as a general purpose tool at all.
> I'm hoping that the work from Mark helps with that.

Agreed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Wed, Aug 5, 2020 at 7:09 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Not sure I agree with this. As a homeowner, the distinction between 0
> and 1 is less significant to me than the distinction between a few
> (preferably in places where I'll never see them) and whole lot. I
> agree with you to an extent though: all I really care about is whether
> I have too few to worry about, enough that I'd better try to take care
> of it somehow, or so many that I need a professional exterminator. If,
> however, I were a professional exterminator, I would be unhappy with
> just knowing that there are few problems or many. I suspect I would
> want to know something about where the problems were, and get a more
> nuanced indication of just how bad things are in each location.

Right, but the professional exterminator can be expected to use expert
level tools, where a great deal of technical sophistication is
required to interpret what's going on sensibly. An amateur can only
use them to determine if something is wrong at all, which is usually
not how they add value.

(I think that my analogy is slightly flawed in that it hinged upon
everybody hating cockroaches as much as I do, which is more than the
ordinary amount.)

> FWIW, pg_catcheck is an example of an existing tool (designed by me
> and written partially by me) that uses the kind of model I'm talking
> about. It does a single SELECT * FROM pg_<whatever> on each catalog
> table - so that it doesn't get confused if your system catalog indexes
> are messed up - and then performs a bunch of cross-checks on the
> tuples it gets back and tells you about all the messed up stuff. If it
> can't get data from all your catalog tables it performs whichever
> checks are valid given what data it was able to get. As a professional
> exterminator of catalog corruption, I find it quite helpful.

I myself seem to have had quite different experiences with corruption,
presumably because it happened at product companies like Heroku. I
tended to find software bugs (e.g. the one fixed by commit 008c4135)
that were rare and novel by casting a wide net over a large number of
relatively homogenous databases. Whereas your experiences tend to
involve large support customers with more opportunity for operator
error. Both perspectives are important.

> The second group is a lot harder. It is in general difficult to
> speculate about how something that is now wrong got that way given
> knowledge only of the present state of affairs. But good tooling makes
> it easier to speculate intelligently. To take a classic example,
> there's a great difference between a checksum failure caused by the
> checksum being incorrect on an otherwise-valid page; a checksum
> failure on a page the first half of which appears valid and the second
> half of which looks like it might be some other database page; and a
> checksum failure on a page whose contents appear to be taken from a
> Microsoft Word document. I'm not saying we ever want a tool which
> tries to figure that sort of thing out in an automated way; there's no
> substitute for human intelligence (yet, anyway).

I wrote my own expert level tool, pg_hexedit. I have to admit that the
level of interest in that tool doesn't seem to be all that great,
though I myself have used it to investigate corruption to great
effect. But I suppose there is no way to know how it's being used.

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Wed, Aug 5, 2020 at 4:36 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Right, but the professional exterminator can be expected to use expert
> level tools, where a great deal of technical sophistication is
> required to interpret what's going on sensibly. An amateur can only
> use them to determine if something is wrong at all, which is usually
> not how they add value.

Quite true.

> I myself seem to have had quite different experiences with corruption,
> presumably because it happened at product companies like Heroku. I
> tended to find software bugs (e.g. the one fixed by commit 008c4135)
> that were rare and novel by casting a wide net over a large number of
> relatively homogenous databases. Whereas your experiences tend to
> involve large support customers with more opportunity for operator
> error. Both perspectives are important.

I concur.

> I wrote my own expert level tool, pg_hexedit. I have to admit that the
> level of interest in that tool doesn't seem to be all that great,
> though I myself have used it to investigate corruption to great
> effect. But I suppose there is no way to know how it's being used.

I admit not to having tried pg_hexedit, but I doubt that it would help
me very much outside of my own development work. The problem is that
in a typical case I am trying to help someone in a professional
capacity without access to their machines, and without knowledge of
their environment or data. Moreover, sometimes the person I'm trying
to help is an unreliable narrator. I can ask people to run tools they
have and send the output, and then I can look at that output and tell
them what to do next. But it has to be a tool they have (or they can
easily get) and it can't involve any complicated if-then stuff.
Something like "if the page is totally garbled then do X but if it
looks mostly OK then do Y" is radically out of reach. They have no
clue about that. Hence my interest in tools that automate as much of
the investigation as may be practical.

We're probably beating this topic to death at this point; I don't
think we are really in any sort of meaningful disagreement, and the
next steps in this particular case seem clear enough.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Amul Sul
Date:
On Thu, Jul 30, 2020 at 11:29 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jul 27, 2020 at 1:02 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
> > Not at all!  I appreciate all the reviews.
>
> Reviewing 0002, reading through verify_heapam.c:
>
> +typedef enum SkipPages
> +{
> + SKIP_ALL_FROZEN_PAGES,
> + SKIP_ALL_VISIBLE_PAGES,
> + SKIP_PAGES_NONE
> +} SkipPages;
>
> This looks inconsistent. Maybe just start them all with SKIP_PAGES_.
>
> + if (PG_ARGISNULL(0))
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("missing required parameter for 'rel'")));
>
> This doesn't look much like other error messages in the code. Do
> something like git grep -A4 PG_ARGISNULL | grep -A3 ereport and study
> the comparables.
>
> + ereport(ERROR,
> + (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> + errmsg("unrecognized parameter for 'skip': %s", skip),
> + errhint("please choose from 'all-visible', 'all-frozen', or 'none'")));
>
> Same problem. Check pg_prewarm's handling of the prewarm type, or
> EXPLAIN's handling of the FORMAT option, or similar examples. Read the
> message style guidelines concerning punctuation of hint and detail
> messages.
>
> + * Bugs in pg_upgrade are reported (see commands/vacuum.c circa line 1572)
> + * to have sometimes rendered the oldest xid value for a database invalid.
> + * It seems unwise to report rows as corrupt for failing to be newer than
> + * a value which itself may be corrupt.  We instead use the oldest xid for
> + * the entire cluster, which must be at least as old as the oldest xid for
> + * our database.
>
> This kind of reference to another comment will not age well; line
> numbers and files change a lot. But I think the right thing to do here
> is just rely on relfrozenxid and relminmxid. If the table is
> inconsistent with those, then something needs fixing. datfrozenxid and
> the cluster-wide value can look out for themselves. The corruption
> detector shouldn't be trying to work around any bugs in setting
> relfrozenxid itself; such problems are arguably precisely what we're
> here to find.
>
> +/*
> + * confess
> + *
> + *   Return a message about corruption, including information
> + *   about where in the relation the corruption was found.
> + *
> + *   The msg argument is pfree'd by this function.
> + */
> +static void
> +confess(HeapCheckContext *ctx, char *msg)
>
> Contrary to what the comments say, the function doesn't return a
> message about corruption or anything else. It returns void.
>
> I don't really like the name, either. I get that it's probably
> inspired by Perl, but I think it should be given a less-clever name
> like report_corruption() or something.
>
> + * corrupted table from using workmem worth of memory building up the
>
> This kind of thing destroys grep-ability. If you're going to refer to
> work_mem, you gotta spell it the same way we do everywhere else.
>
> + * Helper function to construct the TupleDesc needed by verify_heapam.
>
> Instead of saying it's the TupleDesc somebody needs, how about saying
> that it's the TupleDesc that we'll use to report problems that we find
> while scanning the heap, or something like that?
>
> + * Given a TransactionId, attempt to interpret it as a valid
> + * FullTransactionId, neither in the future nor overlong in
> + * the past.  Stores the inferred FullTransactionId in *fxid.
>
> It really doesn't, because there's no such thing as 'fxid' referenced
> anywhere here. You should really make the effort to proofread your
> patches before posting, and adjust comments and so on as you go.
> Otherwise reviewing takes longer, and if you keep introducing new
> stuff like this as you fix other stuff, you can fail to ever produce a
> committable patch.
>
> + * Determine whether tuples are visible for verification.  Similar to
> + *  HeapTupleSatisfiesVacuum, but with critical differences.
>
> Not accurate, because it also reports problems, which is not mentioned
> anywhere in the function header comment that purports to be a detailed
> description of what the function does.
>
> + else if (TransactionIdIsCurrentTransactionId(raw_xmin))
> + return true; /* insert or delete in progress */
> + else if (TransactionIdIsInProgress(raw_xmin))
> + return true; /* HEAPTUPLE_INSERT_IN_PROGRESS */
> + else if (!TransactionIdDidCommit(raw_xmin))
> + {
> + return false; /* HEAPTUPLE_DEAD */
> + }
>
> One of these cases is not punctuated like the others.
>
> + pstrdup("heap tuple with XMAX_IS_MULTI is neither LOCKED_ONLY nor
> has a valid xmax"));
>
> 1. I don't think that's very grammatical.
>
> 2. Why abbreviate HEAP_XMAX_IS_MULTI to XMAX_IS_MULTI and
> HEAP_XMAX_IS_LOCKED_ONLY to LOCKED_ONLY? I don't even think you should
> be referencing C constant names here at all, and if you are I don't
> think you should abbreviate, and if you do abbreviate I don't think
> you should omit different numbers of words depending on which constant
> it is.
>
> I wonder what the intended division of responsibility is here,
> exactly. It seems like you've ended up with some sanity checks in
> check_tuple() before tuple_is_visible() is called, and others in
> tuple_is_visible() proper. As far as I can see the comments don't
> really discuss the logic behind the split, but there's clearly a close
> relationship between the two sets of checks, even to the point where
> you have "heap tuple with XMAX_IS_MULTI is neither LOCKED_ONLY nor has
> a valid xmax" in tuple_is_visible() and "tuple xmax marked
> incompatibly as keys updated and locked only" in check_tuple(). Now,
> those are not the same check, but they seem like closely related
> things, so it's not ideal that they happen in different functions with
> differently-formatted messages to report problems and no explanation
> of why it's different.
>
> I think it might make sense here to see whether you could either move
> more stuff out of tuple_is_visible(), so that it really just checks
> whether the tuple is visible, or move more stuff into it, so that it
> has the job not only of checking whether we should continue with
> checks on the tuple contents but also complaining about any other
> visibility problems. Or if neither of those make sense then there
> should be a stronger attempt to rationalize in the comments what
> checks are going where and for what reason, and also a stronger
> attempt to rationalize the message wording.
>
> + curchunk = DatumGetInt32(fastgetattr(toasttup, 2,
> + ctx->toast_rel->rd_att, &isnull));
>
> Should we be worrying about the possibility of fastgetattr crapping
> out if the TOAST tuple is corrupted?
>
> + if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len)
> + {
> + confess(ctx,
> + psprintf("tuple attribute should start at offset %u, but tuple
> length is only %u",
> + ctx->tuphdr->t_hoff + ctx->offset, ctx->lp_len));
> + return false;
> + }
> +
> + /* Skip null values */
> + if (infomask & HEAP_HASNULL && att_isnull(ctx->attnum, ctx->tuphdr->t_bits))
> + return true;
> +
> + /* Skip non-varlena values, but update offset first */
> + if (thisatt->attlen != -1)
> + {
> + ctx->offset = att_align_nominal(ctx->offset, thisatt->attalign);
> + ctx->offset = att_addlength_pointer(ctx->offset, thisatt->attlen,
> + tp + ctx->offset);
> + return true;
> + }
>
> This looks like it's not going to complain about a fixed-length
> attribute that overruns the tuple length. There's code further down
> that handles that case for a varlena attribute, but there's nothing
> comparable for the fixed-length case.
>
> + confess(ctx,
> + psprintf("%s toast at offset %u is unexpected",
> + va_tag == VARTAG_INDIRECT ? "indirect" :
> + va_tag == VARTAG_EXPANDED_RO ? "expanded" :
> + va_tag == VARTAG_EXPANDED_RW ? "expanded" :
> + "unexpected",
> + ctx->tuphdr->t_hoff + ctx->offset));
>
> I suggest "unexpected TOAST tag %d", without trying to convert to a
> string. Such a conversion will likely fail in the case of genuine
> corruption, and isn't meaningful even if it works.
>
> Again, let's try to standardize terminology here: most of the messages
> in this function are now of the form "tuple attribute %d has some
> problem" or "attribute %d has some problem", but some have neither.
> Since we're separately returning attnum I don't see why it should be
> in the message, and if we weren't separately returning attnum then it
> ought to be in the message the same way all the time, rather than
> sometimes writing "attribute" and other times "tuple attribute".
>
> + /* Check relminmxid against mxid, if any */
> + xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr);
> + if (infomask & HEAP_XMAX_IS_MULTI &&
> + MultiXactIdPrecedes(xmax, ctx->relminmxid))
> + {
> + confess(ctx,
> + psprintf("tuple xmax %u precedes relminmxid %u",
> + xmax, ctx->relminmxid));
> + fatal = true;
> + }
>
> There are checks that an XID is neither too old nor too new, and
> presumably something similar could be done for MultiXactIds, but here
> you only check one end of the range. Seems like you should check both.
>
> + /* Check xmin against relfrozenxid */
> + xmin = HeapTupleHeaderGetXmin(ctx->tuphdr);
> + if (TransactionIdIsNormal(ctx->relfrozenxid) &&
> + TransactionIdIsNormal(xmin))
> + {
> + if (TransactionIdPrecedes(xmin, ctx->relfrozenxid))
> + {
> + confess(ctx,
> + psprintf("tuple xmin %u precedes relfrozenxid %u",
> + xmin, ctx->relfrozenxid));
> + fatal = true;
> + }
> + else if (!xid_valid_in_rel(xmin, ctx))
> + {
> + confess(ctx,
> + psprintf("tuple xmin %u follows last assigned xid %u",
> + xmin, ctx->next_valid_xid));
> + fatal = true;
> + }
> + }
>
> Here you do check both ends of the range, but the comment claims
> otherwise. Again, please proof-read for this kind of stuff.
>
> + /* Check xmax against relfrozenxid */
>
> Ditto here.
>
> + psprintf("tuple's header size is %u bytes which is less than the %u
> byte minimum valid header size",
>
> I suggest: tuple data begins at byte %u, but the tuple header must be
> at least %u bytes
>
> + psprintf("tuple's %u byte header size exceeds the %u byte length of
> the entire tuple",
>
> I suggest: tuple data begins at byte %u, but the entire tuple length
> is only %u bytes
>
> + psprintf("tuple's user data offset %u not maximally aligned to %u",
>
> I suggest: tuple data begins at byte %u, but that is not maximally aligned
> Or: tuple data begins at byte %u, which is not a multiple of %u
>
> That makes the messages look much more similar to each other
> grammatically and is more consistent about calling things by the same
> names.
>
> + psprintf("tuple with null values has user data offset %u rather than
> the expected offset %u",
> + psprintf("tuple without null values has user data offset %u rather
> than the expected offset %u",
>
> I suggest merging these: tuple data offset %u, but expected offset %u
> (%u attributes, %s)
> where %s is either "has nulls" or "no nulls"
>
> In fact, aren't several of the above checks redundant with this one?
> Like, why check for a value less than SizeofHeapTupleHeader or that's
> not properly aligned first? Just check this straightaway and call it
> good.
>
> + * If we get this far, the tuple is visible to us, so it must not be
> + * incompatible with our relDesc.  The natts field could be legitimately
> + * shorter than rel's natts, but it cannot be longer than rel's natts.
>
> This is yet another case where you didn't update the comments.
> tuple_is_visible() now checks whether the tuple is visible to anyone,
> not whether it's visible to us, but the comment doesn't agree. In some
> sense I think this comment is redundant with the previous one anyway,
> because that one already talks about the tuple being visible. Maybe
> just write: The tuple is visible, so it must be compatible with the
> current version of the relation descriptor. It might have fewer
> columns than are present in the relation descriptor, but it cannot
> have more.
>
> + psprintf("tuple has %u attributes in relation with only %u attributes",
> + ctx->natts,
> + RelationGetDescr(ctx->rel)->natts));
>
> I suggest: tuple has %u attributes, but relation has only %u attributes
>
> + /*
> + * Iterate over the attributes looking for broken toast values. This
> + * roughly follows the logic of heap_deform_tuple, except that it doesn't
> + * bother building up isnull[] and values[] arrays, since nobody wants
> + * them, and it unrolls anything that might trip over an Assert when
> + * processing corrupt data.
> + */
> + ctx->offset = 0;
> + for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++)
> + {
> + if (!check_tuple_attribute(ctx))
> + break;
> + }
>
> I think this comment is too wordy. This text belongs in the header
> comment of check_tuple_attribute(), not at the place where it gets
> called. Otherwise, as you update what check_tuple_attribute() does,
> you have to remember to come find this comment and fix it to match,
> and you might forget to do that. In fact... looks like that already
> happened, because check_tuple_attribute() now checks more than broken
> TOAST attributes. Seems like you could just simplify this down to
> something like "Now check each attribute." Also, you could lose the
> extra braces.
>
> - bt_index_check |             relname             | relpages
> + bt_index_check |             relname             | relpages
>
> Don't include unrelated changes in the patch.
>
> I'm not really sure that the list of fields you're displaying for each
> reported problem really makes sense. I think the theory here should be
> that we want to report the information that the user needs to localize
> the problem but not everything that they could find out from
> inspecting the page, and not things that are too specific to
> particular classes of errors. So I would vote for keeping blkno,
> offnum, and attnum, but I would lose lp_flags, lp_len, and chunk.
> lp_off feels like it's a more arguable case: technically, it's a
> locator for the problem, because it gives you the byte offset within
> the page, but normally we reference tuples by TID, i.e. (blkno,
> offset), not byte offset. On balance I'd be inclined to omit it.
>
> --

In addition to this, I found a few more things while reading the v13 patch, as
below:

Patch v13-0001:

-
+#include "amcheck.h"

Not in correct order.


+typedef struct BtreeCheckContext
+{
+ TupleDesc tupdesc;
+ Tuplestorestate *tupstore;
+ bool is_corrupt;
+ bool on_error_stop;
+} BtreeCheckContext;

Unnecessary spaces/tabs between } and BtreeCheckContext.


 static void bt_index_check_internal(Oid indrelid, bool parentcheck,
- bool heapallindexed, bool rootdescend);
+ bool heapallindexed, bool rootdescend,
+ BtreeCheckContext * ctx);

Unnecessary space between * and ctx. The same changes needed for other places as
well.
---

Patch v13-0002:

+-- partitioned tables (the parent ones) don't have visibility maps
+create table test_partitioned (a int, b text default repeat('x', 5000))
+ partition by list (a);
+-- these should all fail
+select * from verify_heapam('test_partitioned',
+ on_error_stop := false,
+ skip := NULL,
+ startblock := NULL,
+ endblock := NULL);
+ERROR:  "test_partitioned" is not a table, materialized view, or TOAST table
+create table test_partition partition of test_partitioned for values in (1);
+create index test_index on test_partition (a);

Can't we make it work? If the input is partitioned, I think we could
collect all its leaf partitions and process them one by one. Thoughts?
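
Roughly what I have in mind, as an untested sketch (the find_all_inheritors()
call and the relkind filter here are only illustrative, not taken from the patch):

    if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
    {
        List     *children = find_all_inheritors(RelationGetRelid(rel),
                                                 AccessShareLock, NULL);
        ListCell *lc;

        foreach(lc, children)
        {
            Oid     childoid = lfirst_oid(lc);

            /* only leaf partitions have storage to check */
            if (get_rel_relkind(childoid) != RELKIND_RELATION)
                continue;

            /* ... run the usual per-relation checks on childoid ... */
        }
    }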


+ ctx->chunkno++;

Instead of incrementing it in check_toast_tuple(), I think the increment should
happen at the caller -- just after the check_toast_tuple() call.
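
I.e., roughly like this (sketch only; the surrounding scan loop is illustrative,
not copied from the patch):

    ctx->chunkno = 0;
    while ((toasttup = systable_getnext_ordered(toastscan,
                                                ForwardScanDirection)) != NULL)
    {
        check_toast_tuple(toasttup, ctx);
        ctx->chunkno++;     /* advance here rather than inside check_toast_tuple() */
    }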
---

Patch v13-0003:

+ resetPQExpBuffer(query);
+ destroyPQExpBuffer(query);

resetPQExpBuffer() will be unnecessary if the next call is destroyPQExpBuffer().


+ appendPQExpBuffer(query,
+   "SELECT c.relname, v.blkno, v.offnum, v.lp_off, "
+   "v.lp_flags, v.lp_len, v.attnum, v.chunk, v.msg"
+   "\nFROM verify_heapam(rel := %u, on_error_stop := %s, "
+   "skip := %s, startblock := %s, endblock := %s) v, "
+   "pg_class c"
+   "\nWHERE c.oid = %u",
+   tbloid, stop, skip, settings.startblock,
+   settings.endblock, tbloid

pg_class should be schema-qualified like elsewhere.  IIUC, pg_class is meant to
get relname only; instead, we could use '%u'::pg_catalog.regclass in the target
list for the relname. Thoughts?
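
That is, something along these lines (sketch based on the query quoted above):

    appendPQExpBuffer(query,
                      "SELECT '%u'::pg_catalog.regclass AS relname, "
                      "v.blkno, v.offnum, v.lp_off, "
                      "v.lp_flags, v.lp_len, v.attnum, v.chunk, v.msg"
                      "\nFROM verify_heapam(rel := %u, on_error_stop := %s, "
                      "skip := %s, startblock := %s, endblock := %s) v",
                      tbloid, tbloid, stop, skip, settings.startblock,
                      settings.endblock);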

Also I think we should skip '\n' from the query string (see appendPQExpBuffer()
in pg_dump.c)


+ appendPQExpBuffer(query,
+   "SELECT i.indexrelid"
+   "\nFROM pg_catalog.pg_index i, pg_catalog.pg_class c"
+   "\nWHERE i.indexrelid = c.oid"
+   "\n    AND c.relam = %u"
+   "\n    AND i.indrelid = %u",
+   BTREE_AM_OID, tbloid);
+
+ ExecuteSqlStatement("RESET search_path");
+ res = ExecuteSqlQuery(query->data, PGRES_TUPLES_OK);
+ PQclear(ExecuteSqlQueryForSingleRow(ALWAYS_SECURE_SEARCH_PATH_SQL));

I don't think we need the search_path query. The main query doesn't have any
dependencies on it.  Same is in check_indexes(), check_index (),
expand_table_name_patterns() & get_table_check_list().
Correct me if I am missing something.


+ output = PageOutput(lines + 2, NULL);
+ for (lineno = 0; usage_text[lineno]; lineno++)
+ fprintf(output, "%s\n", usage_text[lineno]);
+ fprintf(output, "Report bugs to <%s>.\n", PACKAGE_BUGREPORT);
+ fprintf(output, "%s home page: <%s>\n", PACKAGE_NAME, PACKAGE_URL);

I am not sure why we want PageOutput() if the second argument is always going to
be NULL? Can't we directly use printf() instead of PageOutput() + fprintf() ?
e.g. usage() function in pg_basebackup.c.
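
That is, simply (sketch mirroring the lines quoted above):

    for (lineno = 0; usage_text[lineno]; lineno++)
        printf("%s\n", usage_text[lineno]);
    printf("Report bugs to <%s>.\n", PACKAGE_BUGREPORT);
    printf("%s home page: <%s>\n", PACKAGE_NAME, PACKAGE_URL);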

Regards,
Amul



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Aug 16, 2020, at 9:37 PM, Amul Sul <sulamul@gmail.com> wrote:
>
> In addition to this, I found a few more things while reading v13 patch are as
> below:
>
> Patch v13-0001:
>
> -
> +#include "amcheck.h"
>
> Not in correct order.

Fixed.

> +typedef struct BtreeCheckContext
> +{
> + TupleDesc tupdesc;
> + Tuplestorestate *tupstore;
> + bool is_corrupt;
> + bool on_error_stop;
> +} BtreeCheckContext;
>
> Unnecessary spaces/tabs between } and BtreeCheckContext.

This refers to a change in verify_nbtree.c that has been removed.  Per discussions with Peter and Robert, I have simply
withdrawn that portion of the patch.

> static void bt_index_check_internal(Oid indrelid, bool parentcheck,
> - bool heapallindexed, bool rootdescend);
> + bool heapallindexed, bool rootdescend,
> + BtreeCheckContext * ctx);
>
> Unnecessary space between * and ctx. The same changes needed for other places as
> well.

Same as above.  The changes to verify_nbtree.c have been withdrawn.

> ---
>
> Patch v13-0002:
>
> +-- partitioned tables (the parent ones) don't have visibility maps
> +create table test_partitioned (a int, b text default repeat('x', 5000))
> + partition by list (a);
> +-- these should all fail
> +select * from verify_heapam('test_partitioned',
> + on_error_stop := false,
> + skip := NULL,
> + startblock := NULL,
> + endblock := NULL);
> +ERROR:  "test_partitioned" is not a table, materialized view, or TOAST table
> +create table test_partition partition of test_partitioned for values in (1);
> +create index test_index on test_partition (a);
>
> Can't we make it work? If the input is partitioned, I think we could
> collect all its leaf partitions and process them one by one. Thoughts?

I was following the example from pg_visibility.  I haven't thought about your proposal enough to have much opinion as
yet, except that if we do this for pg_amcheck we should do likewise to pg_visibility, for consistency of the user
interface.

> + ctx->chunkno++;
>
> Instead of incrementing  in check_toast_tuple(), I think incrementing should
> happen at the caller  -- just after check_toast_tuple() call.

I agree.

> ---
>
> Patch v13-0003:
>
> + resetPQExpBuffer(query);
> + destroyPQExpBuffer(query);
>
> resetPQExpBuffer() will be unnecessary if the next call is destroyPQExpBuffer().

Thanks.  I removed it in cases where destroyPQExpBuffer is obviously the very next call.

> + appendPQExpBuffer(query,
> +   "SELECT c.relname, v.blkno, v.offnum, v.lp_off, "
> +   "v.lp_flags, v.lp_len, v.attnum, v.chunk, v.msg"
> +   "\nFROM verify_heapam(rel := %u, on_error_stop := %s, "
> +   "skip := %s, startblock := %s, endblock := %s) v, "
> +   "pg_class c"
> +   "\nWHERE c.oid = %u",
> +   tbloid, stop, skip, settings.startblock,
> +   settings.endblock, tbloid
>
> pg_class should be schema-qualified like elsewhere.

Agreed, and changed.

> IIUC, pg_class is meant to
> get relname only, instead, we could use '%u'::pg_catalog.regclass in the target
> list for the relname. Thoughts?

get_table_check_list() creates the list of all tables to be checked, which check_tables() then iterates over, calling
check_table() for each one.  I think some verification that the table still exists is in order.  Using
'%u'::pg_catalog.regclass for a table that has since been dropped would pass in the old table Oid and draw an error of
the 'ERROR:  could not open relation with OID 36311' variety, whereas the current coding will just skip the dropped
table.

> Also I think we should skip '\n' from the query string (see appendPQExpBuffer()
> in pg_dump.c)

I'm not sure I understand.  pg_dump.c uses "\n" in query strings it passes to appendPQExpBuffer(), in a manner very
similar to what this patch does.

> + appendPQExpBuffer(query,
> +   "SELECT i.indexrelid"
> +   "\nFROM pg_catalog.pg_index i, pg_catalog.pg_class c"
> +   "\nWHERE i.indexrelid = c.oid"
> +   "\n    AND c.relam = %u"
> +   "\n    AND i.indrelid = %u",
> +   BTREE_AM_OID, tbloid);
> +
> + ExecuteSqlStatement("RESET search_path");
> + res = ExecuteSqlQuery(query->data, PGRES_TUPLES_OK);
> + PQclear(ExecuteSqlQueryForSingleRow(ALWAYS_SECURE_SEARCH_PATH_SQL));
>
> I don't think we need the search_path query. The main query doesn't have any
> dependencies on it.  Same is in check_indexes(), check_index (),
> expand_table_name_patterns() & get_table_check_list().
> Correct me if I am missing something.

Right.

> + output = PageOutput(lines + 2, NULL);
> + for (lineno = 0; usage_text[lineno]; lineno++)
> + fprintf(output, "%s\n", usage_text[lineno]);
> + fprintf(output, "Report bugs to <%s>.\n", PACKAGE_BUGREPORT);
> + fprintf(output, "%s home page: <%s>\n", PACKAGE_NAME, PACKAGE_URL);
>
> I am not sure why we want PageOutput() if the second argument is always going to
> be NULL? Can't we directly use printf() instead of PageOutput() + fprintf() ?
> e.g. usage() function in pg_basebackup.c.

Done.


Please find attached the next version of the patch.  In addition to your review comments (above), I have made changes
in response to Peter and Robert's review comments upthread.




—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Amul Sul
Date:
On Thu, Aug 20, 2020 at 8:00 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
>
>
>
> > On Aug 16, 2020, at 9:37 PM, Amul Sul <sulamul@gmail.com> wrote:
> >
> > In addition to this, I found a few more things while reading v13 patch are as
> > below:
> >
> > Patch v13-0001:
> >
> > -
> > +#include "amcheck.h"
> >
> > Not in correct order.
>
> Fixed.
>
> > +typedef struct BtreeCheckContext
> > +{
> > + TupleDesc tupdesc;
> > + Tuplestorestate *tupstore;
> > + bool is_corrupt;
> > + bool on_error_stop;
> > +} BtreeCheckContext;
> >
> > Unnecessary spaces/tabs between } and BtreeCheckContext.
>
> This refers to a change in verify_nbtree.c that has been removed.  Per discussions with Peter and Robert, I have
simply withdrawn that portion of the patch.
>
> > static void bt_index_check_internal(Oid indrelid, bool parentcheck,
> > - bool heapallindexed, bool rootdescend);
> > + bool heapallindexed, bool rootdescend,
> > + BtreeCheckContext * ctx);
> >
> > Unnecessary space between * and ctx. The same changes needed for other places as
> > well.
>
> Same as above.  The changes to verify_nbtree.c have been withdrawn.
>
> > ---
> >
> > Patch v13-0002:
> >
> > +-- partitioned tables (the parent ones) don't have visibility maps
> > +create table test_partitioned (a int, b text default repeat('x', 5000))
> > + partition by list (a);
> > +-- these should all fail
> > +select * from verify_heapam('test_partitioned',
> > + on_error_stop := false,
> > + skip := NULL,
> > + startblock := NULL,
> > + endblock := NULL);
> > +ERROR:  "test_partitioned" is not a table, materialized view, or TOAST table
> > +create table test_partition partition of test_partitioned for values in (1);
> > +create index test_index on test_partition (a);
> >
> > Can't we make it work? If the input is partitioned, I think we could
> > collect all its leaf partitions and process them one by one. Thoughts?
>
> I was following the example from pg_visibility.  I haven't thought about your proposal enough to have much opinion as
yet, except that if we do this for pg_amcheck we should do likewise to pg_visibility, for consistency of the user
interface.
>

pg_visibility predates declarative partitioning; I think it's time to improve
that as well.

> > + ctx->chunkno++;
> >
> > Instead of incrementing  in check_toast_tuple(), I think incrementing should
> > happen at the caller  -- just after check_toast_tuple() call.
>
> I agree.
>
> > ---
> >
> > Patch v13-0003:
> >
> > + resetPQExpBuffer(query);
> > + destroyPQExpBuffer(query);
> >
> > resetPQExpBuffer() will be unnecessary if the next call is destroyPQExpBuffer().
>
> Thanks.  I removed it in cases where destroyPQExpBuffer is obviously the very next call.
>
> > + appendPQExpBuffer(query,
> > +   "SELECT c.relname, v.blkno, v.offnum, v.lp_off, "
> > +   "v.lp_flags, v.lp_len, v.attnum, v.chunk, v.msg"
> > +   "\nFROM verify_heapam(rel := %u, on_error_stop := %s, "
> > +   "skip := %s, startblock := %s, endblock := %s) v, "
> > +   "pg_class c"
> > +   "\nWHERE c.oid = %u",
> > +   tbloid, stop, skip, settings.startblock,
> > +   settings.endblock, tbloid
> >
> > pg_class should be schema-qualified like elsewhere.
>
> Agreed, and changed.
>
> > IIUC, pg_class is meant to
> > get relname only, instead, we could use '%u'::pg_catalog.regclass in the target
> > list for the relname. Thoughts?
>
> get_table_check_list() creates the list of all tables to be checked, which check_tables() then iterates over, calling
check_table() for each one.  I think some verification that the table still exists is in order.  Using
'%u'::pg_catalog.regclass for a table that has since been dropped would pass in the old table Oid and draw an error of
the 'ERROR:  could not open relation with OID 36311' variety, whereas the current coding will just skip the dropped
table.
>
> > Also I think we should skip '\n' from the query string (see appendPQExpBuffer()
> > in pg_dump.c)
>
> I'm not sure I understand.  pg_dump.c uses "\n" in query strings it passes to appendPQExpBuffer(), in a manner very
similar to what this patch does.
>

I see there is a mix of styles, I was referring to dumpDatabase() from pg_dump.c
which doesn't include '\n'.

> > + appendPQExpBuffer(query,
> > +   "SELECT i.indexrelid"
> > +   "\nFROM pg_catalog.pg_index i, pg_catalog.pg_class c"
> > +   "\nWHERE i.indexrelid = c.oid"
> > +   "\n    AND c.relam = %u"
> > +   "\n    AND i.indrelid = %u",
> > +   BTREE_AM_OID, tbloid);
> > +
> > + ExecuteSqlStatement("RESET search_path");
> > + res = ExecuteSqlQuery(query->data, PGRES_TUPLES_OK);
> > + PQclear(ExecuteSqlQueryForSingleRow(ALWAYS_SECURE_SEARCH_PATH_SQL));
> >
> > I don't think we need the search_path query. The main query doesn't have any
> > dependencies on it.  Same is in check_indexes(), check_index (),
> > expand_table_name_patterns() & get_table_check_list().
> > Correct me if I am missing something.
>
> Right.
>
> > + output = PageOutput(lines + 2, NULL);
> > + for (lineno = 0; usage_text[lineno]; lineno++)
> > + fprintf(output, "%s\n", usage_text[lineno]);
> > + fprintf(output, "Report bugs to <%s>.\n", PACKAGE_BUGREPORT);
> > + fprintf(output, "%s home page: <%s>\n", PACKAGE_NAME, PACKAGE_URL);
> >
> > I am not sure why we want PageOutput() if the second argument is always going to
> > be NULL? Can't we directly use printf() instead of PageOutput() + fprintf() ?
> > e.g. usage() function in pg_basebackup.c.
>
> Done.
>
>
> Please find attached the next version of the patch.  In addition to your review comments (above), I have made changes
in response to Peter and Robert's review comments upthread.

Thanks for the updated version, I'll have a look.

Regards,
Amul



Re: new heapcheck contrib module

From
Amul Sul
Date:
Few comments for v14 version:

v14-0001:

verify_heapam.c: In function ‘verify_heapam’:
verify_heapam.c:339:14: warning: variable ‘ph’ set but not used
[-Wunused-but-set-variable]
   PageHeader ph;
              ^
verify_heapam.c: In function ‘check_toast_tuple’:
verify_heapam.c:877:8: warning: variable ‘chunkdata’ set but not used
[-Wunused-but-set-variable]
  char    *chunkdata;

Got these compilation warnings


+++ b/contrib/amcheck/amcheck.h
@@ -0,0 +1,5 @@
+#include "postgres.h"
+
+Datum verify_heapam(PG_FUNCTION_ARGS);
+Datum bt_index_check(PG_FUNCTION_ARGS);
+Datum bt_index_parent_check(PG_FUNCTION_ARGS);

bt_index_* are needed?


#include "access/htup_details.h"
#include "access/xact.h"
#include "catalog/pg_type.h"
#include "catalog/storage_xlog.h"
#include "storage/smgr.h"
#include "utils/lsyscache.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
#include "utils/syscache.h"

These header file inclusions in verify_heapam.c can be omitted. Some of those
might be implicitly added by other header files, or might no longer be needed due to
recent changes.


+ *   on_error_stop:
+ *     Whether to stop at the end of the first page for which errors are
+ *     detected.  Note that multiple rows may be returned.
+ *
+ *   check_toast:
+ *     Whether to check each toasted attribute against the toast table to
+ *     verify that it can be found there.
+ *
+ *   skip:
+ *     What kinds of pages in the heap relation should be skipped.  Valid
+ *     options are "all-visible", "all-frozen", and "none".

I think it would be good if the description also included what the default
value will be otherwise.


+ /*
+ * Optionally open the toast relation, if any, also protected from
+ * concurrent vacuums.
+ */

Now that the lock is changed to AccessShareLock, I think we need to rephrase this comment
as well, since we are not really doing anything extra explicitly to protect from
a concurrent vacuum.


+/*
+ * Return wehter a multitransaction ID is in the cached valid range.
+ */

Typo: s/wehter/whether


v14-0002:

+#define NOPAGER 0

Unused macro.


+ appendPQExpBuffer(querybuf,
+   "SELECT c.relname, v.blkno, v.offnum, v.attnum, v.msg"
+   "\nFROM public.verify_heapam("
+ "\nrelation := %u,"
+ "\non_error_stop := %s,"
+ "\nskip := %s,"
+ "\ncheck_toast := %s,"
+ "\nstartblock := %s,"
+ "\nendblock := %s) v, "
+ "\npg_catalog.pg_class c"
+   "\nWHERE c.oid = %u",
+   tbloid, stop, skip, toast, startblock, endblock, tbloid);
[....]
+ appendPQExpBuffer(querybuf,
+   "SELECT public.bt_index_parent_check('%s'::regclass, %s, %s)",
+   idxoid,
+   settings.heapallindexed ? "true" : "false",
+   settings.rootdescend ? "true" : "false");

The assumption that the amcheck extension will always be installed in the public
schema doesn't seem to be correct. This will not work if amcheck is installed
somewhere else.

Regards,
Amul







Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Aug 24, 2020, at 2:48 AM, Amul Sul <sulamul@gmail.com> wrote:
>
> Few comments for v14 version:
>
> v14-0001:
>
> verify_heapam.c: In function ‘verify_heapam’:
> verify_heapam.c:339:14: warning: variable ‘ph’ set but not used
> [-Wunused-but-set-variable]
>   PageHeader ph;
>              ^
> verify_heapam.c: In function ‘check_toast_tuple’:
> verify_heapam.c:877:8: warning: variable ‘chunkdata’ set but not used
> [-Wunused-but-set-variable]
>  char    *chunkdata;
>
> Got these compilation warnings

Removed.

>
>
> +++ b/contrib/amcheck/amcheck.h
> @@ -0,0 +1,5 @@
> +#include "postgres.h"
> +
> +Datum verify_heapam(PG_FUNCTION_ARGS);
> +Datum bt_index_check(PG_FUNCTION_ARGS);
> +Datum bt_index_parent_check(PG_FUNCTION_ARGS);
>
> bt_index_* are needed?

This entire header file is not needed.  Removed.

> #include "access/htup_details.h"
> #include "access/xact.h"
> #include "catalog/pg_type.h"
> #include "catalog/storage_xlog.h"
> #include "storage/smgr.h"
> #include "utils/lsyscache.h"
> #include "utils/rel.h"
> #include "utils/snapmgr.h"
> #include "utils/syscache.h"
>
> These header file inclusion to verify_heapam.c. can be omitted. Some of those
> might be implicitly got added by other header files or no longer need due to
> recent changes.

Removed.


> + *   on_error_stop:
> + *     Whether to stop at the end of the first page for which errors are
> + *     detected.  Note that multiple rows may be returned.
> + *
> + *   check_toast:
> + *     Whether to check each toasted attribute against the toast table to
> + *     verify that it can be found there.
> + *
> + *   skip:
> + *     What kinds of pages in the heap relation should be skipped.  Valid
> + *     options are "all-visible", "all-frozen", and "none".
>
> I think it would be good if the description also includes what will be default
> value otherwise.

The defaults are defined in amcheck--1.2--1.3.sql, and I was concerned that documenting them in verify_heapam.c would
create a hazard of the defaults and their documented values getting out of sync.  The handling of null arguments in
verify_heapam.c was, however, duplicating the defaults from the .sql file, so I've changed that to just ereport error on
null. (I can't make the whole function strict, as some other arguments are allowed to be null.)  I have not documented
the defaults in either file, as they are quite self-evident in the .sql file.  I've updated some tests that were passing
null to get the default behavior to now either pass nothing or explicitly pass the argument they want.
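
For illustration only, the ereport-on-null handling described above is shaped something like this (the argument positions and message wording here are placeholders, not the patch verbatim):

    if (PG_ARGISNULL(1) || PG_ARGISNULL(2) || PG_ARGISNULL(3))
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("on_error_stop, skip, and check_toast may not be null")));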

>
> + /*
> + * Optionally open the toast relation, if any, also protected from
> + * concurrent vacuums.
> + */
>
> Now lock is changed to AccessShareLock, I think we need to rephrase this comment
> as well since we are not really doing anything extra explicitly to protect from
> the concurrent vacuum.

Right.  Comment changed.

> +/*
> + * Return wehter a multitransaction ID is in the cached valid range.
> + */
>
> Typo: s/wehter/whether

Changed.

> v14-0002:
>
> +#define NOPAGER 0
>
> Unused macro.

Removed.

> + appendPQExpBuffer(querybuf,
> +   "SELECT c.relname, v.blkno, v.offnum, v.attnum, v.msg"
> +   "\nFROM public.verify_heapam("
> + "\nrelation := %u,"
> + "\non_error_stop := %s,"
> + "\nskip := %s,"
> + "\ncheck_toast := %s,"
> + "\nstartblock := %s,"
> + "\nendblock := %s) v, "
> + "\npg_catalog.pg_class c"
> +   "\nWHERE c.oid = %u",
> +   tbloid, stop, skip, toast, startblock, endblock, tbloid);
> [....]
> + appendPQExpBuffer(querybuf,
> +   "SELECT public.bt_index_parent_check('%s'::regclass, %s, %s)",
> +   idxoid,
> +   settings.heapallindexed ? "true" : "false",
> +   settings.rootdescend ? "true" : "false");
>
> The assumption that the amcheck extension will be always installed in the public
> schema doesn't seem to be correct. This will not work if amcheck install
> somewhere else.

Right.  I removed the schema qualification, leaving it up to the search path.

Thanks for the review!


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
"Andrey M. Borodin"
Date:

> 25 авг. 2020 г., в 19:36, Mark Dilger <mark.dilger@enterprisedb.com> написал(а):

Hi Mark!

Thanks for working on this important feature.

I was experimenting a bit with our internal heapcheck and found out that it's not helping with truncated CLOG anyhow.
Will your module be able to gather TIDs of similar corruptions?

server/db M # select * from heap_check('pg_toast.pg_toast_4848601');
ERROR:  58P01: could not access status of transaction 636558742
DETAIL:  Could not open file "pg_xact/025F": No such file or directory.
LOCATION:  SlruReportIOError, slru.c:913
Time: 3439.915 ms (00:03.440)

Thanks!

Best regards, Andrey Borodin.


Re: new heapcheck contrib module

From
Robert Haas
Date:
On Fri, Aug 28, 2020 at 1:07 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> I was experimenting a bit with our internal heapcheck and found out that it's not helping with truncated CLOG
anyhow.
> Will your module be able to gather tid's of similar corruptions?
>
> server/db M # select * from heap_check('pg_toast.pg_toast_4848601');
> ERROR:  58P01: could not access status of transaction 636558742
> DETAIL:  Could not open file "pg_xact/025F": No such file or directory.
> LOCATION:  SlruReportIOError, slru.c:913
> Time: 3439.915 ms (00:03.440)

This kind of thing gets really tricky. PostgreSQL uses errors in tons
of places to report problems, and if you want to accumulate a list of
errors and report them all rather than just letting the first one
cancel the operation, you need special handling for each individual
error you want to bypass. A tool like this naturally wants to use as
much PostgreSQL infrastructure as possible, to avoid duplicating a ton
of code and creating a bloated monstrosity, but all that code can
throw errors. I think the code in its current form is trying to be
resilient against problems on the table pages that it is actually
checking, but it can't necessarily handle gracefully corruption in
other parts of the system. For instance:

- CLOG could be truncated, as in your example
- the disk files could have had their permissions changed so that they
can't be accessed
- the PageIsVerified() check might fail when pages are read
- the TOAST table's metadata in pg_class/pg_attribute/etc. could be corrupted
- ...or the files for those system catalogs could've had their
permissions changed
- ....or they could contain invalid pages
- ...or their indexes could be messed up

I think there are probably a bunch more, and I don't think it's
practical to allow this tool to continue after arbitrary stuff goes
wrong. It'll be too much code and impossible to maintain. In the case
you mention, I think we should view that as a problem with clog rather
than a problem with the table, and thus out of scope.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Aug 27, 2020, at 10:07 PM, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
>
>
>> 25 авг. 2020 г., в 19:36, Mark Dilger <mark.dilger@enterprisedb.com> написал(а):
>
> Hi Mark!
>
> Thanks for working on this important feature.
>
> I was experimenting a bit with our internal heapcheck and found out that it's not helping with truncated CLOG anyhow.
> Will your module be able to gather tid's of similar corruptions?
>
> server/db M # select * from heap_check('pg_toast.pg_toast_4848601');
> ERROR:  58P01: could not access status of transaction 636558742
> DETAIL:  Could not open file "pg_xact/025F": No such file or directory.
> LOCATION:  SlruReportIOError, slru.c:913
> Time: 3439.915 ms (00:03.440)

The design principle for verify_heapam.c is, if the rest of the system is not corrupt, corruption in the table being
checked should not cause a crash during the table check. This is a very limited principle.  Even corruption in the
associated toast table or toast index could cause a crash.  That is why checking against the toast table is optional,
and false by default.

Perhaps a more extensive effort could be made later.  I think it is out of scope for this release cycle.  It is a very
interesting area for further research, though.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
"Andrey M. Borodin"
Date:

> 28 авг. 2020 г., в 18:58, Robert Haas <robertmhaas@gmail.com> написал(а):
> In the case
> you mention, I think we should view that as a problem with clog rather
> than a problem with the table, and thus out of scope.

I don't think so. ISTM it's the same problem of xmax < relfrozenxid actually, just hidden behind detoasting.
Our regular heap_check was checking xmin/xmax invariants for tables, but failed to recognise the problem in toast
(while toast was accessible until CLOG truncation).

Best regards, Andrey Borodin.


Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Aug 28, 2020, at 11:10 AM, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
>
>
>> 28 авг. 2020 г., в 18:58, Robert Haas <robertmhaas@gmail.com> написал(а):
>> In the case
>> you mention, I think we should view that as a problem with clog rather
>> than a problem with the table, and thus out of scope.
>
> I don't think so. ISTM It's the same problem of xmax<relfrozenxid actually, just hidden behind detoasing.
> Our regular heap_check was checking xmin\xmax invariants for tables, but failed to recognise the problem in toast
(while toast was accessible until CLOG truncation).
>
> Best regards, Andrey Borodin.

If you lock the relations involved, check the toast table first, the toast index second, and the main table third, do
you still get the problem?  Look at how pg_amcheck handles this and let me know if you still see a problem.  There is
the ever-present problem that external forces, like a rogue process deleting backend files, will strike at precisely the
wrong moment, but barring that kind of concurrent corruption, I think the toast table being checked prior to the main
table being checked solves some of the issues you are worried about.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Robert Haas
Date:
On Fri, Aug 28, 2020 at 2:10 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
> I don't think so. ISTM It's the same problem of xmax<relfrozenxid actually, just hidden behind detoasing.
> Our regular heap_check was checking xmin\xmax invariants for tables, but failed to recognise the problem in toast
(while toast was accessible until CLOG truncation).
 

The code can (and should, and I think does) refrain from looking up
XIDs that are out of the range thought to be valid -- but how do you
propose that it avoid looking up XIDs that ought to have clog data
associated with them despite being >= relfrozenxid and < nextxid?
TransactionIdDidCommit() does not have a suppress-errors flag, adding
one would be quite invasive, yet we cannot safely perform a
significant number of checks without knowing whether the inserting
transaction committed.
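
To put it concretely, the most the checker can do cheaply is a guard along
these lines (sketch, reusing the context fields from the patch):

    bool        inserter_committed = false;

    if (TransactionIdIsNormal(xmin) &&
        !TransactionIdPrecedes(xmin, ctx->relfrozenxid) &&
        TransactionIdPrecedes(xmin, ctx->next_valid_xid))
        inserter_committed = TransactionIdDidCommit(xmin); /* can still ERROR on truncated clog */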

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
"Andrey M. Borodin"
Date:

> 29 авг. 2020 г., в 00:56, Robert Haas <robertmhaas@gmail.com> написал(а):
>
> On Fri, Aug 28, 2020 at 2:10 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>> I don't think so. ISTM It's the same problem of xmax<relfrozenxid actually, just hidden behind detoasing.
>> Our regular heap_check was checking xmin\xmax invariants for tables, but failed to recognise the problem in toast
(while toast was accessible until CLOG truncation).
>
> The code can (and should, and I think does) refrain from looking up
> XIDs that are out of the range thought to be valid -- but how do you
> propose that it avoid looking up XIDs that ought to have clog data
> associated with them despite being >= relfrozenxid and < nextxid?
> TransactionIdDidCommit() does not have a suppress-errors flag, adding
> one would be quite invasive, yet we cannot safely perform a
> significant number of checks without knowing whether the inserting
> transaction committed.

What you write seems completely correct to me. I agree that CLOG thresholds lookup seems unnecessary.

But I have a real corruption at hand (on a testing site). I have the heapcheck proposed here, and I have pg_surgery from
the thread nearby. Yet I cannot fix the problem, because I cannot list the affected tuples. These tools do not solve a
problem that has been neglected for long enough. It would be supercool if they could.

This corruption, like caries, had 3 stages:
1. an incorrect VM flag saying the page does not need vacuum
2. xmin and xmax < relfrozenxid
3. CLOG truncated

Stage 2 is curable with the proposed toolset, stage 3 is not. But they are not that different.

Thanks!

Best regards, Andrey Borodin.


Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Aug 29, 2020, at 3:27 AM, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>
>
>
>> 29 авг. 2020 г., в 00:56, Robert Haas <robertmhaas@gmail.com> написал(а):
>>
>> On Fri, Aug 28, 2020 at 2:10 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote:
>>> I don't think so. ISTM It's the same problem of xmax<relfrozenxid actually, just hidden behind detoasing.
>>> Our regular heap_check was checking xmin\xmax invariants for tables, but failed to recognise the problem in toast
(while toast was accessible until CLOG truncation).
>>
>> The code can (and should, and I think does) refrain from looking up
>> XIDs that are out of the range thought to be valid -- but how do you
>> propose that it avoid looking up XIDs that ought to have clog data
>> associated with them despite being >= relfrozenxid and < nextxid?
>> TransactionIdDidCommit() does not have a suppress-errors flag, adding
>> one would be quite invasive, yet we cannot safely perform a
>> significant number of checks without knowing whether the inserting
>> transaction committed.
>
> What you write seems completely correct to me. I agree that CLOG thresholds lookup seems unnecessary.
>
> But I have a real corruption at hand (on testing site). If I have proposed here heapcheck. And I have pg_surgery from
the thread nearby. Yet I cannot fix the problem, because cannot list affected tuples. These tools do not solve the
problem neglected for long enough. It would be supercool if they could.
>
> This corruption like a caries had 3 stages:
> 1. incorrect VM flag that page do not need vacuum
> 2. xmin and xmax < relfrozenxid
> 3. CLOG truncated
>
> Stage 2 is curable with proposed toolset, stage 3 is not. But they are not that different.

I had an earlier version of the verify_heapam patch that included a non-throwing interface to clog.  Ultimately, I
ripped that out.  My reasoning was that a simpler patch submission was more likely to be acceptable to the community.

If you want to submit a separate patch that creates a non-throwing version of the clog interface, and get the community
to accept and commit it, I would seriously consider using that from verify_heapam.  If it gets committed in time, I
might even do so for this release cycle.  But I don't want to make this patch dependent on that hypothetical patch
getting written and accepted.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Robert Haas
Date:
On Tue, Aug 25, 2020 at 10:36 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Thanks for the review!

+                                                         msg OUT text
+                                                         )

Looks like atypical formatting.

+REVOKE ALL ON FUNCTION
+verify_heapam(regclass, boolean, boolean, cstring, bigint, bigint)
+FROM PUBLIC;

This too.

+-- Don't want this to be available to public

Add "by default, but superusers can grant access" or so?

I think there should be a call to pg_class_aclcheck() here, just like
the one in pg_prewarm, so that if the superuser does choose to grant
access, users given access can check tables they anyway have
permission to access, but not others. Maybe put that in
check_relation_relkind_and_relam() and rename it. Might want to look
at the pg_surgery precedent, too. Oh, and that function's header
comment is also wrong.
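
Something along the lines of what pg_prewarm does, e.g. (sketch):

    AclResult   aclresult;

    aclresult = pg_class_aclcheck(RelationGetRelid(rel), GetUserId(), ACL_SELECT);
    if (aclresult != ACLCHECK_OK)
        aclcheck_error(aclresult,
                       get_relkind_objtype(rel->rd_rel->relkind),
                       RelationGetRelationName(rel));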

I think that the way the checks on the block range are performed could
be improved. Generally, we want to avoid reporting the same problem
with a variety of different message strings, because it adds burden
for translators and is potentially confusing for users. You've got two
message strings that are only going to be used for empty relations and
a third message string that is only going to be used for non-empty
relations. What stops you from just ripping off the way that this is
done in pg_prewarm, which requires only 2 messages? Then you'd be
adding a net total of 0 new messages instead of 3, and in my view they
would be clearer than your third message, "block range is out of
bounds for relation with block count %u: " INT64_FORMAT " .. "
INT64_FORMAT, which doesn't say very precisely what the problem is,
and also falls afoul of our usual practice of avoiding the use of
INT64_FORMAT in error messages that are subject to translation. I
notice that pg_prewarm just silently does nothing if the start and end
blocks are swapped, rather than generating an error. We could choose
to do differently here, but I'm not sure why we should bother.

+                       all_frozen = mapbits & VISIBILITYMAP_ALL_VISIBLE;
+                       all_visible = mapbits & VISIBILITYMAP_ALL_FROZEN;
+
+                       if ((all_frozen && skip_option ==
SKIP_PAGES_ALL_FROZEN) ||
+                               (all_visible && skip_option ==
SKIP_PAGES_ALL_VISIBLE))
+                       {
+                               continue;
+                       }

This isn't horrible style, but why not just get rid of the local
variables? e.g. if (skip_option == SKIP_PAGES_ALL_FROZEN) { if
((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0) continue; } else { ... }

Typically no braces around a block containing only one line.
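
Concretely, the restructuring I have in mind is roughly (untested, reusing the
patch's names):

    if (skip_option == SKIP_PAGES_ALL_FROZEN)
    {
        if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
            continue;
    }
    else if (skip_option == SKIP_PAGES_ALL_VISIBLE)
    {
        if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
            continue;
    }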

+ * table contains corrupt all frozen bits, a concurrent vacuum might skip the

all-frozen?

+ * relfrozenxid beyond xid.) Reporting the xid as valid under such conditions
+ * seems acceptable, since if we had checked it earlier in our scan it would
+ * have truly been valid at that time, and we break no MVCC guarantees by
+ * failing to notice the concurrent change in its status.

I agree with the first half of this sentence, but I don't know what
MVCC guarantees have to do with anything. I'd just delete the second
part, or make it a lot clearer.

+ * Some kinds of tuple header corruption make it unsafe to check the tuple
+ * attributes, for example when the tuple is foreshortened and such checks
+ * would read beyond the end of the line pointer (and perhaps the page).  In

I think of foreshortening mostly as an art term, though I guess it has
other meanings. Maybe it would be clearer to say something like "Some
kinds of corruption make it unsafe to check the tuple attributes, for
example when the line pointer refers to a range of bytes outside the
page"?

+ * Other kinds of tuple header corruption do not bare on the question of

bear

+                                                 pstrdup(_("updating
transaction ID marked incompatibly as keys updated and locked
only")));
+                                                 pstrdup(_("updating
transaction ID marked incompatibly as committed and as a
multitransaction ID")));

"updating transaction ID" might scare somebody who thinks that you are
telling them that you changed something. That's not what it means, but
it might not be totally clear. Maybe:

tuple is marked as only locked, but also claims key columns were updated
multixact should not be marked committed

+
psprintf(_("data offset differs from expected: %u vs. %u (1 attribute,
has nulls)"),

For these, how about:

tuple data should begin at byte %u, but actually begins at byte %u (1
attribute, has nulls)
etc.

+
psprintf(_("old-style VACUUM FULL transaction ID is in the future:
%u"),
+
psprintf(_("old-style VACUUM FULL transaction ID precedes freeze
threshold: %u"),
+
psprintf(_("old-style VACUUM FULL transaction ID is invalid in this
relation: %u"),

old-style VACUUM FULL transaction ID %u is in the future
old-style VACUUM FULL transaction ID %u precedes freeze threshold %u
old-style VACUUM FULL transaction ID %u out of range %u..%u

Doesn't the second of these overlap with the third?

Similarly in other places, e.g.

+
psprintf(_("inserting transaction ID is in the future: %u"),

I think this should change to: inserting transaction ID %u is in the future

+       else if (VARATT_IS_SHORT(chunk))
+               /*
+                * could happen due to heap_form_tuple doing its thing
+                */
+               chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;

Add braces here, since there are multiple lines.

+                                                 psprintf(_("toast
chunk sequence number not the expected sequence number: %u vs. %u"),

toast chunk sequence number %u does not match expected sequence number %u

There are more instances of this kind of thing.

+
psprintf(_("toasted attribute has unexpected TOAST tag: %u"),

Remove colon.

+
psprintf(_("attribute ends at offset beyond total tuple length: %u vs.
%u (attribute length %u)"),

Let's try to specify the attribute number in the attribute messages
where we can, e.g.

+
psprintf(_("attribute ends at offset beyond total tuple length: %u vs.
%u (attribute length %u)"),

How about: attribute %u with length %u should end at offset %u, but
the tuple length is only %u

+               if (TransactionIdIsNormal(ctx->relfrozenxid) &&
+                       TransactionIdPrecedes(xmin, ctx->relfrozenxid))
+               {
+                       report_corruption(ctx,
+                                                         /*
translator: Both %u are transaction IDs. */
+
psprintf(_("inserting transaction ID is from before freeze cutoff: %u
vs. %u"),
+
    xmin, ctx->relfrozenxid));
+                       fatal = true;
+               }
+               else if (!xid_valid_in_rel(xmin, ctx))
+               {
+                       report_corruption(ctx,
+                                                         /*
translator: %u is a transaction ID. */
+
psprintf(_("inserting transaction ID is in the future: %u"),
+
    xmin));
+                       fatal = true;
+               }

This seems like good evidence that xid_valid_in_rel needs some
rethinking. As far as I can see, every place where you call
xid_valid_in_rel, you have checks beforehand that duplicate some of
what it does, so that you can give a more accurate error message.
That's not good. Either the message should be adjusted so that it
covers all the cases "e.g. tuple xmin %u is outside acceptable range
%u..%u" or we should just get rid of xid_valid_in_rel() and have
separate error messages for each case, e.g. tuple xmin %u precedes
relfrozenxid %u". I think it's OK to use terms like xmin and xmax in
these messages, rather than inserting transaction ID etc. We have
existing instances of that, and while someone might judge it
user-unfriendly, I disagree. A person who is qualified to interpret
this output must know what 'tuple xmin' means immediately, but whether
they can understand that 'inserting transaction ID' means the same
thing is questionable, I think.
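
For instance, the first of those two options might look roughly like this
(sketch only, using the field names from the patch):

    if (TransactionIdIsNormal(ctx->relfrozenxid) &&
        TransactionIdIsNormal(xmin) &&
        (TransactionIdPrecedes(xmin, ctx->relfrozenxid) ||
         !TransactionIdPrecedes(xmin, ctx->next_valid_xid)))
    {
        report_corruption(ctx,
                          psprintf(_("tuple xmin %u is outside acceptable range %u..%u"),
                                   xmin, ctx->relfrozenxid, ctx->next_valid_xid));
        fatal = true;
    }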

This is not a full review, but in general I think that this is getting
pretty close to being committable. The error messages seem to still
need some polishing and I wouldn't be surprised if there are a few
more bugs lurking yet, but I think it's come a long way.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Sep 21, 2020, at 2:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> I think there should be a call to pg_class_aclcheck() here, just like
> the one in pg_prewarm, so that if the superuser does choose to grant
> access, users given access can check tables they anyway have
> permission to access, but not others. Maybe put that in
> check_relation_relkind_and_relam() and rename it. Might want to look
> at the pg_surgery precedent, too.

In the presence of corruption, verify_heapam() reports to the user (in other words, leaks) metadata about the corrupted
rows. Reasoning about the attack vectors this creates is hard, but a conservative approach is to assume that an
attacker can cause corruption in order to benefit from the leakage, and make sure the leakage does not violate any
reasonable security expectations.

Basing the security decision on whether the user has access to read the table seems insufficient, as it ignores row
level security.  Perhaps that is ok if row level security is not enabled for the table or if the user has been granted
BYPASSRLS. There is another problem, though.  There is no grantable privilege to read dead rows.  In the case of
corruption, verify_heapam() may well report metadata about dead rows.

pg_surgery also appears to leak information about dead rows.  Owners of tables can probe whether supplied TIDs refer to
dead rows.  If a table containing sensitive information has rows deleted prior to ownership being transferred, the new
owner of the table could probe each page of deleted data to determine something of the content that was there.
Information about the number of deleted rows is already available through the pg_stat_* views, but those views don't
give such a fine-grained approach to figuring out how large each deleted row was.  For a table with fixed content
options, the content can sometimes be completely inferred from the length of the row.  (Consider a table with a single
text column containing either "approved" or "denied".)

But pg_surgery is understood to be a collection of sharp tools only to be used under fairly exceptional conditions.
amcheck, on the other hand, is something that feels safer and more reasonable to use on a regular basis, perhaps from a
cron job executed by a less trusted user.  Forcing the user to be superuser makes it clearer that this feeling of safety
is not justified.

I am inclined to just restrict verify_heapam() to superusers and be done.  What do you think?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Tue, Sep 22, 2020 at 10:55 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> I am inclined to just restrict verify_heapam() to superusers and be done.  What do you think?

The existing amcheck functions were designed to have execute privilege
granted to non-superusers, though we never actually advertised that
fact. Maybe now would be a good time to start doing so.

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Tue, Sep 22, 2020 at 1:55 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> I am inclined to just restrict verify_heapam() to superusers and be done.  What do you think?

I think that's an old and largely failed approach. If you want to use
pg_class_ownercheck here rather than pg_class_aclcheck or something
like that, seems fair enough. But I don't think there should be an
is-superuser check in the code, because we've been trying really hard
to get rid of those in most places. And I also don't think there
should be no secondary permissions check, because if somebody does
grant execute permission on these functions, it's unlikely that they
want the person getting that permission to be able to check every
relation in the system even those on which they have no other
privileges at all.

But now I see that there's no secondary permission check in the
verify_nbtree.c code. Is that intentional? Peter, what's the
justification for that?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Tue, Sep 22, 2020 at 12:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
> But now I see that there's no secondary permission check in the
> verify_nbtree.c code. Is that intentional? Peter, what's the
> justification for that?

As noted by comments in contrib/amcheck/sql/check_btree.sql (the
verify_nbtree.c tests), this is intentional. Note that we explicitly
test that a non-superuser role can perform verification following
GRANT EXECUTE ON FUNCTION ... .

As I mentioned earlier, this is supported (or at least it is supported
in my interpretation of things). It just isn't documented anywhere
outside the test itself.

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Mon, Sep 21, 2020 at 2:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
> +REVOKE ALL ON FUNCTION
> +verify_heapam(regclass, boolean, boolean, cstring, bigint, bigint)
> +FROM PUBLIC;
>
> This too.

Do we really want to use a cstring as an enum-like argument?

I think that I see a bug at this point in check_tuple() (in
v15-0001-Adding-function-verify_heapam-to-amcheck-module.patch):

> +   /* If xmax is a multixact, it should be within valid range */
> +   xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr);
> +   if ((infomask & HEAP_XMAX_IS_MULTI) && !mxid_valid_in_rel(xmax, ctx))
> +   {

*** SNIP ***

> +   }
> +
> +   /* If xmax is normal, it should be within valid range */
> +   if (TransactionIdIsNormal(xmax))
> +   {

Why should it be okay to call TransactionIdIsNormal(xmax) at this
point? It isn't certain that xmax is an XID at all (could be a
MultiXactId, since you called HeapTupleHeaderGetRawXmax() to get the
value in the first place). Don't you need to check "(infomask &
HEAP_XMAX_IS_MULTI) == 0" here?
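
That is, I'd expect the second check to be guarded along these lines (sketch):

    /* Only treat xmax as an XID when it isn't a MultiXactId */
    if ((infomask & HEAP_XMAX_IS_MULTI) == 0 && TransactionIdIsNormal(xmax))
    {
        /* ... the existing "xmax should be within valid range" checks ... */
    }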

This does look like it's shaping up. Thanks for working on it, Mark.

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Sat, Aug 29, 2020 at 10:48 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> I had an earlier version of the verify_heapam patch that included a non-throwing interface to clog.  Ultimately, I
ripped that out.  My reasoning was that a simpler patch submission was more likely to be acceptable to the community.
 

Isn't some kind of pragmatic compromise possible?

> But I don't want to make this patch dependent on that hypothetical patch getting written and accepted.

Fair enough, but if you're alluding to what I said then about
check_tuphdr_xids()/clog checking a while back then FWIW I didn't
intend to block progress on clog/xact status verification at all. I
just don't think that it is sensible to impose an iron clad guarantee
about having no assertion failures with corrupt clog data -- that
leads to far too much code duplication. But why should you need to
provide an absolute guarantee of that?

I for one would be fine with making the clog checks an optional extra,
that rescinds the no crash guarantee that you're keen on -- just like
with the TOAST checks that you have already in v15. It might make
sense to review how often crashes occur with simulated corruption, and
then to minimize the number of occurrences in the real world. Maybe we
could tolerate a usually-no-crash interface to clog -- if it could
still have assertion failures. Making a strong guarantee about
assertions seems unnecessary.

I don't see how verify_heapam will avoid raising an error during basic
validation from PageIsVerified(), which will violate the guarantee
about not throwing errors. I don't see that as a problem myself, but
presumably you will.

--
Peter Geoghegan



Re: new heapcheck contrib module

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> On Mon, Sep 21, 2020 at 2:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> +REVOKE ALL ON FUNCTION
>> +verify_heapam(regclass, boolean, boolean, cstring, bigint, bigint)
>> +FROM PUBLIC;
>> 
>> This too.

> Do we really want to use a cstring as an enum-like argument?

Ugh.  We should not be using cstring as a SQL-exposed datatype
unless there really is no alternative.  Why wasn't this argument
declared "text"?

            regards, tom lane



Re: new heapcheck contrib module

From
Stephen Frost
Date:
Greetings,

* Peter Geoghegan (pg@bowt.ie) wrote:
> On Tue, Sep 22, 2020 at 12:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > But now I see that there's no secondary permission check in the
> > verify_nbtree.c code. Is that intentional? Peter, what's the
> > justification for that?
>
> As noted by comments in contrib/amcheck/sql/check_btree.sql (the
> verify_nbtree.c tests), this is intentional. Note that we explicitly
> test that a non-superuser role can perform verification following
> GRANT EXECUTE ON FUNCTION ... .

> As I mentioned earlier, this is supported (or at least it is supported
> in my interpretation of things). It just isn't documented anywhere
> outside the test itself.

Would certainly be good to document this but I tend to agree with the
comments that ideally-

a) it'd be nice for a relatively low-privileged user/process could run
   the tests in an ongoing manner
b) we don't want to add more is-superuser checks
c) users shouldn't really be given the ability to see rows they're not
   supposed to have access to

In other places in the code, when an error is generated and the user
doesn't have access to the underlying table or doesn't have BYPASSRLS,
we don't include the details or the actual data in the error.  Perhaps
that approach would make sense here (or perhaps not, but it doesn't seem
entirely crazy to me, anyway).  In other words:

a) keep the ability for someone who has EXECUTE on the function to be
   able to run the function against any relation
b) when we detect an issue, perform a permissions check to see if the
   user calling the function has rights to read the rows of the table
   and, if RLS is enabled on the table, if they have BYPASSRLS
c) if the user has appropriate privileges, log the detailed error, if
   not, return a generic error with a HINT that details weren't
   available due to lack of privileges on the relation

I can appreciate the concerns regarding dead rows ending up being
visible to someone who wouldn't normally be able to see them but I'd
argue we could simply document that fact rather than try to build
something to address it, for this particular case.  If there's push back
on that then I'd suggest we have a "can read dead rows" or some such
capability that can be GRANT'd (in the form of a default role, I would
think) which a user would also have to have in order to get detailed
error reports from this function.

Thanks,

Stephen

Attachment

Re: new heapcheck contrib module

From
Michael Paquier
Date:
On Tue, Aug 25, 2020 at 07:36:53AM -0700, Mark Dilger wrote:
> Removed.

This patch is failing to compile on Windows:
C:\projects\postgresql\src\include\fe_utils/print.h(18): fatal error
  C1083: Cannot open include file: 'libpq-fe.h': No such file or
  directory [C:\projects\postgresql\pg_amcheck.vcxproj]

It looks like you forgot to tweak the scripts in src/tools/msvc/.
--
Michael

Attachment

Re: new heapcheck contrib module

From
Mark Dilger
Date:
Robert, Peter, Andrey, Stephen, and Michael,

Attached is a new version based in part on your review comments, quoted and responded to below as necessary.

There remain a few open issues and/or things I did not implement:

- This version follows Robert's suggestion of using pg_class_aclcheck() to check that the caller has permission to
select from the table being checked.  This is inconsistent with the btree checking logic, which does no such check.
These two approaches should be reconciled, but there was apparently no agreement on this issue.

- The public facing documentation, currently live at https://www.postgresql.org/docs/13/amcheck.html, claims "amcheck
functions may only be used by superusers."  The docs on master still say the same.  This patch replaces that language
with alternate language explaining that execute permissions may be granted to non-superusers, along with a warning about
the risk of data leakage.  Perhaps some portion of that language in this patch should be back-patched?

- Stephen's comments about restricting how much information goes into the returned corruption report depending on the
permissions of the caller have not been implemented.  I may implement some of this if doing so is consistent with
whatever we decide to do for the aclcheck issue above, though probably not.  It seems overly complicated.

- This version does not change clog handling, which leaves Andrey's concern unaddressed.  Peter also showed some
support for (or perhaps just a lack of opposition to) doing more of what Andrey suggests.  I may come back to this
issue, depending on time available and further feedback.


Moving on to Michael's review....

> On Sep 28, 2020, at 10:56 PM, Michael Paquier <michael@paquier.xyz> wrote:
>
> On Tue, Aug 25, 2020 at 07:36:53AM -0700, Mark Dilger wrote:
>> Removed.
>
> This patch is failing to compile on Windows:
> C:\projects\postgresql\src\include\fe_utils/print.h(18): fatal error
>  C1083: Cannot open include file: 'libpq-fe.h': No such file or
>  directory [C:\projects\postgresql\pg_amcheck.vcxproj]
>
> It looks like you forgot to tweak the scripts in src/tools/msvc/.

Fixed, I think.  I have not tested on Windows.


Moving on to Stephen's review....

> On Sep 23, 2020, at 6:46 AM, Stephen Frost <sfrost@snowman.net> wrote:
>
> Greetings,
>
> * Peter Geoghegan (pg@bowt.ie) wrote:
>> On Tue, Sep 22, 2020 at 12:41 PM Robert Haas <robertmhaas@gmail.com> wrote:
>>> But now I see that there's no secondary permission check in the
>>> verify_nbtree.c code. Is that intentional? Peter, what's the
>>> justification for that?
>>
>> As noted by comments in contrib/amcheck/sql/check_btree.sql (the
>> verify_nbtree.c tests), this is intentional. Note that we explicitly
>> test that a non-superuser role can perform verification following
>> GRANT EXECUTE ON FUNCTION ... .
>
>> As I mentioned earlier, this is supported (or at least it is supported
>> in my interpretation of things). It just isn't documented anywhere
>> outside the test itself.
>
> Would certainly be good to document this but I tend to agree with the
> comments that ideally-
>
> a) it'd be nice for a relatively low-privileged user/process could run
>   the tests in an ongoing manner
> b) we don't want to add more is-superuser checks
> c) users shouldn't really be given the ability to see rows they're not
>   supposed to have access to
>
> In other places in the code, when an error is generated and the user
> doesn't have access to the underlying table or doesn't have BYPASSRLS,
> we don't include the details or the actual data in the error.  Perhaps
> that approach would make sense here (or perhaps not, but it doesn't seem
> entirely crazy to me, anyway).  In other words:
>
> a) keep the ability for someone who has EXECUTE on the function to be
>   able to run the function against any relation
> b) when we detect an issue, perform a permissions check to see if the
>   user calling the function has rights to read the rows of the table
>   and, if RLS is enabled on the table, if they have BYPASSRLS
> c) if the user has appropriate privileges, log the detailed error, if
>   not, return a generic error with a HINT that details weren't
>   available due to lack of privileges on the relation
>
> I can appreciate the concerns regarding dead rows ending up being
> visible to someone who wouldn't normally be able to see them but I'd
> argue we could simply document that fact rather than try to build
> something to address it, for this particular case.  If there's push back
> on that then I'd suggest we have a "can read dead rows" or some such
> capability that can be GRANT'd (in the form of a default role, I would
> think) which a user would also have to have in order to get detailed
> error reports from this function.

There wasn't enough agreement on the thread about how this should work, so I left this idea unimplemented.

I'm a bit concerned that restricting the results for non-superusers would create a perverse incentive to use a
superuser role to connect and check tables.  On the other hand, there would not be any difference in the output in the
common case that no corruption exists, so maybe the perverse incentive would not be too significant.

Implementing the idea you outline would complicate the patch a fair amount, as we'd need to tailor all the reports in
this way, and extend the tests to verify we're not leaking any information to non-superusers.  I would prefer to find a
simpler solution.


Moving on to Robert's review....

> On Sep 21, 2020, at 2:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Aug 25, 2020 at 10:36 AM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> Thanks for the review!
>
> +                                                         msg OUT text
> +                                                         )
>
> Looks like atypical formatting.
>
> +REVOKE ALL ON FUNCTION
> +verify_heapam(regclass, boolean, boolean, cstring, bigint, bigint)
> +FROM PUBLIC;
>
> This too.

Changed in this next version.

> +-- Don't want this to be available to public
>
> Add "by default, but superusers can grant access" or so?

Hmm.  I borrowed the verbiage from elsewhere.

contrib/pg_buffercache/pg_buffercache--1.2.sql:-- Don't want these to be available to public.
contrib/pg_freespacemap/pg_freespacemap--1.1.sql:-- Don't want these to be available to public.
contrib/pg_visibility/pg_visibility--1.1.sql:-- Don't want these to be available to public.

> I think there should be a call to pg_class_aclcheck() here, just like
> the one in pg_prewarm, so that if the superuser does choose to grant
> access, users given access can check tables they anyway have
> permission to access, but not others. Maybe put that in
> check_relation_relkind_and_relam() and rename it. Might want to look
> at the pg_surgery precedent, too.

I don't think there are any great options here, but for this next version I've done it with pg_class_aclcheck().
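
For concreteness, the check has roughly this shape (just a sketch, not necessarily the exact code or message in the attached patch; "rel" here stands for the already-opened relation):

    AclResult   aclresult;

    /* Require SELECT privilege on the target relation, as pg_prewarm does. */
    aclresult = pg_class_aclcheck(RelationGetRelid(rel), GetUserId(), ACL_SELECT);
    if (aclresult != ACLCHECK_OK)
        aclcheck_error(aclresult, OBJECT_TABLE,
                       RelationGetRelationName(rel));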

> Oh, and that functions header
> comment is also wrong.

Changed in this next version.

> I think that the way the checks on the block range are performed could
> be improved. Generally, we want to avoid reporting the same problem
> with a variety of different message strings, because it adds burden
> for translators and is potentially confusing for users. You've got two
> message strings that are only going to be used for empty relations and
> a third message string that is only going to be used for non-empty
> relations. What stops you from just ripping off the way that this is
> done in pg_prewarm, which requires only 2 messages? Then you'd be
> adding a net total of 0 new messages instead of 3, and in my view they
> would be clearer than your third message, "block range is out of
> bounds for relation with block count %u: " INT64_FORMAT " .. "
> INT64_FORMAT, which doesn't say very precisely what the problem is,
> and also falls afoul of our usual practice of avoiding the use of
> INT64_FORMAT in error messages that are subject to translation. I
> notice that pg_prewarm just silently does nothing if the start and end
> blocks are swapped, rather than generating an error. We could choose
> to do differently here, but I'm not sure why we should bother.

This next version borrows pg_prewarm's messages as you suggest, except that pg_prewarm embeds INT64_FORMAT in the
message strings, which are replaced with %u in this next patch.  Also, there is no good way to report an invalid block
range for empty tables using these messages, so the patch now just exits early in such a case for invalid ranges
without throwing an error.  This is a little bit non-orthogonal with how invalid block ranges are handled on non-empty
tables, but perhaps that's ok.
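
For reference, the borrowed checks look roughly like this sketch (names and exact wording approximate, with the start and end blocks assumed to arrive as bigint/int64 arguments; the empty-table early exit described above happens before these run):

    BlockNumber nblocks = RelationGetNumberOfBlocks(rel);

    if (startblock < 0 || startblock >= (int64) nblocks)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("starting block number must be between 0 and %u",
                        nblocks - 1)));
    if (endblock < 0 || endblock >= (int64) nblocks)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("ending block number must be between 0 and %u",
                        nblocks - 1)));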

>
> +                       all_frozen = mapbits & VISIBILITYMAP_ALL_VISIBLE;
> +                       all_visible = mapbits & VISIBILITYMAP_ALL_FROZEN;
> +
> +                       if ((all_frozen && skip_option ==
> SKIP_PAGES_ALL_FROZEN) ||
> +                               (all_visible && skip_option ==
> SKIP_PAGES_ALL_VISIBLE))
> +                       {
> +                               continue;
> +                       }
>
> This isn't horrible style, but why not just get rid of the local
> variables? e.g. if (skip_option == SKIP_PAGES_ALL_FROZEN) { if
> ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0) continue; } else { ... }
>
> Typically no braces around a block containing only one line.

Changed in this next version.
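
The restructured form is essentially what you sketched, roughly:

    /* Skip all-frozen or all-visible pages, per the caller's skip option. */
    if (skip_option == SKIP_PAGES_ALL_FROZEN)
    {
        if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0)
            continue;
    }
    else if (skip_option == SKIP_PAGES_ALL_VISIBLE)
    {
        if ((mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0)
            continue;
    }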

> + * table contains corrupt all frozen bits, a concurrent vacuum might skip the
>
> all-frozen?

Changed in this next version.

> + * relfrozenxid beyond xid.) Reporting the xid as valid under such conditions
> + * seems acceptable, since if we had checked it earlier in our scan it would
> + * have truly been valid at that time, and we break no MVCC guarantees by
> + * failing to notice the concurrent change in its status.
>
> I agree with the first half of this sentence, but I don't know what
> MVCC guarantees have to do with anything. I'd just delete the second
> part, or make it a lot clearer.

Changed in this next version to simply omit the MVCC related language.

>
> + * Some kinds of tuple header corruption make it unsafe to check the tuple
> + * attributes, for example when the tuple is foreshortened and such checks
> + * would read beyond the end of the line pointer (and perhaps the page).  In
>
> I think of foreshortening mostly as an art term, though I guess it has
> other meanings. Maybe it would be clearer to say something like "Some
> kinds of corruption make it unsafe to check the tuple attributes, for
> example when the line pointer refers to a range of bytes outside the
> page"?
>
> + * Other kinds of tuple header corruption do not bare on the question of
>
> bear

Changed.

> +                                                 pstrdup(_("updating
> transaction ID marked incompatibly as keys updated and locked
> only")));
> +                                                 pstrdup(_("updating
> transaction ID marked incompatibly as committed and as a
> multitransaction ID")));
>
> "updating transaction ID" might scare somebody who thinks that you are
> telling them that you changed something. That's not what it means, but
> it might not be totally clear. Maybe:
>
> tuple is marked as only locked, but also claims key columns were updated
> multixact should not be marked committed

Changed to use your verbiage.

> +
> psprintf(_("data offset differs from expected: %u vs. %u (1 attribute,
> has nulls)"),
>
> For these, how about:
>
> tuple data should begin at byte %u, but actually begins at byte %u (1
> attribute, has nulls)
> etc.

Is it ok to embed interpolated values into the message string like that?  I thought that made it harder for
translators. I agree that your language is easier to understand, and have used it in this next version of the patch.
Many of your comments that follow raise the same issue, but I'm using your verbiage anyway.

> +
> psprintf(_("old-style VACUUM FULL transaction ID is in the future:
> %u"),
> +
> psprintf(_("old-style VACUUM FULL transaction ID precedes freeze
> threshold: %u"),
> +
> psprintf(_("old-style VACUUM FULL transaction ID is invalid in this
> relation: %u"),
>
> old-style VACUUM FULL transaction ID %u is in the future
> old-style VACUUM FULL transaction ID %u precedes freeze threshold %u
> old-style VACUUM FULL transaction ID %u out of range %u..%u
>
> Doesn't the second of these overlap with the third?

Good point.  If the second one reports, so will the third.  I've changed it to use if/else if logic to avoid that, and
to use your suggested verbiage.

>
> Similarly in other places, e.g.
>
> +
> psprintf(_("inserting transaction ID is in the future: %u"),
>
> I think this should change to: inserting transaction ID %u is in the future

Changed, along with similarly formatted messages.

>
> +       else if (VARATT_IS_SHORT(chunk))
> +               /*
> +                * could happen due to heap_form_tuple doing its thing
> +                */
> +               chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT;
>
> Add braces here, since there are multiple lines.

Changed.

>
> +                                                 psprintf(_("toast
> chunk sequence number not the expected sequence number: %u vs. %u"),
>
> toast chunk sequence number %u does not match expected sequence number %u
>
> There are more instances of this kind of thing.

Changed.

> +
> psprintf(_("toasted attribute has unexpected TOAST tag: %u"),
>
> Remove colon.

Changed.

> +
> psprintf(_("attribute ends at offset beyond total tuple length: %u vs.
> %u (attribute length %u)"),
>
> Let's try to specify the attribute number in the attribute messages
> where we can, e.g.
>
> +
> psprintf(_("attribute ends at offset beyond total tuple length: %u vs.
> %u (attribute length %u)"),
>
> How about: attribute %u with length %u should end at offset %u, but
> the tuple length is only %u

I had omitted the attribute numbers from the attribute corruption messages because attnum is one of the OUT parameters
from verify_heapam.  I'm including attnum in the message text for this next version, as you request.

> +               if (TransactionIdIsNormal(ctx->relfrozenxid) &&
> +                       TransactionIdPrecedes(xmin, ctx->relfrozenxid))
> +               {
> +                       report_corruption(ctx,
> +                                                         /*
> translator: Both %u are transaction IDs. */
> +
> psprintf(_("inserting transaction ID is from before freeze cutoff: %u
> vs. %u"),
> +
>    xmin, ctx->relfrozenxid));
> +                       fatal = true;
> +               }
> +               else if (!xid_valid_in_rel(xmin, ctx))
> +               {
> +                       report_corruption(ctx,
> +                                                         /*
> translator: %u is a transaction ID. */
> +
> psprintf(_("inserting transaction ID is in the future: %u"),
> +
>    xmin));
> +                       fatal = true;
> +               }
>
> This seems like good evidence that xid_valid_in_rel needs some
> rethinking. As far as I can see, every place where you call
> xid_valid_in_rel, you have checks beforehand that duplicate some of
> what it does, so that you can give a more accurate error message.
> That's not good. Either the message should be adjusted so that it
> covers all the cases "e.g. tuple xmin %u is outside acceptable range
> %u..%u" or we should just get rid of xid_valid_in_rel() and have
> separate error messages for each case, e.g. tuple xmin %u precedes
> relfrozenxid %u".

This next version is refactored, removing the function xid_valid_in_rel entirely, and structuring get_xid_status
differently.

> I think it's OK to use terms like xmin and xmax in
> these messages, rather than inserting transaction ID etc. We have
> existing instances of that, and while someone might judge it
> user-unfriendly, I disagree. A person who is qualified to interpret
> this output must know what 'tuple xmin' means immediately, but whether
> they can understand that 'inserting transaction ID' means the same
> thing is questionable, I think.

Done.

> This is not a full review, but in general I think that this is getting
> pretty close to being committable. The error messages seem to still
> need some polishing and I wouldn't be surprised if there are a few
> more bugs lurking yet, but I think it's come a long way.

This next version has some other message rewording.  While testing, I found it odd to report an xid as out of bounds
(in the future, or before the freeze threshold, etc.) without mentioning the xid value against which it is being
compared unfavorably.  We don't normally need to think about the epoch when comparing two xids against each other, as
they must both make sense relative to the current epoch; but for corruption, you can't assume the corrupt xid was
written relative to any particular epoch, and only the 32-bit xid value can be printed since the epoch is unknown.  The
other xid value (freeze threshold, etc.) can be printed with the epoch information, but printing the epoch+xid merely as
xid8out does (in other words, as a UINT64) makes the messages thoroughly confusing.  I went with the equivalent of
sprintf("%u:%u", epoch, xid), which follows the precedent from pg_controldata.c, gistdesc.c, and elsewhere.


Moving on to Peter's reviews....

> On Sep 22, 2020, at 4:18 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Mon, Sep 21, 2020 at 2:09 PM Robert Haas <robertmhaas@gmail.com> wrote:
>> +REVOKE ALL ON FUNCTION
>> +verify_heapam(regclass, boolean, boolean, cstring, bigint, bigint)
>> +FROM PUBLIC;
>>
>> This too.
>
> Do we really want to use a cstring as an enum-like argument?

Perhaps not.  This next version has that as text.

>
> I think that I see a bug at this point in check_tuple() (in
> v15-0001-Adding-function-verify_heapam-to-amcheck-module.patch):
>
>> +   /* If xmax is a multixact, it should be within valid range */
>> +   xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr);
>> +   if ((infomask & HEAP_XMAX_IS_MULTI) && !mxid_valid_in_rel(xmax, ctx))
>> +   {
>
> *** SNIP ***
>
>> +   }
>> +
>> +   /* If xmax is normal, it should be within valid range */
>> +   if (TransactionIdIsNormal(xmax))
>> +   {
>
> Why should it be okay to call TransactionIdIsNormal(xmax) at this
> point? It isn't certain that xmax is an XID at all (could be a
> MultiXactId, since you called HeapTupleHeaderGetRawXmax() to get the
> value in the first place). Don't you need to check "(infomask &
> HEAP_XMAX_IS_MULTI) == 0" here?

I think you are right.  This check you suggest is used in this next version.
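
In outline, the corrected structure looks like this (a sketch using helper names from the v15 patch quoted above; the message wording is illustrative, and the final refactoring differs in detail):

    xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr);
    if (infomask & HEAP_XMAX_IS_MULTI)
    {
        /* The raw xmax is a MultiXactId; validate it as such. */
        if (!mxid_valid_in_rel(xmax, ctx))
            report_corruption(ctx,
                              psprintf("multitransaction ID %u is out of range", xmax));
    }
    else if (TransactionIdIsNormal(xmax))
    {
        /* The raw xmax is an ordinary transaction ID; validate it as such. */
        if (!xid_valid_in_rel(xmax, ctx))
            report_corruption(ctx,
                              psprintf("xmax %u is out of range", xmax));
    }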


> On Sep 22, 2020, at 5:16 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Sat, Aug 29, 2020 at 10:48 AM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> I had an earlier version of the verify_heapam patch that included a non-throwing interface to clog.  Ultimately, I
ripped that out.  My reasoning was that a simpler patch submission was more likely to be acceptable to the community.
>
> Isn't some kind of pragmatic compromise possible?
>
>> But I don't want to make this patch dependent on that hypothetical patch getting written and accepted.
>
> Fair enough, but if you're alluding to what I said then about
> check_tuphdr_xids()/clog checking a while back then FWIW I didn't
> intend to block progress on clog/xact status verification at all.

I don't recall your comments factoring into my thinking on this specific issue, but rather a conversation I had
off-list with Robert.  The clog interface may be a hot enough code path that adding a flag for non-throwing behavior
merely to support a contrib module might be resisted.  If folks generally like such a change to the clog interface, I
could consider adding that as a third patch in this set.

> I
> just don't think that it is sensible to impose an iron clad guarantee
> about having no assertion failures with corrupt clog data -- that
> leads to far too much code duplication. But why should you need to
> provide an absolute guarantee of that?
>
> I for one would be fine with making the clog checks an optional extra,
> that rescinds the no crash guarantee that you're keen on -- just like
> with the TOAST checks that you have already in v15. It might make
> sense to review how often crashes occur with simulated corruption, and
> then to minimize the number of occurrences in the real world. Maybe we
> could tolerate a usually-no-crash interface to clog -- if it could
> still have assertion failures. Making a strong guarantee about
> assertions seems unnecessary.
>
> I don't see how verify_heapam will avoid raising an error during basic
> validation from PageIsVerified(), which will violate the guarantee
> about not throwing errors. I don't see that as a problem myself, but
> presumably you will.

My concern is not so much that verify_heapam will stop with an error, but rather that it might trigger a panic that
stops all backends.  Stopping with an error merely because it hits corruption is not ideal, as I would rather it
completed the scan and reported all corruptions found, but that's minor compared to the damage done if verify_heapam
creates downtime in a production environment offering high availability guarantees.  That statement might seem nuts,
given that the corrupt table itself would be causing downtime, but that analysis depends on assumptions about table
access patterns, and there is no a priori reason to think that corrupt pages are necessarily ever being accessed, or
accessed in a way that causes crashes (rather than merely wrong results) outside verify_heapam scanning the whole table.




—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 5, 2020, at 5:24 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> - This version does not change clog handling, which leaves Andrey's concern unaddressed.  Peter also showed some
support for (or perhaps just a lack of opposition to) doing more of what Andrey suggests.  I may come back to this
issue, depending on time available and further feedback.

Attached is a patch set that includes the clog handling as discussed.  The 0001 and 0002 are effectively unchanged
since version 16 posted yesterday, but this now includes 0003 which creates a non-throwing interface to clog, and 0004
which uses the non-throwing interface from within amcheck's heap checking functions.

I think this is a pretty good sketch for discussion, though I am unsatisfied with the lack of regression test coverage
of verify_heapam in the presence of clog truncation.  I was hoping to have that as part of v17, but since it is taking a
bit longer than I anticipated, I'll have to come back with that in a later patch.




—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Andrey Borodin
Date:

> On Oct 7, 2020, at 04:20, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
>
>
>> On Oct 5, 2020, at 5:24 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>>
>> - This version does not change clog handling, which leaves Andrey's concern unaddressed.  Peter also showed some
support for (or perhaps just a lack of opposition to) doing more of what Andrey suggests.  I may come back to this
issue, depending on time available and further feedback.
>
> Attached is a patch set that includes the clog handling as discussed.  The 0001 and 0002 are effectively unchanged
since version 16 posted yesterday, but this now includes 0003 which creates a non-throwing interface to clog, and 0004
which uses the non-throwing interface from within amcheck's heap checking functions.
>
> I think this is a pretty good sketch for discussion, though I am unsatisfied with the lack of regression test
coverage of verify_heapam in the presence of clog truncation.  I was hoping to have that as part of v17, but since it is
taking a bit longer than I anticipated, I'll have to come back with that in a later patch.
>

Many thanks, Mark! I really appreciate this functionality. It could save me many hours of recreating clogs.

I'm not entirely sure this message is correct: psprintf(_("xmax %u commit status is lost")
It seems to me that this is not a commit status, but rather a transaction status.

Thanks!

Best regards, Andrey Borodin.


Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 6, 2020, at 11:27 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
>
>
>> On Oct 7, 2020, at 04:20, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>>
>>
>>
>>> On Oct 5, 2020, at 5:24 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>>>
>>> - This version does not change clog handling, which leaves Andrey's concern unaddressed.  Peter also showed some
support for (or perhaps just a lack of opposition to) doing more of what Andrey suggests.  I may come back to this
issue, depending on time available and further feedback.
>>
>> Attached is a patch set that includes the clog handling as discussed.  The 0001 and 0002 are effectively unchanged
since version 16 posted yesterday, but this now includes 0003 which creates a non-throwing interface to clog, and 0004
which uses the non-throwing interface from within amcheck's heap checking functions.
>>
>> I think this is a pretty good sketch for discussion, though I am unsatisfied with the lack of regression test
coverage of verify_heapam in the presence of clog truncation.  I was hoping to have that as part of v17, but since it is
taking a bit longer than I anticipated, I'll have to come back with that in a later patch.
>>
>
> Many thanks, Mark! I really appreciate this functionality. It could save me many hours of recreating clogs.

You are quite welcome, though the thanks may be premature.  I posted 0003 and 0004 patches mostly as concrete
implementation examples that can be criticized.

> I'm not entirely sure this message is correct: psprintf(_("xmax %u commit status is lost")
> It seems to me that this is not a commit status, but rather a transaction status.

I have changed several such messages to say "transaction status" rather than "commit status".  I'll be posting it in a
separate email shortly.

Thanks for reviewing!

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 5, 2020, at 5:24 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> There remain a few open issues and/or things I did not implement:
>
> - This version follows Robert's suggestion of using pg_class_aclcheck() to check that the caller has permission to
select from the table being checked.  This is inconsistent with the btree checking logic, which does no such check.
These two approaches should be reconciled, but there was apparently no agreement on this issue.

This next version, attached, has the acl checking and associated documentation changes split out into patch 0005,
making it easier to review in isolation from the rest of the patch series.

Independently of acl considerations, this version also has some verbiage changes in 0004, in response to Andrey's
review upthread.





—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Mon, Oct 5, 2020 at 5:24 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> > I don't see how verify_heapam will avoid raising an error during basic
> > validation from PageIsVerified(), which will violate the guarantee
> > about not throwing errors. I don't see that as a problem myself, but
> > presumably you will.
>
> My concern is not so much that verify_heapam will stop with an error, but rather that it might trigger a panic that
stops all backends.  Stopping with an error merely because it hits corruption is not ideal, as I would rather it
completed the scan and reported all corruptions found, but that's minor compared to the damage done if verify_heapam
creates downtime in a production environment offering high availability guarantees.  That statement might seem nuts,
given that the corrupt table itself would be causing downtime, but that analysis depends on assumptions about table
access patterns, and there is no a priori reason to think that corrupt pages are necessarily ever being accessed, or
accessed in a way that causes crashes (rather than merely wrong results) outside verify_heapam scanning the whole table.

That seems reasonable to me. I think that it makes sense to never take
down the server in a non-debug build with verify_heapam. That's not
what I took away from your previous remarks on the issue, but perhaps
it doesn't matter now.

--
Peter Geoghegan



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Wed, Oct 7, 2020 at 9:01 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> This next version, attached, has the acl checking and associated documentation changes split out into patch 0005,
making it easier to review in isolation from the rest of the patch series.
 
>
> Independently of acl considerations, this version also has some verbiage changes in 0004, in response to Andrey's
review upthread.
 

I was about to commit 0001, after making some cosmetic changes, when I
discovered that it won't link for me. I think there must be something
wrong with the NLS stuff. My version of 0001 is attached. The error I
got is:

ccache clang -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Werror=vla -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -Wno-unused-command-line-argument -g -O2 -Wall -Werror
-fno-omit-frame-pointer  -bundle -multiply_defined suppress -o
amcheck.so  verify_heapam.o verify_nbtree.o -L../../src/port
-L../../src/common   -L/opt/local/lib -L/opt/local/lib
-L/opt/local/lib -L/opt/local/lib  -L/opt/local/lib
-Wl,-dead_strip_dylibs  -Wall -Werror -fno-omit-frame-pointer
-bundle_loader ../../src/backend/postgres
Undefined symbols for architecture x86_64:
  "_libintl_gettext", referenced from:
      _verify_heapam in verify_heapam.o
      _check_tuple in verify_heapam.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [amcheck.so] Error 1

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment

Re: new heapcheck contrib module

From
Alvaro Herrera
Date:
On 2020-Oct-21, Robert Haas wrote:

> On Wed, Oct 7, 2020 at 9:01 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> > This next version, attached, has the acl checking and associated documentation changes split out into patch 0005,
making it easier to review in isolation from the rest of the patch series.
 
> >
> > Independently of acl considerations, this version also has some verbiage changes in 0004, in response to Andrey's
review upthread.
 
> 
> I was about to commit 0001, after making some cosmetic changes, when I
> discovered that it won't link for me. I think there must be something
> wrong with the NLS stuff. My version of 0001 is attached. The error I
> got is:

Hmm ... I don't think we have translation support in contrib, do we?  I
think you could solve that by adding a "#undef _, #define _(...) (...)"
or similar at the top of the offending C files, assuming you don't want
to rip out all use of _() there.
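
Concretely, something like this near the top of verify_heapam.c (just a sketch of that idea):

    /* contrib has no NLS support, so make the gettext wrapper a no-op here. */
    #undef _
    #define _(x) (x)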

TBH the usage of "translation:" comments in this patch seems
over-enthusiastic to me.




Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 21, 2020, at 1:13 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>
> On 2020-Oct-21, Robert Haas wrote:
>
>> On Wed, Oct 7, 2020 at 9:01 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>>> This next version, attached, has the acl checking and associated documentation changes split out into patch 0005,
making it easier to review in isolation from the rest of the patch series.
>>>
>>> Independently of acl considerations, this version also has some verbiage changes in 0004, in response to Andrey's
review upthread.
>>
>> I was about to commit 0001, after making some cosmetic changes, when I
>> discovered that it won't link for me. I think there must be something
>> wrong with the NLS stuff. My version of 0001 is attached. The error I
>> got is:
>
> Hmm ... I don't think we have translation support in contrib, do we?  I
> think you could solve that by adding a "#undef _, #define _(...) (...)"
> or similar at the top of the offending C files, assuming you don't want
> to rip out all use of _() there.

There is still something screwy here, though, as this compiles, links, and runs fine for me on Mac and Linux, but not
for Robert.

On Mac, I'm using the toolchain from Xcode, whereas Robert is using MacPorts.

Mine reports:

Apple clang version 11.0.0 (clang-1100.0.33.17)
Target: x86_64-apple-darwin19.6.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin

Robert's reports:

clang version 5.0.2 (tags/RELEASE_502/final)
Target: x86_64-apple-darwin19.4.0
Thread model: posix
InstalledDir: /opt/local/libexec/llvm-5.0/bin

On linux, I'm using gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)

Searching around on the web, there are various reports of MacPorts' clang not linking libintl correctly, though I don't
know if that is a real problem with MacPorts or just a few cases of user error.  Has anybody else following this thread
had issues with MacPorts' version of clang vis-a-vis linking libintl's gettext?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I was about to commit 0001, after making some cosmetic changes, when I
> discovered that it won't link for me. I think there must be something
> wrong with the NLS stuff. My version of 0001 is attached. The error I
> got is:

Well, the short answer would be "you need to add

SHLIB_LINK += $(filter -lintl, $(LIBS))

to the Makefile".  However, I would vote against that, because in point
of fact amcheck has no translation support, just like all our other
contrib modules.  What should likely happen instead is to rip out
whatever code is overoptimistically expecting it needs to support
translation.

            regards, tom lane



Re: new heapcheck contrib module

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> There is still something screwy here, though, as this compiles, links, and runs fine for me on Mac and Linux, but not
for Robert.

Are you using --enable-nls at all on your Mac build?  Because for sure it
should not work there, given the failure to include -lintl in amcheck's
link step.  Some platforms are forgiving of that, but not Mac.

            regards, tom lane



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 21, 2020, at 1:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> There is still something screwy here, though, as this compiles, links, and runs fine for me on Mac and Linux, but not
for Robert.
>
> Are you using --enable-nls at all on your Mac build?  Because for sure it
> should not work there, given the failure to include -lintl in amcheck's
> link step.  Some platforms are forgiving of that, but not Mac.

Thanks, Tom!

No, that's the answer.  I had a typo/thinko in my configure options, --with-nls instead of --enable-nls, and the
warning about it being an invalid flag went by so fast I didn't see it.  I had it spelled correctly on Linux, but I
guess that's one of the platforms that is more forgiving.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 21, 2020, at 1:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
>> I was about to commit 0001, after making some cosmetic changes, when I
>> discovered that it won't link for me. I think there must be something
>> wrong with the NLS stuff. My version of 0001 is attached. The error I
>> got is:
>
> Well, the short answer would be "you need to add
>
> SHLIB_LINK += $(filter -lintl, $(LIBS))
>
> to the Makefile".  However, I would vote against that, because in point
> of fact amcheck has no translation support, just like all our other
> contrib modules.  What should likely happen instead is to rip out
> whatever code is overoptimistically expecting it needs to support
> translation.

Done that way in the attached, which also includes Robert's changes from v19 that he posted earlier today.





—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Robert Haas
Date:
On Wed, Oct 21, 2020 at 11:45 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Done that way in the attached, which also include Robert's changes from v19 he posted earlier today.

Committed. Let's see what the buildfarm thinks.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Thu, Oct 22, 2020 at 8:51 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Committed. Let's see what the buildfarm thinks.

It is mostly happy, but thorntail is not:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2020-10-22%2012%3A58%3A11

I thought that the problem might be related to the fact that thorntail
is using force_parallel_mode, but I tried that here and it did not
cause a failure. So my next guess is that it is related to the fact
that this is a sparc64 machine, but it's hard to tell, since none of
the other sparc64 critters have run yet. In any case I don't know why
that would cause a failure. The messages in the log aren't very
illuminating, unfortunately. :-(

Mark, any ideas what might cause specifically that set of tests to fail?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> The messages in the log aren't very
> illuminating, unfortunately. :-(

Considering this is a TAP test, why in the world is it designed to hide
all details of any unexpected amcheck messages?  Surely being able to
see what amcheck is saying would be helpful here.

IOW, don't have the tests abbreviate the module output with count(*),
but return the full thing, and then use a regex to see if you got what
was expected.  If you didn't, the output will show what you did get.

            regards, tom lane



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Thu, Oct 22, 2020 at 10:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Considering this is a TAP test, why in the world is it designed to hide
> all details of any unexpected amcheck messages?  Surely being able to
> see what amcheck is saying would be helpful here.
>
> IOW, don't have the tests abbreviate the module output with count(*),
> but return the full thing, and then use a regex to see if you got what
> was expected.  If you didn't, the output will show what you did get.

Yeah, that thought crossed my mind, too. But I'm not sure it would
help in the case of this particular failure, because I think the
problem is that we're expecting to get complaints and instead getting
none.

It might be good to change it anyway, though.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Tom Lane
Date:
lapwing just spit up a possibly relevant issue:

ccache gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla
-Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g
-O2 -Werror -fPIC -I. -I. -I../../src/include  -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS -D_GNU_SOURCE
-I/usr/include/libxml2 -I/usr/include/et  -c -o verify_heapam.o verify_heapam.c 
verify_heapam.c: In function 'get_xid_status':
verify_heapam.c:1432:5: error: 'fxid.value' may be used uninitialized in this function [-Werror=maybe-uninitialized]
cc1: all warnings being treated as errors



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 22, 2020, at 7:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Oct 22, 2020 at 8:51 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> Committed. Let's see what the buildfarm thinks.
>
> It is mostly happy, but thorntail is not:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2020-10-22%2012%3A58%3A11
>
> I thought that the problem might be related to the fact that thorntail
> is using force_parallel_mode, but I tried that here and it did not
> cause a failure. So my next guess is that it is related to the fact
> that this is a sparc64 machine, but it's hard to tell, since none of
> the other sparc64 critters have run yet. In any case I don't know why
> that would cause a failure. The messages in the log aren't very
> illuminating, unfortunately. :-(
>
> Mark, any ideas what might cause specifically that set of tests to fail?

The code is correctly handling an uncorrupted table, but then more or less randomly failing some of the time when
processing a corrupt table.

Tom identified a problem with an uninitialized variable.  I'm putting together a new patch set to address it.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 22, 2020, at 9:01 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
>
>
>> On Oct 22, 2020, at 7:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Oct 22, 2020 at 8:51 AM Robert Haas <robertmhaas@gmail.com> wrote:
>>> Committed. Let's see what the buildfarm thinks.
>>
>> It is mostly happy, but thorntail is not:
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2020-10-22%2012%3A58%3A11
>>
>> I thought that the problem might be related to the fact that thorntail
>> is using force_parallel_mode, but I tried that here and it did not
>> cause a failure. So my next guess is that it is related to the fact
>> that this is a sparc64 machine, but it's hard to tell, since none of
>> the other sparc64 critters have run yet. In any case I don't know why
>> that would cause a failure. The messages in the log aren't very
>> illuminating, unfortunately. :-(
>>
>> Mark, any ideas what might cause specifically that set of tests to fail?
>
> The code is correctly handling an uncorrupted table, but then more or less randomly failing some of the time when
processing a corrupt table.
>
> Tom identified a problem with an uninitialized variable.  I'm putting together a new patch set to address it.

The 0001 attached patch addresses the -Werror=maybe-uninitialized problem.

The 0002 attached patch addresses the test failures:

The failing test is designed to stop the server, inflict blunt force trauma on the heap and toast files by
overwriting garbage bytes, restart the server, and verify that corruption is detected by amcheck's verify_heapam().  The
exact trauma is intended to be the same on all platforms, in terms of the number of bytes written and the location in
the file where they are written, but owing to differences between platforms, by design the test does not expect a
particular corruption message.

The test was overwriting far fewer bytes than I had intended, but since it was still sufficient to create corruption on
the platforms where I tested, I failed to notice.  It should do a more thorough job now.



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Robert Haas
Date:
On Thu, Oct 22, 2020 at 3:15 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> The 0001 attached patch addresses the -Werror=maybe-uninitialized problem.

I am skeptical. Why so much code churn to fix a compiler warning? And
even in the revised code, *status isn't set in all cases, so I don't
see why this would satisfy the compiler. Even if it satisfies this
particular compiler for some other reason, some other compiler is
bound to be unhappy sometime. It's better to just arrange to set
*status always, and use a dummy value in cases where it doesn't
matter. Also, "return XID_BOUNDS_OK;;" has exceeded its recommended
allowance of semicolons.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 22, 2020, at 1:00 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Oct 22, 2020 at 3:15 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> The 0001 attached patch addresses the -Werror=maybe-uninitialized problem.
>
> I am skeptical. Why so much code churn to fix a compiler warning? And
> even in the revised code, *status isn't set in all cases, so I don't
> see why this would satisfy the compiler. Even if it satisfies this
> particular compiler for some other reason, some other compiler is
> bound to be unhappy sometime. It's better to just arrange to set
> *status always, and use a dummy value in cases where it doesn't
> matter. Also, "return XID_BOUNDS_OK;;" has exceeded its recommended
> allowance of semicolons.

I think the compiler warning was about fxid not being set.  The callers pass NULL for status if they don't want status
checked, so writing *status unconditionally would be an error.  Also, if the xid being checked is out of bounds, we
can't check the status of the xid in clog.
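
So any write of *status has to stay behind a guard, along these lines (a sketch; the status value shown is only illustrative):

    /* Only report transaction status when the caller asked for it. */
    if (status != NULL)
        *status = XID_COMMITTED;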

As for the code churn, I probably refactored it a bit more than I needed to in order to fix the compiler warning about fxid, but
that was because the old arrangement seemed to make it harder to reason about when and where fxid got set.  I think that
is clearer now.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Tom Lane
Date:
ooh, looks like prairiedog sees the problem too.  That means I should be
able to reproduce it under a debugger, if you're not certain yet where
the problem lies.

            regards, tom lane



Re: new heapcheck contrib module

From
Tom Lane
Date:
... btw, having now looked more closely at get_xid_status(), I wonder
how come there aren't more compilers bitching about it, because it
is very very obviously broken.  In particular, the case of
requesting status for an xid that is BootstrapTransactionId or
FrozenTransactionId *will* fall through to perform
FullTransactionIdPrecedesOrEquals with an uninitialized fxid.

The fact that most compilers seem to fail to notice that is quite scary.
I suppose it has something to do with FullTransactionId being a struct,
which makes me wonder if that choice was quite as wise as we thought.

Meanwhile, so far as this code goes, I wonder why you don't just change it
to always set that value, ie

    XidBoundsViolation result;
    FullTransactionId fxid;
    FullTransactionId clog_horizon;

+    fxid = FullTransactionIdFromXidAndCtx(xid, ctx);
+
    /* Quick check for special xids */
    if (!TransactionIdIsValid(xid))
        result = XID_INVALID;
    else if (xid == BootstrapTransactionId || xid == FrozenTransactionId)
        result = XID_BOUNDS_OK;
    else
    {
        /* Check if the xid is within bounds */
-        fxid = FullTransactionIdFromXidAndCtx(xid, ctx);
        if (!fxid_in_cached_range(fxid, ctx))
        {


            regards, tom lane



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 22, 2020, at 1:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> ooh, looks like prairiedog sees the problem too.  That means I should be
> able to reproduce it under a debugger, if you're not certain yet where
> the problem lies.

Thanks, Tom, but I question whether the regression test failures are from a problem in the verify_heapam.c code.  I
think they come from a busted Perl test.  The test was supposed to corrupt the heap by overwriting a heap file with a large
chunk of garbage, but in fact only wrote a small amount of garbage.  The idea was to write about 2000 bytes starting at
offset 32 in the page, in order to corrupt the line pointers, but owing to my incorrect use of syswrite in the Perl
test, that didn't happen.

I think the uninitialized variable warning points to a real problem in the C code, but I have no reason to think
that particular problem is causing this particular regression test failure.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 22, 2020, at 1:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> ... btw, having now looked more closely at get_xid_status(), I wonder
> how come there aren't more compilers bitching about it, because it
> is very very obviously broken.  In particular, the case of
> requesting status for an xid that is BootstrapTransactionId or
> FrozenTransactionId *will* fall through to perform
> FullTransactionIdPrecedesOrEquals with an uninitialized fxid.
>
> The fact that most compilers seem to fail to notice that is quite scary.
> I suppose it has something to do with FullTransactionId being a struct,
> which makes me wonder if that choice was quite as wise as we thought.
>
> Meanwhile, so far as this code goes, I wonder why you don't just change it
> to always set that value, ie
>
>     XidBoundsViolation result;
>     FullTransactionId fxid;
>     FullTransactionId clog_horizon;
>
> +    fxid = FullTransactionIdFromXidAndCtx(xid, ctx);
> +
>     /* Quick check for special xids */
>     if (!TransactionIdIsValid(xid))
>         result = XID_INVALID;
>     else if (xid == BootstrapTransactionId || xid == FrozenTransactionId)
>         result = XID_BOUNDS_OK;
>     else
>     {
>         /* Check if the xid is within bounds */
> -        fxid = FullTransactionIdFromXidAndCtx(xid, ctx);
>         if (!fxid_in_cached_range(fxid, ctx))
>         {

Yeah, I reached the same conclusion before submitting the fix upthread.  I structured it a bit differently, but I
believe fxid will now always get set before being used, though sometimes the function returns before doing either.

I had the same thought about compilers not catching that, too.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> On Oct 22, 2020, at 1:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> ooh, looks like prairiedog sees the problem too.  That means I should be
>> able to reproduce it under a debugger, if you're not certain yet where
>> the problem lies.

> Thanks, Tom, but I question whether the regression test failures are from a problem in the verify_heapam.c code.  I
think they come from a busted Perl test.  The test was supposed to corrupt the heap by overwriting a heap file with a large
chunk of garbage, but in fact only wrote a small amount of garbage.  The idea was to write about 2000 bytes starting at
offset 32 in the page, in order to corrupt the line pointers, but owing to my incorrect use of syswrite in the Perl
test, that didn't happen.

Hm, but why are we seeing the failure only on specific machine
architectures?  sparc64 and ppc32 is a weird pairing, too.

            regards, tom lane



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 22, 2020, at 1:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>>> On Oct 22, 2020, at 1:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> ooh, looks like prairiedog sees the problem too.  That means I should be
>>> able to reproduce it under a debugger, if you're not certain yet where
>>> the problem lies.
>
>> Thanks, Tom, but I question whether the regression test failures are from a problem in the verify_heapam.c code.  I
think they are a busted perl test.  The test was supposed to corrupt the heap by overwriting a heap file with a large
chunk of garbage, but in fact only wrote a small amount of garbage.  The idea was to write about 2000 bytes starting at
offset 32 in the page, in order to corrupt the line pointers, but owing to my incorrect use of syswrite in the perl
test, that didn't happen.
>
> Hm, but why are we seeing the failure only on specific machine
> architectures?  sparc64 and ppc32 is a weird pairing, too.

It is seeking to position 32 and writing '\x77\x77\x77\x77'.  x86_64 is little-endian, and ppc32 and sparc64 are both
big-endian, right?
—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> On Oct 22, 2020, at 1:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Hm, but why are we seeing the failure only on specific machine
>> architectures?  sparc64 and ppc32 is a weird pairing, too.

> It is seeking to position 32 and writing '\x77\x77\x77\x77'.  x86_64 is
> little-endian, and ppc32 and sparc64 are both big-endian, right?

They are, but that should not meaningfully affect the results of
that corruption step.  You zapped only one line pointer not
several, but it would look the same regardless of endianness.

I find it more plausible that we might see the bad effects of
the uninitialized variable only on those arches --- but that
theory is still pretty shaky, since you'd think compiler
choices about register or stack-location assignment would
be the controlling factor, and those should be all over the
map.

            regards, tom lane



Re: new heapcheck contrib module

From
Tom Lane
Date:
I wrote:
> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> It is seeking to position 32 and writing '\x77\x77\x77\x77'.  x86_64 is
>> little-endian, and ppc32 and sparc64 are both big-endian, right?

> They are, but that should not meaningfully affect the results of
> that corruption step.  You zapped only one line pointer not
> several, but it would look the same regardless of endianness.

Oh, wait a second.  ItemIdData has the flag bits in the middle:

typedef struct ItemIdData
{
    unsigned    lp_off:15,        /* offset to tuple (from start of page) */
                lp_flags:2,       /* state of line pointer, see below */
                lp_len:15;        /* byte length of tuple */
} ItemIdData;

meaning that for that particular bit pattern, one endianness
is going to see the flags as 01 (LP_NORMAL) and the other as 10
(LP_REDIRECT).  The offset/len are corrupt either way, but
I'd certainly expect that amcheck would produce different
complaints about those two cases.  So it's unsurprising if
this test case's output is endian-dependent.

            regards, tom lane



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 22, 2020, at 2:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
>> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>>> It is seeking to position 32 and writing '\x77\x77\x77\x77'.  x86_64 is
>>> little-endian, and ppc32 and sparc64 are both big-endian, right?
>
>> They are, but that should not meaningfully affect the results of
>> that corruption step.  You zapped only one line pointer not
>> several, but it would look the same regardless of endianness.
>
> Oh, wait a second.  ItemIdData has the flag bits in the middle:
>
> typedef struct ItemIdData
> {
>    unsigned    lp_off:15,        /* offset to tuple (from start of page) */
>                lp_flags:2,       /* state of line pointer, see below */
>                lp_len:15;        /* byte length of tuple */
> } ItemIdData;
>
> meaning that for that particular bit pattern, one endianness
> is going to see the flags as 01 (LP_NORMAL) and the other as 10
> (LP_REDIRECT).  The offset/len are corrupt either way, but
> I'd certainly expect that amcheck would produce different
> complaints about those two cases.  So it's unsurprising if
> this test case's output is endian-dependent.

Yeah, I'm already looking at that.  The logic in verify_heapam skips over line pointers that are unused or dead, and
the test is reporting zero corruption (and complaining about that), so it's probably not going to help to overwrite all
the line pointers with this particular bit pattern any more than to just overwrite the first one, as it would just skip
them all.

I think the test should overwrite the line pointers with a variety of different bit patterns, or one calculated to work
on all platforms.  I'll have to write that up.


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 22, 2020, at 2:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
>> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>>> It is seeking to position 32 and writing '\x77\x77\x77\x77'.  x86_64 is
>>> little-endian, and ppc32 and sparc64 are both big-endian, right?
>
>> They are, but that should not meaningfully affect the results of
>> that corruption step.  You zapped only one line pointer not
>> several, but it would look the same regardless of endianness.
>
> Oh, wait a second.  ItemIdData has the flag bits in the middle:
>
> typedef struct ItemIdData
> {
>    unsigned    lp_off:15,        /* offset to tuple (from start of page) */
>                lp_flags:2,       /* state of line pointer, see below */
>                lp_len:15;        /* byte length of tuple */
> } ItemIdData;
>
> meaning that for that particular bit pattern, one endianness
> is going to see the flags as 01 (LP_NORMAL) and the other as 10
> (LP_REDIRECT).  The offset/len are corrupt either way, but
> I'd certainly expect that amcheck would produce different
> complaints about those two cases.  So it's unsurprising if
> this test case's output is endian-dependent.

Well, the issue is that on big-endian machines it is not reporting any corruption at all.  Are you sure the difference
will be LP_NORMAL vs LP_REDIRECT?  I was thinking it was LP_DEAD vs LP_REDIRECT, as the little endian platforms are
seeing corruption messages about bad redirect line pointers, and the big-endian are apparently skipping over the line
pointer entirely, which makes sense if it is LP_DEAD but not if it is LP_NORMAL.  It would also skip over LP_UNUSED, but
I don't see how that could be stored in lp_flags, because 0x77 is going to either be 01110111 or 11101110, and in
neither case do you get two zeros adjacent, but you could get two ones adjacent.  (LP_UNUSED = binary 00 and LP_DEAD =
binary 11)

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> Yeah, I'm already looking at that.  The logic in verify_heapam skips over line pointers that are unused or dead, and
the test is reporting zero corruption (and complaining about that), so it's probably not going to help to overwrite all
the line pointers with this particular bit pattern any more than to just overwrite the first one, as it would just skip
them all.

> I think the test should overwrite the line pointers with a variety of different bit patterns, or one calculated to
work on all platforms.  I'll have to write that up.

What we need here is to produce the same test results on either
endianness.  So probably the thing to do is apply the equivalent
of ntohl() to produce a string that looks right for either host
endianness.  As a separate matter, you'd want to test corruption
producing any of the four flag bitpatterns, probably.

It says here you can use Perl's pack/unpack functions to get
the equivalent of ntohl(), but I've not troubled to work out how.
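
(A minimal Perl sketch of that pack/unpack equivalence, assuming nothing beyond core Perl; this is an illustration, not the committed test:)

use strict;
use warnings;

# pack('N') always lays a 32-bit value out in big-endian (network) order;
# unpack('L') reinterprets those same bytes in the host's native order.
# Together they behave like ntohl(): a no-op on big-endian hosts and a
# byte swap on little-endian ones.
my $value      = 0x77777777;
my $net_bytes  = pack('N', $value);        # four bytes, big-endian
my $host_value = unpack('L', $net_bytes);  # == ntohl($value) on this host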

            regards, tom lane



Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Thu, Oct 22, 2020 at 5:51 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Committed. Let's see what the buildfarm thinks.

This is great work. Thanks Mark and Robert.

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 22, 2020, at 2:26 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Oct 22, 2020 at 5:51 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> Committed. Let's see what the buildfarm thinks.
>
> This is great work. Thanks Mark and Robert.

That's the first time I've laughed today.  Having turned the build-farm red, this is quite ironic feedback!  Thanks all
the same for the sentiment.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Thu, Oct 22, 2020 at 2:39 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> > This is great work. Thanks Mark and Robert.
>
> That's the first time I've laughed today.  Having turned the build-farm red, this is quite ironic feedback!  Thanks
all the same for the sentiment.
 

Breaking the buildfarm is not a capital offense. Especially when it
happens with patches that are in some sense low level and/or novel,
and therefore inherently more likely to cause trouble.

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Thu, Oct 22, 2020 at 4:04 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> I think the compiler warning was about fxid not being set.  The callers pass NULL for status if they don't want
status checked, so writing *status unconditionally would be an error.  Also, if the xid being checked is out of bounds,
we can't check the status of the xid in clog.

Sorry, you're (partly) right. The new logic makes it a lot more clear
that we never use that value uninitialized.

I'll remove the extra semi-colon and commit this.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> On Oct 22, 2020, at 2:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Oh, wait a second.  ItemIdData has the flag bits in the middle:
>> meaning that for that particular bit pattern, one endianness
>> is going to see the flags as 01 (LP_NORMAL) and the other as 10
>> (LP_REDIRECT).

> Well, the issue is that on big-endian machines it is not reporting any
> corruption at all.  Are you sure the difference will be LP_NORMAL vs
> LP_REDIRECT?

[ thinks a bit harder... ]  Probably not.  The byte/bit string looks
the same either way, given that it's four repetitions of the same
byte value.  But which field is which will differ: we have either

    oooooooooooooooFFlllllllllllllll
    01110111011101110111011101110111

or

    lllllllllllllllFFooooooooooooooo
    01110111011101110111011101110111

So now I think this is a REDIRECT on either architecture, but the
offset and length fields have different values, causing the redirect
pointer to point to different places.  Maybe it happens to point
at a DEAD tuple in the big-endian case.

            regards, tom lane



Re: new heapcheck contrib module

From
Tom Lane
Date:
I wrote:
> So now I think this is a REDIRECT on either architecture, but the
> offset and length fields have different values, causing the redirect
> pointer to point to different places.  Maybe it happens to point
> at a DEAD tuple in the big-endian case.

Just to make sure, I tried this test program:

#include <stdio.h>
#include <string.h>

typedef struct ItemIdData
{
    unsigned    lp_off:15,      /* offset to tuple (from start of page) */
                lp_flags:2,     /* state of line pointer, see below */
                lp_len:15;      /* byte length of tuple */
} ItemIdData;

int main()
{
    ItemIdData lp;

    memset(&lp, 0x77, sizeof(lp));
    printf("off = %x, flags = %x, len = %x\n",
           lp.lp_off, lp.lp_flags, lp.lp_len);
    return 0;
}

I get

off = 7777, flags = 2, len = 3bbb

on a little-endian machine, and

off = 3bbb, flags = 2, len = 7777

on big-endian.  It'd be less symmetric if the bytes weren't
all the same ...

            regards, tom lane



Re: new heapcheck contrib module

From
Tom Lane
Date:
I wrote:
> I get
> off = 7777, flags = 2, len = 3bbb
> on a little-endian machine, and
> off = 3bbb, flags = 2, len = 7777
> on big-endian.  It'd be less symmetric if the bytes weren't
> all the same ...

... but given that this is the test value we are using, why
don't both endiannesses whine about a non-maxalign'd offset?
The code really shouldn't even be trying to follow these
redirects, because we risk SIGBUS on picky architectures.

            regards, tom lane



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 22, 2020, at 6:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
>> So now I think this is a REDIRECT on either architecture, but the
>> offset and length fields have different values, causing the redirect
>> pointer to point to different places.  Maybe it happens to point
>> at a DEAD tuple in the big-endian case.
>
> Just to make sure, I tried this test program:
>
> #include <stdio.h>
> #include <string.h>
>
> typedef struct ItemIdData
> {
>    unsigned    lp_off:15,      /* offset to tuple (from start of page) */
>                lp_flags:2,     /* state of line pointer, see below */
>                lp_len:15;      /* byte length of tuple */
> } ItemIdData;
>
> int main()
> {
>    ItemIdData lp;
>
>    memset(&lp, 0x77, sizeof(lp));
>    printf("off = %x, flags = %x, len = %x\n",
>           lp.lp_off, lp.lp_flags, lp.lp_len);
>    return 0;
> }
>
> I get
>
> off = 7777, flags = 2, len = 3bbb
>
> on a little-endian machine, and
>
> off = 3bbb, flags = 2, len = 7777
>
> on big-endian.  It'd be less symmetric if the bytes weren't
> all the same ...

I think we're going in the wrong direction here.  The idea behind this test was to have as little knowledge about the
layout of pages as possible and still verify that damaging the pages would result in corruption reports.  Of course, not
all damage will result in corruption reports, because some damage looks legit.  I think it was just luck (good or bad
depending on your perspective) that the damage in the test as committed works on little-endian but not big-endian.

I can embed this knowledge that you have researched into the test if you want me to, but my instinct is to go the other
direction and have even less knowledge about pages in the test.  That would work if, instead of expecting corruption
every time the test writes the file, we just make sure that it gets corruption reports at least some of the times that
it does so.  That seems more maintainable long term.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 22, 2020, at 6:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
>> I get
>> off = 7777, flags = 2, len = 3bbb
>> on a little-endian machine, and
>> off = 3bbb, flags = 2, len = 7777
>> on big-endian.  It'd be less symmetric if the bytes weren't
>> all the same ...
>
> ... but given that this is the test value we are using, why
> don't both endiannesses whine about a non-maxalign'd offset?
> The code really shouldn't even be trying to follow these
> redirects, because we risk SIGBUS on picky architectures.

Ahh, crud.  It's because

    syswrite($fh, '\x77\x77\x77\x77', 500)

is wrong twice.  The 500 was wrong, but the string there isn't the bit pattern we want -- it's just a string literal
with backslashes and such.  It should have been double-quoted.
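
A corrected call might look like the sketch below (an illustration of that fix, not the committed test): double quotes make the \x77 escapes real bytes, and dropping the length argument writes exactly those four bytes.

# Seek to the first line pointer and overwrite it with the intended pattern.
sysseek($fh, 32, 0)
    or die "sysseek failed: $!";
syswrite($fh, "\x77\x77\x77\x77")
    or die "syswrite failed: $!";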

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 22, 2020, at 6:50 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
>
>
>> On Oct 22, 2020, at 6:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> I wrote:
>>> I get
>>> off = 7777, flags = 2, len = 3bbb
>>> on a little-endian machine, and
>>> off = 3bbb, flags = 2, len = 7777
>>> on big-endian.  It'd be less symmetric if the bytes weren't
>>> all the same ...
>>
>> ... but given that this is the test value we are using, why
>> don't both endiannesses whine about a non-maxalign'd offset?
>> The code really shouldn't even be trying to follow these
>> redirects, because we risk SIGBUS on picky architectures.
>
> Ahh, crud.  It's because
>
>     syswrite($fh, '\x77\x77\x77\x77', 500)
>
> is wrong twice.  The 500 was wrong, but the string there isn't the bit pattern we want -- it's just a string literal
with backslashes and such.  It should have been double-quoted.

The reason this never came up in testing is what I was talking about elsewhere -- this test isn't designed to create
*specific* corruptions.  It's just supposed to corrupt the table in some random way.  For whatever reasons I'm not too
curious about, that string corrupts on little endian machines but not big endian machines.  If we want to have a test
that tailors very specific corruptions, I don't think the way to get there is by debugging this test.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> Ahh, crud.  It's because
>     syswrite($fh, '\x77\x77\x77\x77', 500)
> is wrong twice.  The 500 was wrong, but the string there isn't the bit pattern we want -- it's just a string literal
with backslashes and such.  It should have been double-quoted.

Argh.  So we really have, using same test except

    memcpy(&lp, "\\x77", sizeof(lp));

little endian:    off = 785c, flags = 2, len = 1b9b
big endian:    off = 2e3c, flags = 0, len = 3737

which explains the apparent LP_DEAD result.

I'm not particularly on board with your suggestion of "well, if it works
sometimes then it's okay".  Then we have no idea of what we really tested.

            regards, tom lane



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 22, 2020, at 7:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> Ahh, crud.  It's because
>>     syswrite($fh, '\x77\x77\x77\x77', 500)
>> is wrong twice.  The 500 was wrong, but the string there isn't the bit pattern we want -- it's just a string literal
with backslashes and such.  It should have been double-quoted.
>
> Argh.  So we really have, using same test except
>
>     memcpy(&lp, "\\x77", sizeof(lp));
>
> little endian:    off = 785c, flags = 2, len = 1b9b
> big endian:    off = 2e3c, flags = 0, len = 3737
>
> which explains the apparent LP_DEAD result.
>
> I'm not particularly on board with your suggestion of "well, if it works
> sometimes then it's okay".  Then we have no idea of what we really tested.
>
>             regards, tom lane

Ok, I've pruned it down to something you may like better.  Instead of just checking that *some* corruption occurs, it
checks the returned corruption against an expected regex, and if it fails to match, you should see in the logs what you
got vs. what you expected.

It only corrupts the first two line pointers, the first one with 0x77777777 and the second one with 0xAAAAAAAA, which
are consciously chosen to be bitwise reverses of each other and just strings of alternating bits rather than anything
that could have a more complicated interpretation.

On my little-endian mac, the 0x77777777 value creates a line pointer which redirects to an invalid offset 0x7777, which
gets reported as decimal 30583 in the corruption report, "line pointer redirection to item at offset 30583 exceeds
maximum offset 38".  The test is indifferent to whether the corruption it is looking for is reported relative to the
first line pointer or the second one, so if endian-ness matters, it may be the 0xAAAAAAAA that results in that
corruption message.  I don't have a machine handy to test that.  It would be nice to determine the minimum amount of
paranoia necessary to make this portable and not commit the rest.
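
A rough sketch of that approach (the relation name, file handle, and the details of the verify_heapam call below are illustrative assumptions, not the attached patch):

use Test::More;

# Overwrite the first two line pointers with the two patterns above.
sysseek($fh, 32, 0) or die "sysseek failed: $!";
syswrite($fh, pack('LL', 0x77777777, 0xAAAAAAAA))
    or die "syswrite failed: $!";
close($fh);

# After restarting the server, accept the expected complaint no matter
# which of the two corrupted line pointers produces it.
my $report = $node->safe_psql('postgres',
    q(SELECT msg FROM verify_heapam('test', on_error_stop := false)));
like($report,
    qr/line pointer redirection to item at offset \d+ exceeds maximum offset \d+/,
    'corrupt line pointers are reported');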




—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 22, 2020, at 9:21 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
>
>
>> On Oct 22, 2020, at 7:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>>> Ahh, crud.  It's because
>>>     syswrite($fh, '\x77\x77\x77\x77', 500)
>>> is wrong twice.  The 500 was wrong, but the string there isn't the bit pattern we want -- it's just a string
literal with backslashes and such.  It should have been double-quoted.
>>
>> Argh.  So we really have, using same test except
>>
>>     memcpy(&lp, "\\x77", sizeof(lp));
>>
>> little endian:    off = 785c, flags = 2, len = 1b9b
>> big endian:    off = 2e3c, flags = 0, len = 3737
>>
>> which explains the apparent LP_DEAD result.
>>
>> I'm not particularly on board with your suggestion of "well, if it works
>> sometimes then it's okay".  Then we have no idea of what we really tested.
>>
>>             regards, tom lane
>
> Ok, I've pruned it down to something you may like better.  Instead of just checking that *some* corruption occurs, it
checks the returned corruption against an expected regex, and if it fails to match, you should see in the logs what you
got vs. what you expected.
>
> It only corrupts the first two line pointers, the first one with 0x77777777 and the second one with 0xAAAAAAAA, which
are consciously chosen to be bitwise reverses of each other and just strings of alternating bits rather than anything
that could have a more complicated interpretation.
>
> On my little-endian mac, the 0x77777777 value creates a line pointer which redirects to an invalid offset 0x7777,
which gets reported as decimal 30583 in the corruption report, "line pointer redirection to item at offset 30583 exceeds
maximum offset 38".  The test is indifferent to whether the corruption it is looking for is reported relative to the
first line pointer or the second one, so if endian-ness matters, it may be the 0xAAAAAAAA that results in that
corruption message.  I don't have a machine handy to test that.  It would be nice to determine the minimum amount of
paranoia necessary to make this portable and not commit the rest.

Obviously, that should have said 0x55555555 and 0xAAAAAAAA.  After writing the patch that way, I checked that the old
value 0x77777777 also works on my mac, which it does, and checked that writing the line pointers starting at offset 24
rather than 32 works on my mac, which it does.  I then went on to write this rather confused email and attached the
patch with those changes, which all work (at least on my mac) but are potentially less portable than what I had before
testing those changes.

I apologize for any confusion my email from last night may have caused.

The patch I *should* have attached last night this time:



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> The patch I *should* have attached last night this time:

Thanks, I'll do some big-endian testing with this.

            regards, tom lane



Re: new heapcheck contrib module

From
Tom Lane
Date:
I wrote:
> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> The patch I *should* have attached last night this time:

> Thanks, I'll do some big-endian testing with this.

Seems to work, so I pushed it (after some compulsive fooling
about with whitespace and perltidy-ing).  It appears to me that
the code coverage for verify_heapam.c is not very good though,
only circa 50%.  Do we care to expend more effort on that?

            regards, tom lane



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 23, 2020, at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> I wrote:
>> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>>> The patch I *should* have attached last night this time:
>
>> Thanks, I'll do some big-endian testing with this.
>
> Seems to work, so I pushed it (after some compulsive fooling
> about with whitespace and perltidy-ing).

Thanks for all the help!

> It appears to me that
> the code coverage for verify_heapam.c is not very good though,
> only circa 50%.  Do we care to expend more effort on that?

Part of the issue here is that I developed the heapcheck code as a sequence of patches, and there is much greater
coverage in the tests in the 0002 patch, which hasn't been committed yet.  (Nor do I know that it ever will be.)  Over
time, the patch set became:

0001 -- adds verify_heapam.c to contrib/amcheck, with basic test coverage
0002 -- adds pg_amcheck command line interface to contrib/pg_amcheck, with more extensive test coverage
0003 -- creates a non-throwing interface to clog
0004 -- uses the non-throwing clog interface from within verify_heapam
0005 -- adds some controversial acl checks to verify_heapam

Your question doesn't have much to do with 3,4,5 above, but it definitely matters whether we're going to commit 0002.
The test in that patch, in contrib/pg_amcheck/t/004_verify_heapam.pl, does quite a bit of bit twiddling of heap tuples
and toast records and checks that the right corruption messages come back.  Part of the reason I was trying to keep
0001's t/001_verify_heapam.pl test ignorant of the exact page layout information is that I already had this other test
that covers that.

So, should I port that test from (currently non-existent) contrib/pg_amcheck into contrib/amcheck, or should we wait to
see if the 0002 patch is going to get committed?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Tom Lane
Date:
Hmm, we're not out of the woods yet: thorntail is even less happy
than before.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2020-10-23%2018%3A08%3A11

I do not have 64-bit big-endian hardware to play with unfortunately.
But what I suspect is happening here is less about endianness and
more about alignment pickiness; or maybe we were unlucky enough to
index off the end of the shmem segment.  I see that verify_heapam
does this for non-redirect tuples:

            /* Set up context information about this next tuple */
            ctx.lp_len = ItemIdGetLength(ctx.itemid);
            ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, ctx.itemid);
            ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr);

with absolutely no thought for the possibility that lp_off is out of
range or not maxaligned.  The checks for a sane lp_len seem to have
gone missing as well.

            regards, tom lane



Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Fri, Oct 23, 2020 at 11:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>             /* Set up context information about this next tuple */
>             ctx.lp_len = ItemIdGetLength(ctx.itemid);
>             ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, ctx.itemid);
>             ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr);
>
> with absolutely no thought for the possibility that lp_off is out of
> range or not maxaligned.  The checks for a sane lp_len seem to have
> gone missing as well.

That is surprising. verify_nbtree.c has PageGetItemIdCareful() for
this exact reason.

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 23, 2020, at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Hmm, we're not out of the woods yet: thorntail is even less happy
> than before.
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2020-10-23%2018%3A08%3A11
>
> I do not have 64-bit big-endian hardware to play with unfortunately.
> But what I suspect is happening here is less about endianness and
> more about alignment pickiness; or maybe we were unlucky enough to
> index off the end of the shmem segment.  I see that verify_heapam
> does this for non-redirect tuples:
>
>            /* Set up context information about this next tuple */
>            ctx.lp_len = ItemIdGetLength(ctx.itemid);
>            ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, ctx.itemid);
>            ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr);
>
> with absolutely no thought for the possibility that lp_off is out of
> range or not maxaligned.  The checks for a sane lp_len seem to have
> gone missing as well.

You certainly appear to be right about that.  I've added the extra checks, and extended the regression test to include
them. Patch attached. 



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> On Oct 23, 2020, at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I do not have 64-bit big-endian hardware to play with unfortunately.
>> But what I suspect is happening here is less about endianness and
>> more about alignment pickiness; or maybe we were unlucky enough to
>> index off the end of the shmem segment.

> You certainly appear to be right about that.  I've added the extra checks, and extended the regression test to
include them.  Patch attached.

Meanwhile, I've replicated the SIGBUS problem on gaur's host, so
that's definitely what's happening.

(Although PPC is also alignment-picky on the hardware level, I believe
that both macOS and Linux try to mask that by having kernel trap handlers
execute unaligned accesses, leaving only a nasty performance loss behind.
That's why I failed to see this effect when checking your previous patch
on an old Apple box.  We likely won't see it in the buildfarm either,
unless maybe on Noah's AIX menagerie.)

I'll check this patch on gaur and push it if it's clean.

            regards, tom lane



Re: new heapcheck contrib module

From
Tom Lane
Date:
Mark Dilger <mark.dilger@enterprisedb.com> writes:
> You certainly appear to be right about that.  I've added the extra checks, and extended the regression test to
include them.  Patch attached.

Pushed with some more fooling about.  The "bit reversal" idea is not
a sufficient guide to picking values that will hit all the code checks.
For instance, I was seeing out-of-range warnings on one endianness and
not the other because on the other one the maxalign check rejected the
value first.  I ended up manually tweaking the corruption patterns
until they hit all the tests on both endiannesses.  Pretty much the
opposite of black-box testing, but it's not like our notions of line
pointer layout are going to change anytime soon.

I made some logic rearrangements in the C code, too.

            regards, tom lane



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 23, 2020, at 4:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Mark Dilger <mark.dilger@enterprisedb.com> writes:
>> You certainly appear to be right about that.  I've added the extra checks, and extended the regression test to
include them.  Patch attached.
>
> Pushed with some more fooling about.  The "bit reversal" idea is not
> a sufficient guide to picking values that will hit all the code checks.
> For instance, I was seeing out-of-range warnings on one endianness and
> not the other because on the other one the maxalign check rejected the
> value first.  I ended up manually tweaking the corruption patterns
> until they hit all the tests on both endiannesses.  Pretty much the
> opposite of black-box testing, but it's not like our notions of line
> pointer layout are going to change anytime soon.
>
> I made some logic rearrangements in the C code, too.

Thanks Tom!  And Peter, your comment earlier saved me some time.  Thanks to you, also!

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Robert Haas
Date:
On Fri, Oct 23, 2020 at 2:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Seems to work, so I pushed it (after some compulsive fooling
> about with whitespace and perltidy-ing).  It appears to me that
> the code coverage for verify_heapam.c is not very good though,
> only circa 50%.  Do we care to expend more effort on that?

There are two competing goods here. On the one hand, more test
coverage is better than less. On the other hand, finicky tests that
have platform-dependent results or fail for strange reasons not
indicative of actual problems with the code are often judged not to be
worth the trouble. An early version of this patch set had a very
extensive chunk of Perl code in it that actually understood the page
layout and, if we adopt something like that, it would probably be
easier to test a whole bunch of scenarios. The downside is that it was
a lot of code that basically duplicated a lot of backend logic in
Perl, and I was (and am) afraid that people will complain about the
amount of code and/or the difficulty of maintaining it. On the other
hand, having all that code might allow better testing not only of this
particular patch but also other scenarios involving corrupted pages,
so maybe it's wrong to view all that code as a burden that we have to
carry specifically to test this; or, alternatively, maybe it's worth
carrying even if we only use it for this. On the third hand, as Mark
points out, if we get 0002 committed, that will help somewhat with
test coverage even if we do nothing else.

Thanks for committing (and adjusting) the patches for the existing
buildfarm failures. If I understand the buildfarm results correctly,
hornet is still unhappy even after
321633e17b07968e68ca5341429e2c8bbf15c331?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 26, 2020, at 6:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Oct 23, 2020 at 2:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Seems to work, so I pushed it (after some compulsive fooling
>> about with whitespace and perltidy-ing).  It appears to me that
>> the code coverage for verify_heapam.c is not very good though,
>> only circa 50%.  Do we care to expend more effort on that?
>
> There are two competing goods here. On the one hand, more test
> coverage is better than less. On the other hand, finicky tests that
> have platform-dependent results or fail for strange reasons not
> indicative of actual problems with the code are often judged not to be
> worth the trouble. An early version of this patch set had a very
> extensive chunk of Perl code in it that actually understood the page
> layout and, if we adopt something like that, it would probably be
> easier to test a whole bunch of scenarios. The downside is that it was
> a lot of code that basically duplicated a lot of backend logic in
> Perl, and I was (and am) afraid that people will complain about the
> amount of code and/or the difficulty of maintaining it. On the other
> hand, having all that code might allow better testing not only of this
> particular patch but also other scenarios involving corrupted pages,
> so maybe it's wrong to view all that code as a burden that we have to
> carry specifically to test this; or, alternatively, maybe it's worth
> carrying even if we only use it for this. On the third hand, as Mark
> points out, if we get 0002 committed, that will help somewhat with
> test coverage even if we do nothing else.

Much of the test in 0002 could be ported to work without committing the rest of 0002, if the pg_amcheck command line
utility is not wanted.

>
> Thanks for committing (and adjusting) the patches for the existing
> buildfarm failures. If I understand the buildfarm results correctly,
> hornet is still unhappy even after
> 321633e17b07968e68ca5341429e2c8bbf15c331?

That appears to be a failed test for pg_surgery rather than for amcheck.  Or am I reading the log wrong?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Robert Haas
Date:
On Mon, Oct 26, 2020 at 9:56 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Much of the test in 0002 could be ported to work without committing the rest of 0002, if the pg_amcheck command line
utility is not wanted.
 

How much consensus do we think we have around 0002 at this point? I
think I remember a vote in favor and no votes against, but I haven't
been paying a whole lot of attention.

> > Thanks for committing (and adjusting) the patches for the existing
> > buildfarm failures. If I understand the buildfarm results correctly,
> > hornet is still unhappy even after
> > 321633e17b07968e68ca5341429e2c8bbf15c331?
>
> That appears to be a failed test for pg_surgery rather than for amcheck.  Or am I reading the log wrong?

Oh, yeah, you're right. I don't know why it just failed now, though:
there are a bunch of successful runs preceding it. But I guess it's
unrelated to this thread.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 26, 2020, at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Oct 26, 2020 at 9:56 AM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> Much of the test in 0002 could be ported to work without committing the rest of 0002, if the pg_amcheck command line
utility is not wanted.
>
> How much consensus do we think we have around 0002 at this point? I
> think I remember a vote in favor and no votes against, but I haven't
> been paying a whole lot of attention.

My sense over the course of the thread is that people were very much in favor of having heap checking functionality,
but quite vague on whether they wanted the command line interface.  I think the interface is useful, but I'd rather hear
from others on this list whether it is useful enough to justify maintaining it.  As the author of it, I'm biased.
Hopefully others with a more objective view of the matter will read this and vote?

I don't recall patches 0003 through 0005 getting any votes.  0003 and 0004, which create and use a non-throwing
interface to clog, were written in response to Andrey's request, so I'm guessing that's kind of a vote in favor.  0005
was factored out of 0001 in response to a lack of agreement about whether verify_heapam should have acl checks.  You
seemed in favor, and Peter against, but I don't think we heard other opinions.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Oct 26, 2020 at 9:56 AM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>>> hornet is still unhappy even after
>>> 321633e17b07968e68ca5341429e2c8bbf15c331?

>> That appears to be a failed test for pg_surgery rather than for amcheck.  Or am I reading the log wrong?

> Oh, yeah, you're right. I don't know why it just failed now, though:
> there are a bunch of successful runs preceding it. But I guess it's
> unrelated to this thread.

pg_surgery's been unstable since it went in.  I believe Andres is
working on a fix.

            regards, tom lane



Re: new heapcheck contrib module

From
Andres Freund
Date:
Hi,

On October 26, 2020 7:13:15 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Oct 26, 2020 at 9:56 AM Mark Dilger
>> <mark.dilger@enterprisedb.com> wrote:
>>>> hornet is still unhappy even after
>>>> 321633e17b07968e68ca5341429e2c8bbf15c331?
>
>>> That appears to be a failed test for pg_surgery rather than for
>amcheck.  Or am I reading the log wrong?
>
>> Oh, yeah, you're right. I don't know why it just failed now, though:
>> there are a bunch of successful runs preceding it. But I guess it's
>> unrelated to this thread.
>
>pg_surgery's been unstable since it went in.  I believe Andres is
>working on a fix.

I posted one a while ago - was planning to push a cleaned up version soon if nobody comments in the near future.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 26, 2020, at 7:08 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
>
>
>> On Oct 26, 2020, at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Mon, Oct 26, 2020 at 9:56 AM Mark Dilger
>> <mark.dilger@enterprisedb.com> wrote:
>>> Much of the test in 0002 could be ported to work without committing the rest of 0002, if the pg_amcheck command
line utility is not wanted.
>>
>> How much consensus do we think we have around 0002 at this point? I
>> think I remember a vote in favor and no votes against, but I haven't
>> been paying a whole lot of attention.
>
> My sense over the course of the thread is that people were very much in favor of having heap checking functionality,
but quite vague on whether they wanted the command line interface.  I think the interface is useful, but I'd rather hear
from others on this list whether it is useful enough to justify maintaining it.  As the author of it, I'm biased.
Hopefully others with a more objective view of the matter will read this and vote?
>
> I don't recall patches 0003 through 0005 getting any votes.  0003 and 0004, which create and use a non-throwing
interface to clog, were written in response to Andrey's request, so I'm guessing that's kind of a vote in favor.  0005
was factored out of 0001 in response to a lack of agreement about whether verify_heapam should have acl checks.  You
seemed in favor, and Peter against, but I don't think we heard other opinions.

The v20 patches 0002, 0003, and 0005 still apply cleanly, but 0004 required a rebase.  (0001 was already committed last
week.)

Here is a rebased set of 4 patches, numbered 0002..0005 to be consistent with the previous naming.  There are no
substantial changes.



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Wed, Oct 21, 2020 at 11:45 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> Done that way in the attached, which also include Robert's changes from v19 he posted earlier today.

> Committed. Let's see what the buildfarm thinks.

Another thing that the buildfarm is pointing out is

[WARN] FOUserAgent - The contents of fo:block line 2 exceed the available area in the inline-progression direction by
more than 50 points. (See position 148863:380)

This is coming from the sample output for verify_heapam(), which is too
wide to fit in even a normal-size browser window, let alone A4 PDF.

While we could perhaps hack it up to allow more line breaks, or see
if \x formatting helps, my own suggestion would be to just nuke the
sample output altogether.  It doesn't look like it is any sort of
representative real output, and it is not useful enough to be worth
spending time to patch up.

            regards, tom lane



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Oct 26, 2020, at 9:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Wed, Oct 21, 2020 at 11:45 PM Mark Dilger
>> <mark.dilger@enterprisedb.com> wrote:
>>> Done that way in the attached, which also include Robert's changes from v19 he posted earlier today.
>
>> Committed. Let's see what the buildfarm thinks.
>
> Another thing that the buildfarm is pointing out is
>
> [WARN] FOUserAgent - The contents of fo:block line 2 exceed the available area in the inline-progression direction by
more than 50 points. (See position 148863:380)
>
> This is coming from the sample output for verify_heapam(), which is too
> wide to fit in even a normal-size browser window, let alone A4 PDF.
>
> While we could perhaps hack it up to allow more line breaks, or see
> if \x formatting helps, my own suggestion would be to just nuke the
> sample output altogether.

Ok.

> It doesn't look like it is any sort of
> representative real output,

It is not.  It came from artificially created corruption in the regression tests.  I may even have manually edited
that, though I don't recall.

> and it is not useful enough to be worth
> spending time to patch up.

Ok.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Robert Haas
Date:
On Mon, Oct 26, 2020 at 12:12 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> The v20 patches 0002, 0003, and 0005 still apply cleanly, but 0004 required a rebase.  (0001 was already committed
last week.)
 
>
> Here is a rebased set of 4 patches, numbered 0002..0005 to be consistent with the previous naming.  There are no
substantial changes.
 

Here's a review of 0002. I basically like the direction this is going
but I guess nobody will be surprised that there are some things in
here that I think could be improved.

+const char *usage_text[] = {
+       "pg_amcheck is the PostgreSQL command line frontend for the
amcheck database corruption checker.",
+       "",

This looks like a novel approach to the problem of printing out the
usage() information, and I think that it's inferior to the technique
used elsewhere of just having a bunch of printf() statements, because
unless I misunderstand, it doesn't permit localization.

+       "  -b, --startblock             begin checking table(s) at the
given starting block number",
+       "  -e, --endblock               check table(s) only up to the
given ending block number",
+       "  -B, --toast-startblock       begin checking toast table(s)
at the given starting block",
+       "  -E, --toast-endblock         check toast table(s) only up
to the given ending block",

I am not very convinced by this. What's the use case? If you're just
checking a single table, you might want to specify a start and end
block, but then you don't need separate options for the TOAST and
non-TOAST cases, do you? If I want to check pg_statistic, I'll say
pg_amcheck -t pg_catalog.pg_statistic. If I want to check the TOAST
table for pg_statistic, I'll say pg_amcheck -t pg_toast.pg_toast_2619.
In either case, if I want to check just the first three blocks, I can
add -b 0 -e 2.

+       "  -f, --skip-all-frozen        do NOT check blocks marked as
all frozen",
+       "  -v, --skip-all-visible       do NOT check blocks marked as
all visible",

I think this is using up too many one character option names for too
little benefit on things that are too closely related. How about, -s,
--skip=all-frozen|all-visible|none? And then -v could mean verbose,
which could trigger things like printing all the queries sent to the
server, setting PQERRORS_VERBOSE, etc.

+       "  -x, --check-indexes          check btree indexes associated
with tables being checked",
+       "  -X, --skip-indexes           do NOT check any btree indexes",
+       "  -i, --index=PATTERN          check the specified index(es) only",
+       "  -I, --exclude-index=PATTERN  do NOT check the specified index(es)",

This is a lotta controls for something that has gotta have some
default. Either the default is everything, in which case I don't see
why I need -x, or it's nothing, in which case I don't see why I need
-X.

+       "  -c, --check-corrupt          check indexes even if their
associated table is corrupt",
+       "  -C, --skip-corrupt           do NOT check indexes if their
associated table is corrupt",

Ditto. (I think the default be to check corrupt, and there can be an
option to skip it.)

+       "  -a, --heapallindexed         check index tuples against the
table tuples",
+       "  -A, --no-heapallindexed      do NOT check index tuples
against the table tuples",

Ditto. (Not sure what the default should be, though.)

+       "  -r, --rootdescend            search from the root page for
each index tuple",
+       "  -R, --no-rootdescend         do NOT search from the root
page for each index tuple",

Ditto. (Again, not sure about the default.)

I'm also not sure if these descriptions are clear enough, but it may
also be hard to do a good job in a brief space. Still, comparing this
to the documentation of heapallindexed makes me rather nervous. This
is only trying to verify that the index contains all the tuples in the
heap, not that the values in the heap and index tuples actually match.

+typedef struct
+AmCheckSettings
+{
+       char       *dbname;
+       char       *host;
+       char       *port;
+       char       *username;
+} ConnectOptions;

Making the struct name different from the type name seems not good,
and the struct name also shouldn't be on a separate line.

+typedef enum trivalue
+{
+       TRI_DEFAULT,
+       TRI_NO,
+       TRI_YES
+} trivalue;

Ugh. It's not this patch's fault, but we really oughta move this to
someplace more centralized.

+typedef struct
...
+} AmCheckSettings;

I'm not sure I consider all of these things settings, "db" in
particular. But maybe that's nitpicking.

+static void expand_schema_name_patterns(const SimpleStringList *patterns,
+                                         const SimpleOidList *exclude_oids,
+                                         SimpleOidList *oids,
+                                         bool strict_names);

This is copied from pg_dump, along with I think at least one other
function from nearby. Unlike the trivalue case above, this would be
the first duplication of this logic. Can we push this stuff into
pgcommon, perhaps?

+       /*
+        * Default behaviors for user settable options.  Note that these default
+        * to doing all the safe checks and none of the unsafe ones, on the theory
+        * that if a user says "pg_amcheck mydb" without specifying any additional
+        * options, we should check everything we know how to check without
+        * risking any backend aborts.
+        */

This to me seems too conservative. The result is that by default we
check only tables, not indexes. I don't think that's going to be what
users want. I don't know whether they want the heapallindexed or
rootdescend behaviors for index checks, but I think they want their
indexes checked. Happy to hear opinions from actual users on what they
want; this is just me guessing that you've guessed wrong. :-)

+               if (settings.db == NULL)
+               {
+                       pg_log_error("no connection to server after
initial attempt");
+                       exit(EXIT_BADCONN);
+               }

I think this is documented as meaning out of memory, and reported that
way elsewhere. Anyway I am going to keep complaining until there are
no cases where we tell the user it broke without telling them what
broke. Which means this bit is a problem too:

+       if (!settings.db)
+       {
+               pg_log_error("no connection to server");
+               exit(EXIT_BADCONN);
+       }

Something went wrong, good luck figuring out what it was!

+       /*
+        * All information about corrupt indexes are returned via ereport, not as
+        * tuples.  We want all the details to report if corruption exists.
+        */
+       PQsetErrorVerbosity(settings.db, PQERRORS_VERBOSE);

Really? Why? If I need the source code file name, function name, and
line number to figure out what went wrong, that is not a great sign
for the quality of the error reports it produces.

+                       /*
+                        * The btree checking logic which optionally checks the contents
+                        * of an index against the corresponding table has not yet been
+                        * sufficiently hardened against corrupt tables.  In particular,
+                        * when called with heapallindexed true, it segfaults if the file
+                        * backing the table relation has been erroneously unlinked.  In
+                        * any event, it seems unwise to reconcile an index against its
+                        * table when we already know the table is corrupt.
+                        */
+                       old_heapallindexed = settings.heapallindexed;
+                       if (corruptions)
+                               settings.heapallindexed = false;

This seems pretty lame to me. Even if the btree checker can't tolerate
corruption to the extent that the heap checker does, seg faulting
because of a missing file seems like a bug that we should just fix
(and probably back-patch). I'm not very convinced by the decision to
override the user's decision about heapallindexed either. Maybe I lack
imagination, but that seems pretty arbitrary. Suppose there's a giant
index which is missing entries for 5 million heap tuples and also
there's 1 entry in the table which has an xmin that is less than the
pg_class.relfrozenxid value by 1. You are proposing that because I have
the latter problem I don't want you to check for the former one. But
I, John Q. Smartuser, do not want you to second-guess what I told you
on the command line that I wanted. :-)

I think in general you're worrying too much about the possibility of
this tool causing backend crashes. I think it's good that you wrote
the heapcheck code in a way that's hardened against that, and I think
we should try to harden other things as time permits. But I don't
think that the remote possibility of a crash due to the lack of such
hardening should dictate the design behavior of this tool. If the
crash possibilities are not remote, then I think the solution is to
fix them, rather than cutting out important checks.

It doesn't seem like great design to me that get_table_check_list()
gets just the OID of the table itself, and then later if we decide to
check the TOAST table we've got to run a separate query for each table
we want to check to fetch the TOAST OID, when we could've just fetched
both in get_table_check_list() by including two columns in the query
rather than one and it would've been basically free. Imagine if some
user wrote a query that fetched the primary key value for all their
rows and then had their application run a separate query to fetch the
entire contents of each of those rows, said contents consisting of one
more integer. And then suppose they complained about performance. We'd
tell them they were doing it wrong, and so here.

+       if (settings.db == NULL)
+               fatal("no connection on entry to check_table");

Uninformative. Is this basically an Assert? If so maybe just make it
one. If not maybe fail somewhere else with a better message?

+       if (startblock == NULL)
+               startblock = "NULL";
+       if (endblock == NULL)
+               endblock = "NULL";

It seems like it would be more elegant to initialize
settings.startblock and settings.endblock to "NULL." However, there's
also a related problem, which is that the startblock and endblock
values can be anything, and are interpolated with quoting. I don't
think that it's good to ship a tool with SQL injection hazards built
into it. I think that you should (a) check that these values are
integers during argument parsing and error out if they are not and
then (b) use either a prepared query or PQescapeLiteral() anyway.

+       stop = (on_error_stop) ? "true" : "false";
+       toast = (check_toast) ? "true" : "false";

The parens aren't really needed here.

+                           printf("(relname=%s,blkno=%s,offnum=%s,attnum=%s)\n%s\n",
+                                  PQgetvalue(res, i, 0),       /* relname */
+                                  PQgetvalue(res, i, 1),       /* blkno */
+                                  PQgetvalue(res, i, 2),       /* offnum */
+                                  PQgetvalue(res, i, 3),       /* attnum */
+                                  PQgetvalue(res, i, 4));      /* msg */

I am not quite sure how to format the output, but this looks like
something designed by an engineer who knows too much about the topic.
I suspect users won't find the use of things like "relname" and
"blkno" too easy to understand. At least I think we should say
"relation, block, offset, attribute" instead of "relname, blkno,
offnum, attnum". I would probably drop the parenthesis and add spaces,
so that you end up with something like:

relation "%s", block "%s", offset "%s", attribute "%s":

I would also define variant strings so that we entirely omit things
that are NULL. e.g. have four strings:

relation "%s":
relation "%s", block "%s":(
relation "%s", block "%s", offset "%s":
relation "%s", block "%s", offset "%s", attribute "%s":

Would it make it more readable if we indented the continuation line by
four spaces or something?
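
A sketch of what that could look like (the function name and exact wording here are illustrative, not the tool's):

    #include <stdio.h>

    /* Print one corruption report, omitting whichever location fields the server did not supply. */
    static void
    print_corruption(const char *rel, const char *blk,
                     const char *off, const char *att, const char *msg)
    {
        if (att)
            printf("relation \"%s\", block \"%s\", offset \"%s\", attribute \"%s\":\n",
                   rel, blk, off, att);
        else if (off)
            printf("relation \"%s\", block \"%s\", offset \"%s\":\n", rel, blk, off);
        else if (blk)
            printf("relation \"%s\", block \"%s\":\n", rel, blk);
        else
            printf("relation \"%s\":\n", rel);
        printf("    %s\n", msg);        /* continuation line indented four spaces */
    }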

+               corruption_cnt++;
+               printf("%s\n", error);
+               pfree(error);

Seems like we could still print the relation name in this case, and
that it would be a good idea to do so, in case it's not in the message
that the server returns.

The general logic in this part of the code looks a bit strange to me.
If ExecuteSqlQuery() returns PGRES_TUPLES_OK, we print out the details
for each returned row. Otherwise, if error = true, we print the error.
But, what if neither of those things are the case? Then we'd just
print nothing despite having gotten back some weird response from the
server. That actually can't happen, because ExecuteSqlQuery() always
sets *error when the return code is not PGRES_TUPLES_OK, but you
wouldn't know that from looking at this code.

Honestly, as written, ExecuteSqlQuery() seems like kind of a waste. The
OrDie() version is useful as a notational shorthand, but this version
seems to add more confusion than clarity. It has only three callers:
the ones in check_table() and check_indexes() have the problem
described above, and the one in get_toast_oid() could just as well be
using the OrDie() version. And also we should probably get rid of it
entirely by fetching the toast OIDs the first time around, as
mentioned above.

check_indexes() lacks a function comment. It seems to have more or
less the same problem as get_toast_oid() -- an extra query per table
to get the list of indexes. I guess it has a better excuse: there
could be lots of indexes per table, and we're fetching multiple
columns of data for each one, whereas in the TOAST case we are issuing
an extra query per table to fetch a single integer. But, couldn't we
fetch information about all the indexes we want to check in one go,
rather than fetching them separately for each table being checked? I'm
not sure if that would create too much other complexity, but it seems
like it would be quicker.

+       if (settings.db == NULL)
+               fatal("no connection on entry to check_index");
+       if (idxname == NULL)
+               fatal("no index name on entry to check_index");
+       if (tblname == NULL)
+               fatal("no table name on entry to check_index");

Again, probably these should be asserts, or if they're not, the error
should be reported better and maybe elsewhere.

Similarly in some other places, like expand_schema_name_patterns().

+        * The loop below runs multiple SELECTs might sometimes result in
+        * duplicate entries in the Oid list, but we don't care.

This is missing a which, like the place you copied it from, but the
version in pg_dumpall.c is better.

expand_table_name_patterns() should be reformatted to not gratuitously
exceed 80 columns.  Ditto for expand_index_name_patterns().

I sort of expected that this patch might use threads to allow parallel
checking - seems like it would be a useful feature.

I originally intended to review the docs and regression tests in the
same email as the patch itself, but this email has gotten rather long
and taken rather longer to get together than I had hoped, so I'm going
to stop here for now and come back to that stuff.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Thu, Nov 19, 2020 at 9:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I'm also not sure if these descriptions are clear enough, but it may
> also be hard to do a good job in a brief space. Still, comparing this
> to the documentation of heapallindexed makes me rather nervous. This
> is only trying to verify that the index contains all the tuples in the
> heap, not that the values in the heap and index tuples actually match.

That's a good point. As things stand, heapallindexed verification does
not notice when there are extra index tuples in the index that are in
some way inconsistent with the heap. Hopefully this isn't too much of
a problem in practice because the presence of extra spurious tuples
gets detected by the index structure verification process. But in
general that might not happen.

Ideally heapallindexed verification would verify 1:1 correspondence. It
doesn't do that right now, but it could.

This could work by having two bloom filters -- one for the heap,
another for the index. The implementation would look for the absence
of index tuples that should be in the index initially, just like
today. But at the end it would modify the index bloom filter by &= it
with the complement of the heap bloom filter. If any bits are left set
in the index bloom filter, we go back through the index once more and
locate index tuples that have at least some matching bits in the index
bloom filter (we cannot expect all of the bits from each of the hash
functions used by the bloom filter to still be matches).

From here we can do some kind of lookup for maybe-not-matching index
tuples that we locate. Make sure that they point to an LP_DEAD line
item in the heap or something. Make sure that they have the same
values as the heap tuple if they're still retrievable (i.e. if we
haven't pruned the heap tuple away already).
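
A minimal sketch of the filter bookkeeping this would need, assuming both filters use the same sizing and hash
functions (everything below, including the single hash per key and the fixed sizing, is hypothetical rather than
amcheck code):

    #include <stdbool.h>
    #include <stdint.h>

    #define FILTER_BITS   (1U << 20)            /* power of two, multiple of 64 */
    #define FILTER_WORDS  (FILTER_BITS / 64)

    typedef struct bloom { uint64_t words[FILTER_WORDS]; } bloom;

    /* Set the bit for one hash of a tuple's key; a real filter would do this for k independent hashes. */
    static void
    bloom_set(bloom *f, uint64_t hash)
    {
        f->words[(hash % FILTER_BITS) / 64] |= UINT64_C(1) << (hash % 64);
    }

    /*
     * After the heap and index scans: clear every index-filter bit that also appears in the heap filter.
     * Any bit that remains set can only have come from an index tuple with no heap counterpart, so a second
     * index pass need only re-examine tuples whose hashes still land on set bits.
     */
    static bool
    bloom_subtract_heap(bloom *index_filter, const bloom *heap_filter)
    {
        bool        remaining = false;

        for (int i = 0; i < FILTER_WORDS; i++)
        {
            index_filter->words[i] &= ~heap_filter->words[i];
            remaining |= (index_filter->words[i] != 0);
        }
        return remaining;
    }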

> This to me seems too conservative. The result is that by default we
> check only tables, not indexes. I don't think that's going to be what
> users want. I don't know whether they want the heapallindexed or
> rootdescend behaviors for index checks, but I think they want their
> indexes checked. Happy to hear opinions from actual users on what they
> want; this is just me guessing that you've guessed wrong. :-)

My thoughts on these two options:

* I don't think that users will ever want rootdescend verification.

That option exists now because I wanted to have something that relied
on the uniqueness property of B-Tree indexes following the Postgres 12
work. I didn't add retail index tuple deletion, so it seemed like a
good idea to have something that makes the same assumptions that it
would have to make. To validate the design.

Another factor is that Alexander Korotkov made the basic
bt_index_parent_check() tests a lot better for Postgres 13. This
undermined the practical argument for using rootdescend verification.

Finally, note that bt_index_parent_check() was always supposed to be
something that was to be used only when you already knew that you had
big problems, and wanted absolutely thorough verification without
regard for the costs. This isn't the common case at all. It would be
reasonable to not expose anything from bt_index_parent_check() at all,
or to give it much less prominence. Not really sure of what the right
balance is here myself, so I'm not insisting on anything. Just telling
you what I know about it.

* heapallindexed is kind of expensive, but valuable. But the extra
check is probably less likely to help on the second or subsequent
index on a table.

It might be worth considering an option that only uses it with only
one index: Preferably the primary key index, failing that some unique
index, and failing that some other index.

> This seems pretty lame to me. Even if the btree checker can't tolerate
> corruption to the extent that the heap checker does, seg faulting
> because of a missing file seems like a bug that we should just fix
> (and probably back-patch). I'm not very convinced by the decision to
> override the user's decision about heapallindexed either.

I strongly agree.

> Maybe I lack
> imagination, but that seems pretty arbitrary. Suppose there's a giant
> index which is missing entries for 5 million heap tuples and also
> there's 1 entry in the table which has an xmin that is less than the
> pg_class.relfrozenxid value by 1. You are proposing that because I have
> the latter problem I don't want you to check for the former one. But
> I, John Q. Smartuser, do not want you to second-guess what I told you
> on the command line that I wanted. :-)

Even if your user is just average, they still have one major advantage
over the architects of pg_amcheck: actual knowledge of the problem in
front of them.

> I think in general you're worrying too much about the possibility of
> this tool causing backend crashes. I think it's good that you wrote
> the heapcheck code in a way that's hardened against that, and I think
> we should try to harden other things as time permits. But I don't
> think that the remote possibility of a crash due to the lack of such
> hardening should dictate the design behavior of this tool. If the
> crash possibilities are not remote, then I think the solution is to
> fix them, rather than cutting out important checks.

I couldn't agree more.

I think that you need to have a kind of epistemic modesty with this
stuff. Okay, we guarantee that the backend won't crash when certain
amcheck functions are run, based on these caveats. But don't we always
guarantee something like that? And are the specific caveats actually
that different in each case, when you get right down to it? A
guarantee does not exist in a vacuum. It always has implicit
limitations. For example, any guarantee implicitly comes with the
caveat "unless I, the guarantor, am wrong". Normally this doesn't
really matter because normally we're not concerned about extreme
events that will probably never happen even once. But amcheck is very
much not like that. The chances of the guarantor being the weakest
link are actually rather high. Everyone is better off with a design
that accepts this view of things.

I'm also suspicious of guarantees like this for less philosophical
reasons. It seems to me like it solves our problem rather than the
user's problem. Having data that is so badly corrupt that it's
difficult to avoid segfaults when we perform some kind of standard
transformations on it is an appalling state of affairs for the user.
The segfault itself is very much not the point at all. We should focus
on making the tool as thorough and low overhead as possible. If we
have to make the tool significantly more complicated to avoid
extremely unlikely segfaults then we're actually doing the user a
disservice, because we're increasing the chances that we the
guarantors will be the weakest link (which was already high enough).
This smacks of hubris.

I also agree that hardening is a worthwhile exercise here, of course.
We should be holding amcheck to a higher standard when it comes to not
segfaulting with corrupt data.

-- 
Peter Geoghegan



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Thu, Nov 19, 2020 at 2:48 PM Peter Geoghegan <pg@bowt.ie> wrote:
> Ideally heapallindexed verification would verify 1:1 correspondence. It
> doesn't do that right now, but it could.

Well, that might be a cool new mode, but it doesn't necessarily have
to supplant the thing we have now. The problem immediately before us
is just making sure that the user can understand what we will and
won't be checking.

> My thoughts on these two options:
>
> * I don't think that users will ever want rootdescend verification.

That seems too absolute. I think it's fine to say, we don't think that
users will want this, so let's not do it by default. But if it's so
useless as to not be worth a command-line option, then it was a
mistake to put it into contrib at all. Let's expose all the things we
have, and try to set the defaults according to what we expect to be
most useful.

> * heapallindexed is kind of expensive, but valuable. But the extra
> check is probably less likely to help on the second or subsequent
> index on a table.
>
> It might be worth considering an option that only uses it with only
> one index: Preferably the primary key index, failing that some unique
> index, and failing that some other index.

This seems a bit too clever for me. I would prefer a simpler schema,
where we choose the default we think most people will want and use it
for everything -- and allow the user to override.

> Even if your user is just average, they still have one major advantage
> over the architects of pg_amcheck: actual knowledge of the problem in
> front of them.

Quite so.

> I think that you need to have a kind of epistemic modesty with this
> stuff. Okay, we guarantee that the backend won't crash when certain
> amcheck functions are run, based on these caveats. But don't we always
> guarantee something like that? And are the specific caveats actually
> that different in each case, when you get right down to it? A
> guarantee does not exist in a vacuum. It always has implicit
> limitations. For example, any guarantee implicitly comes with the
> caveat "unless I, the guarantor, am wrong".

Yep.

> I'm also suspicious of guarantees like this for less philosophical
> reasons. It seems to me like it solves our problem rather than the
> user's problem. Having data that is so badly corrupt that it's
> difficult to avoid segfaults when we perform some kind of standard
> transformations on it is an appalling state of affairs for the user.
> The segfault itself is very much not the point at all.

I mostly agree with everything you say here, but I think we need to be
careful not to accept the position that seg faults are no big deal.
Consider the following users, all of whom start with a database that
they believe to be non-corrupt:

Alice runs pg_amcheck. It says that nothing is wrong, and that happens
to be true.
Bob runs pg_amcheck. It says that there are problems, and there are.
Carol runs pg_amcheck. It says that nothing is wrong, but in fact
something is wrong.
Dan runs pg_amcheck. It says that there are problems, but in fact
there are none.
Erin runs pg_amcheck. The server crashes.

Alice and Bob are clearly in the best shape here, but Carol and Dan
arguably haven't been harmed very much. Sure, Carol enjoys a false
sense of security, but since she otherwise believed things were OK,
the impact of whatever problems exist is evidently not that bad. Dan
is worrying over nothing, but the damage is only to his psyche, not
his database; we can hope he'll eventually sort out what has happened
without grave consequences. Erin, on the other hand, is very possibly
in a lot of trouble with her boss and her coworkers. She had what
seemed to be a healthy database, and from their perspective, she shot
it in the head without any real cause. It will be faint consolation to
her and her coworkers that the database was corrupt all along: until
she ran the %$! tool, they did not have a problem that affected the
ability of their business to generate revenue. Now they had an outage,
and that does.

While I obviously haven't seen this exact scenario play out for a
customer, because pg_amcheck is not committed, I have seen similar
scenarios over and over. It's REALLY bad when the database goes down.
Then the application goes down, and then it gets really ugly. As long
as the database was just returning wrong answers or eating data,
nobody's boss really cared that much, but now that it's down, they
care A LOT. This is of course not to say that nobody cares about the
accuracy of results from the database: many people care a lot, and
that's why it's good to have tools like this. But we should not
underestimate the horror caused by a crash. A working database, even
with some wrong data in it, is a problem people would probably like to
get fixed. A down database is an emergency. So I think we should
actually get a lot more serious about ensuring that corrupt data on
disk doesn't cause crashes, even for regular SELECT statements. I
don't think we can take an arbitrary performance hit to get there,
which is a challenge, but I do think that even a brief outage is
nothing to take lightly.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Thu, Nov 19, 2020 at 12:06 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I originally intended to review the docs and regression tests in the
> same email as the patch itself, but this email has gotten rather long
> and taken rather longer to get together than I had hoped, so I'm going
> to stop here for now and come back to that stuff.

Broad question: Does pg_amcheck belong in src/bin, or in contrib? You
have it in the latter place, but I'm not sure if that's the right
idea. I'm not saying it *isn't* the right idea, but I'm just wondering
what other people think.

Now, on to the docs:

+  Currently, this requires execute privileges on <xref linkend="amcheck"/>'s
+  <function>bt_index_parent_check</function> and
<function>verify_heapam</function>

This makes me wonder why there isn't an option to call
bt_index_check() rather than bt_index_parent_check().

It doesn't seem to be standard practice to include the entire output
of the command's --help option in the documentation. That means as
soon as anybody changes anything they've got to change the
documentation too. I don't see anything like that in the pages for
psql or vacuumlo or pg_verifybackup. It also doesn't seem like a
useful thing to do. Anyone who is reading the documentation probably
is in a position to try --help if they wish; they don't need that
duplicated here.

Looking at those other pages, what seems to be typical for an SGML reference page is
to list all the options and give a short paragraph on what each one
does. What you have instead is a narrative description. I recommend
looking over the reference page for one of those other command-line
utilities and adapting it to this case.

Back to the code:

+static const char *
+get_index_relkind_quals(void)
+{
+       if (!index_relkind_quals)
+               index_relkind_quals = psprintf("'%c'", RELKIND_INDEX);
+       return index_relkind_quals;
+}

I feel like there ought to be a way to work this out at compile time
rather than leaving it to runtime. I think that replacing the function
body with "return CppAsString2(RELKIND_INDEX);" would have the same
result, and once you do that you don't really need the function any
more. This is arguably cheating a bit: RELKIND_INDEX is defined as 'i'
and CppAsString2() turns that into a string containing those three
characters. That happens to work because what we want to do is quote
this for use in SQL, and SQL happens to use single quotes for literals
just like C does for individual characters. It would be more elegant to
figure out a way to interpolate just the character into a C string, but
I don't know of a macro trick that will do that. I think one could
write char something[] = { '\'', RELKIND_INDEX, '\'', '\0' } but that
would be pretty darn awkward for the table case where you want an ANY
with three relkinds in there.

But maybe you could get around that by changing the query slightly.
Suppose instead of relkind = BLAH, you write POSITION(relkind IN '%s') > 0.
Then you could just have the caller pass either:

char index_relkinds[] = { RELKIND_INDEX, '\0' };
-or-
char table_relkinds[] = { RELKIND_RELATION, RELKIND_MATVIEW,
RELKIND_TOASTVALUE, '\0' };

The patch actually has RELKIND_PARTITIONED_TABLE there rather than
RELKIND_RELATION, but that seems wrong to me, because partitioned
tables don't have storage, and toast tables do. And if we're going to
include RELKIND_PARTITIONED_TABLE for some reason, then why not
RELKIND_PARTITIONED_INDEX for the index case?
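
A rough sketch of how the two variants might be wired into the catalog query (the header choices, the query fragment,
and the helper name here are illustrative, not taken from the patch):

    #include "postgres_fe.h"
    #include "catalog/pg_class_d.h"     /* RELKIND_* constants */
    #include "pqexpbuffer.h"

    /* Variant 1: compile-time stringification; expands to the SQL literal 'i'. */
    #define INDEX_RELKIND_QUAL  CppAsString2(RELKIND_INDEX)

    /* Variant 2: membership test against a caller-supplied string of relkind characters. */
    static void
    append_relkind_qual(PQExpBuffer sql, const char *relkinds)
    {
        appendPQExpBuffer(sql, " AND POSITION(c.relkind IN '%s') > 0", relkinds);
    }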

On the tests:

I think 003_check.pl needs to stop and restart the server between
populating the tables and corrupting them. Otherwise, how do we know
that the subsequent checks are going to actually see the corruption
rather than something already cached in memory?

There are some philosophical questions to consider too, about how
these tests are written and what our philosophy ought to be here, but
I am again going to push that off to a future email.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Nov 19, 2020, at 11:47 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>
>> I think in general you're worrying too much about the possibility of
>> this tool causing backend crashes. I think it's good that you wrote
>> the heapcheck code in a way that's hardened against that, and I think
>> we should try to harden other things as time permits. But I don't
>> think that the remote possibility of a crash due to the lack of such
>> hardening should dictate the design behavior of this tool. If the
>> crash possibilities are not remote, then I think the solution is to
>> fix them, rather than cutting out important checks.
>
> I couldn't agree more.

Owing to how much run-time overhead it would entail, much of the backend code has not been, and probably will not be,
hardened against corruption.  The amcheck code uses backend code for accessing heaps and indexes.  Only some of those
uses can be preceded with sufficient safety checks to avoid stepping on landmines.  It makes sense to me to have a
"don't run through minefields" option, and a "go ahead, run through minefields" option for pg_amcheck, given that users
in differing situations will have differing business consequences to bringing down the server in question.

As an example that we've already looked at, checking the status of an xid against clog is a dangerous thing to do.  I
wrote a patch to make it safer to query clog (0003) and a patch for pg_amcheck to use the safer interface (0004) and it
looks unlikely either of those will ever be committed.  I doubt other backend hardening is any more likely to get
committed.  It doesn't follow that if crash possibilities are not remote that we should therefore harden the backend.
The performance considerations of the backend are not well aligned with the safety considerations of this tool.  The
backend code is written with the assumption of non-corrupt data, and this tool with the assumption of corrupt data, or
at least a fair probability of corrupt data.  I don't see how any one-hardening-fits-all will ever work.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Peter Geoghegan
Date:
On Thu, Nov 19, 2020 at 1:50 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> It makes sense to me to have a "don't run through minefields" option, and a "go ahead, run through minefields" option
for pg_amcheck, given that users in differing situations will have differing business consequences to bringing down the
server in question.

This kind of framing suggests zero-risk bias to me:

https://en.wikipedia.org/wiki/Zero-risk_bias

It's simply not helpful to think of the risks as "running through a
minefield" versus "not running through a minefield". I also dislike
this framing because in reality nobody runs through a minefield,
unless maybe it's a battlefield and the alternative is probably even
worse. Risks are not discrete -- they're continuous. And they're
situational.

I accept that there are certain reasonable gradations in the degree to
which a segfault is bad, even in contexts in which pg_amcheck runs
into actual serious problems. And as Robert points out, experience
suggests that on average people care about availability the most when
push comes to shove (though I hasten to add that that's not the same
thing as considering a once-off segfault to be the greater evil here).
Even still, I firmly believe that it's a mistake to assign *infinite*
weight to not having a segfault. That is likely to have certain
unintended consequences that could be even worse than a segfault, such
as not detecting pernicious corruption over many months because our
can't-segfault version of core functionality fails to have the same
bugs as the actual core functionality (and thus fails to detect a
problem in the core functionality).

The problem with giving infinite weight to any one bad outcome is that
it makes it impossible to draw reasonable distinctions between it and
some other extreme bad outcome. For example, I would really not like
to get infected with Covid-19. But I also think that it would be much
worse to get infected with Ebola. It follows that Covid-19 must not be
infinitely bad, because if it is then I can't make this useful
distinction -- which might actually matter. If somebody hears me say
this, and takes it as evidence of my lackadaisical attitude towards
Covid-19, I can live with that. I care about avoiding criticism as
much as the next person, but I refuse to prioritize it over all other
things.

> I doubt other backend hardening is any more likely to get committed.

I suspect you're right about that. Because of the risks of causing
real harm to users.

The backend code is obviously *not* written with the assumption that
data cannot be corrupt. There are lots of specific ways in which it is
hardened (e.g., there are many defensive "can't happen" elog()
statements). I really don't know why you insist on this black and
white framing.

--
Peter Geoghegan



Re: new heapcheck contrib module

From
Thomas Munro
Date:
On Tue, Oct 27, 2020 at 5:12 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> The v20 patches 0002, 0003, and 0005 still apply cleanly, but 0004 required a rebase.  (0001 was already committed
last week.)
 
>
> Here is a rebased set of 4 patches, numbered 0002..0005 to be consistent with the previous naming.  There are no
substantial changes.
 

Hi Mark,

The command line stuff fails to build on Windows[1].  I think it's
just missing #include "getopt_long.h" (see
contrib/vacuumlo/vacuumlo.c).

[1] https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.123328



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Nov 19, 2020, at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Oct 26, 2020 at 12:12 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> The v20 patches 0002, 0003, and 0005 still apply cleanly, but 0004 required a rebase.  (0001 was already committed
last week.)
>>
>> Here is a rebased set of 4 patches, numbered 0002..0005 to be consistent with the previous naming.  There are no
substantial changes.
>
> Here's a review of 0002. I basically like the direction this is going
> but I guess nobody will be surprised that there are some things in
> here that I think could be improved.

Thanks for the review!

The tools pg_dump and pg_amcheck both need to allow the user to specify which schemas, tables, and indexes either to
dump or to check.  There are command line options in pg_dump for this purpose, and functions for compiling lists of
corresponding database objects.  In prior versions of the pg_amcheck patch, I did some copy-and-pasting of this logic,
and then had to fix up the copied functions a bit, given that pg_dump has its own ecosystem with things like fatal() and
exit_nicely() and such.

In hindsight, it would have been better to factor these functions out into a shared location.  I have done that,
factoring them into fe_utils, and am attaching a series of patches that accomplishes that refactoring.  Here are some
brief explanations of what these are for.  See also the commit comments in each patch:


v3-0001-Moving-exit_nicely-and-fatal-into-fe_utils.patch

pg_dump allows on-exit callbacks to be registered, which it expects to get called when exit_nicely() is invoked.  It
doesn't work to factor functions out of pg_dump without having this infrastructure, as the functions being factored out
include facilities for logging and exiting on error.  Therefore, moving these functions into fe_utils.


v3-0002-Refactoring-ExecuteSqlQuery-and-related-functions.patch

pg_dump has functions for running queries, but those functions take a pg_dump specific argument of type Archive rather
than PGconn, with the expectation that the Archive's connection will be used.  This has to be cleaned up a bit before
these functions can be moved out of pg_dump to a shared location.  Also, pg_dump has a fixed expectation that when a
query fails, specific steps will be taken to print out the error information and exit.  That's reasonable behavior, but
not all callers will want that.  Since the ultimate goal of this refactoring is to have higher level functions that
translate shell patterns into oid lists, it's reasonable to imagine that not all callers will want to exit if the query
fails.  In particular, pg_amcheck won't want errors to automatically trigger exit() calls, given that pg_amcheck tries
to continue in the face of errors.  Therefore, adding a default error handler that does what pg_dump expects, but with
an eye towards other callers being able to define handlers that behave differently.


v3-0003-Creating-query_utils-frontend-utility.patch

Moving the refactored functions to the shared location in fe_utils.  This is kept separate from 0002 for ease of
review.


v3-0004-Adding-CurrentQueryHandler-logic.patch

Extending the query error handling logic begun in the 0002 patch.  It wasn't appropriate in the pg_dump project, but
now the logic is in fe_utils.


v3-0005-Refactoring-pg_dumpall-functions.patch

Refactoring some remaining functions in the pg_dump project to use the new fe_utils facilities.


v3-0006-Refactoring-expand_schema_name_patterns-and-frien.patch

Refactoring functions in pg_dump that expand a list of patterns into a list of matching database objects.
Specifically, changing them to not take pg_dump specific argument types, just as was done in 0002.


v3-0007-Moving-pg_dump-functions-to-new-file-option_utils.patch

Moving the functions refactored in 0006 into a new location fe_utils/option_utils


v3-0008-Normalizing-option_utils-interface.patch

Reworking the functions moved in 0007 to have a more general purpose interface.  The refactoring in 0006 only went so
far as to make the functions moveable out of pg_dump.  This refactoring is intentionally kept separate for ease of
review.


v3-0009-Adding-contrib-module-pg_amcheck.patch

Adding contrib/pg_amcheck project, about which your review comments below apply.



Not included in this patch set, but generated during the development of this patch set, I refactored
processSQLNamePattern. string_utils mixes the logic for converting a shell-style pattern into a SQL style regex with
the logic of performing the sql query to look up matching database objects.  That makes it hard to look up multiple
patterns in a single query, something that an intermediate version of this patch set was doing.  I ultimately stopped
doing that, as the code was overly complex, but the refactoring of processSQLNamePattern is not over-complicated and
probably has some merit in its own right.  Since it is not related to the pg_amcheck code, I expect that I will be
posting that separately.

Also not included in this patch set, but likely to be in the next rev, is a patch that adds more interesting table and
index corruption via PostgresNode, creating torn pages and such.  That work is complete so far as I know, but I don't
have all the regression tests that use it written yet, so I'll hold off posting it for now.

Not yet written but still needed is the parallelization of the checking.  I'll be working on that for the next patch
set.

There is enough work here in need of review that I'm posting this now, hoping to get feedback on the general direction
I'm going with this.


To your review....

>
> +const char *usage_text[] = {
> +       "pg_amcheck is the PostgreSQL command line frontend for the
> amcheck database corruption checker.",
> +       "",
>
> This looks like a novel approach to the problem of printing out the
> usage() information, and I think that it's inferior to the technique
> used elsewhere of just having a bunch of printf() statements, because
> unless I misunderstand, it doesn't permit localization.

Since contrib modules are not localized, it seemed not to be a problem, but you've raised the question of whether
pg_amcheck might be moved into core.  I've changed it as suggested so that such a move would incur less code churn.  The
advantage to how I had it before was that each line was a bit shorter, making it fit better into the 80 column limit.

> +       "  -b, --startblock             begin checking table(s) at the
> given starting block number",
> +       "  -e, --endblock               check table(s) only up to the
> given ending block number",
> +       "  -B, --toast-startblock       begin checking toast table(s)
> at the given starting block",
> +       "  -E, --toast-endblock         check toast table(s) only up
> to the given ending block",
>
> I am not very convinced by this. What's the use case? If you're just
> checking a single table, you might want to specify a start and end
> block, but then you don't need separate options for the TOAST and
> non-TOAST cases, do you? If I want to check pg_statistic, I'll say
> pg_amcheck -t pg_catalog.pg_statistic. If I want to check the TOAST
> table for pg_statistic, I'll say pg_amcheck -t pg_toast.pg_toast_2619.
> In either case, if I want to check just the first three blocks, I can
> add -b 0 -e 2.

Removed -B, --toast-startblock and -E, --toast-endblock.

>
> +       "  -f, --skip-all-frozen        do NOT check blocks marked as
> all frozen",
> +       "  -v, --skip-all-visible       do NOT check blocks marked as
> all visible",
>
> I think this is using up too many one character option names for too
> little benefit on things that are too closely related. How about, -s,
> --skip=all-frozen|all-visible|none?

I'm already using -s for "strict-names", but I implemented your suggestion with -S, --skip

> And then -v could mean verbose,
> which could trigger things like printing all the queries sent to the
> server, setting PQERRORS_VERBOSE, etc.

I added -v, --verbose as you suggest.

> +       "  -x, --check-indexes          check btree indexes associated
> with tables being checked",
> +       "  -X, --skip-indexes           do NOT check any btree indexes",
> +       "  -i, --index=PATTERN          check the specified index(es) only",
> +       "  -I, --exclude-index=PATTERN  do NOT check the specified index(es)",
>
> This is a lotta controls for something that has gotta have some
> default. Either the default is everything, in which case I don't see
> why I need -x, or it's nothing, in which case I don't see why I need
> -X.

I removed -x, --check-indexes and instead made that the default.

>
> +       "  -c, --check-corrupt          check indexes even if their
> associated table is corrupt",
> +       "  -C, --skip-corrupt           do NOT check indexes if their
> associated table is corrupt",
>
> Ditto. (I think the default be to check corrupt, and there can be an
> option to skip it.)

Likewise, I removed -c, --check-corrupt and made that the default.

> +       "  -a, --heapallindexed         check index tuples against the
> table tuples",
> +       "  -A, --no-heapallindexed      do NOT check index tuples
> against the table tuples",
>
> Ditto. (Not sure what the default should be, though.)

I removed -A, --no-heapallindexed and made that the default.

>
> +       "  -r, --rootdescend            search from the root page for
> each index tuple",
> +       "  -R, --no-rootdescend         do NOT search from the root
> page for each index tuple",
>
> Ditto. (Again, not sure about the default.)

I removed -R, --no-rootdescend and made that the default.   Peter argued elsewhere for removing this altogether, but as
I recall you argued against that, so for now I'm keeping the --rootdescend option.

> I'm also not sure if these descriptions are clear enough, but it may
> also be hard to do a good job in a brief space.

Yes.  Better verbiage welcome.

> Still, comparing this
> to the documentation of heapallindexed makes me rather nervous. This
> is only trying to verify that the index contains all the tuples in the
> heap, not that the values in the heap and index tuples actually match.

This is complicated.  The most reasonable approach from the point of view of somebody running pg_amcheck is to have the
scan of the table and the scan of the index cooperate so that work is not duplicated.  But from the point of view of
amcheck (not pg_amcheck), there is no assumption that the table is being scanned just because the index is being
checked.  I'm not sure how best to resolve this, except that I'd rather punt this to a future version rather than
require the first version of pg_amcheck to deal with it.

> +typedef struct
> +AmCheckSettings
> +{
> +       char       *dbname;
> +       char       *host;
> +       char       *port;
> +       char       *username;
> +} ConnectOptions;
>
> Making the struct name different from the type name seems not good,
> and the struct name also shouldn't be on a separate line.

Fixed.

> +typedef enum trivalue
> +{
> +       TRI_DEFAULT,
> +       TRI_NO,
> +       TRI_YES
> +} trivalue;
>
> Ugh. It's not this patch's fault, but we really oughta move this to
> someplace more centralized.

Not changed in this patch.

> +typedef struct
> ...
> +} AmCheckSettings;
>
> I'm not sure I consider all of these things settings, "db" in
> particular. But maybe that's nitpicking.

It is definitely nitpicking, but I agree with it.  This next patch uses a static variable named "conn" rather than
"settings.db".

> +static void expand_schema_name_patterns(const SimpleStringList *patterns,
> +
>         const SimpleOidList *exclude_oids,
> +
>         SimpleOidList *oids
> +
>         bool strict_names);
>
> This is copied from pg_dump, along with I think at least one other
> function from nearby. Unlike the trivalue case above, this would be
> the first duplication of this logic. Can we push this stuff into
> pgcommon, perhaps?

Yes, these functions were largely copied from pg_dump.  I have moved them out of pg_dump and into fe_utils, but that
was a large enough effort that it deserves its own thread, so I'm creating a thread for that work independent of this
thread.

> +       /*
> +        * Default behaviors for user settable options.  Note that these default
> +        * to doing all the safe checks and none of the unsafe ones,
> on the theory
> +        * that if a user says "pg_amcheck mydb" without specifying
> any additional
> +        * options, we should check everything we know how to check without
> +        * risking any backend aborts.
> +        */
>
> This to me seems too conservative. The result is that by default we
> check only tables, not indexes. I don't think that's going to be what
> users want.

Checking indexes has been made the default, as discussed above.

> I don't know whether they want the heapallindexed or
> rootdescend behaviors for index checks, but I think they want their
> indexes checked. Happy to hear opinions from actual users on what they
> want; this is just me guessing that you've guessed wrong. :-)

The heapallindexed and rootdescend options still exist but are false by default.

> +               if (settings.db == NULL)
> +               {
> +                       pg_log_error("no connection to server after
> initial attempt");
> +                       exit(EXIT_BADCONN);
> +               }
>
> I think this is documented as meaning out of memory, and reported that
> way elsewhere. Anyway I am going to keep complaining until there are
> no cases where we tell the user it broke without telling them what
> broke. Which means this bit is a problem too:
>
> +       if (!settings.db)
> +       {
> +               pg_log_error("no connection to server");
> +               exit(EXIT_BADCONN);
> +       }
>
> Something went wrong, good luck figuring out what it was!

I have changed this to more closely follow the behavior in scripts/common.c:connectDatabase.  If pg_amcheck were moved
into src/bin/scripts, I could just use that function outright.

> +       /*
> +        * All information about corrupt indexes are returned via
> ereport, not as
> +        * tuples.  We want all the details to report if corruption exists.
> +        */
> +       PQsetErrorVerbosity(settings.db, PQERRORS_VERBOSE);
>
> Really? Why? If I need the source code file name, function name, and
> line number to figure out what went wrong, that is not a great sign
> for the quality of the error reports it produces.

Yeah, you are right about that.  In any event, the user can now specify --verbose if they like and get that extra
information (not that they need it).  I have removed this offending bit of code.

> +                       /*
> +                        * The btree checking logic which optionally
> checks the contents
> +                        * of an index against the corresponding table
> has not yet been
> +                        * sufficiently hardened against corrupt
> tables.  In particular,
> +                        * when called with heapallindexed true, it
> segfaults if the file
> +                        * backing the table relation has been
> erroneously unlinked.  In
> +                        * any event, it seems unwise to reconcile an
> index against its
> +                        * table when we already know the table is corrupt.
> +                        */
> +                       old_heapallindexed = settings.heapallindexed;
> +                       if (corruptions)
> +                               settings.heapallindexed = false;
>
> This seems pretty lame to me. Even if the btree checker can't tolerate
> corruption to the extent that the heap checker does, seg faulting
> because of a missing file seems like a bug that we should just fix
> (and probably back-patch). I'm not very convinced by the decision to
> override the user's decision about heapallindexed either. Maybe I lack
> imagination, but that seems pretty arbitrary. Suppose there's a giant
> index which is missing entries for 5 million heap tuples and also
> there's 1 entry in the table which has an xmin that is less than the
> pg_class.relfrozenxid value by 1. You are proposing that because I have
> the latter problem I don't want you to check for the former one. But
> I, John Q. Smartuser, do not want you to second-guess what I told you
> on the command line that I wanted. :-)

I've removed this bit.  I'm not sure what I was seeing back when I first wrote this code, but I no longer see any
segfaults for missing relation files.

> I think in general you're worrying too much about the possibility of
> this tool causing backend crashes. I think it's good that you wrote
> the heapcheck code in a way that's hardened against that, and I think
> we should try to harden other things as time permits. But I don't
> think that the remote possibility of a crash due to the lack of such
> hardening should dictate the design behavior of this tool. If the
> crash possibilities are not remote, then I think the solution is to
> fix them, rather than cutting out important checks.

Right.  I've been worrying a bit less about this lately, in part because you and Peter are less concerned about it than
I was, and in part because I've been banging away with various test cases and don't see all that much worth worrying
about.

> It doesn't seem like great design to me that get_table_check_list()
> gets just the OID of the table itself, and then later if we decide to
> check the TOAST table we've got to run a separate query for each table
> we want to check to fetch the TOAST OID, when we could've just fetched
> both in get_table_check_list() by including two columns in the query
> rather than one and it would've been basically free. Imagine if some
> user wrote a query that fetched the primary key value for all their
> rows and then had their application run a separate query to fetch the
> entire contents of each of those rows, said contents consisting of one
> more integer. And then suppose they complained about performance. We'd
> tell them they were doing it wrong, and so here.

Good points.  I've changed get_table_check_list to query both the main table and toast table oids as you suggest.

> +       if (settings.db == NULL)
> +               fatal("no connection on entry to check_table");
>
> Uninformative. Is this basically an Assert? If so maybe just make it
> one. If not maybe fail somewhere else with a better message?

Looking at this again, I don't think it is even worth making it into an Assert, so I just removed it, along with
similar useless checks of the same type elsewhere.

>
> +       if (startblock == NULL)
> +               startblock = "NULL";
> +       if (endblock == NULL)
> +               endblock = "NULL";
>
> It seems like it would be more elegant to initialize
> settings.startblock and settings.endblock to "NULL." However, there's
> also a related problem, which is that the startblock and endblock
> values can be anything, and are interpolated with quoting. I don't
> think that it's good to ship a tool with SQL injection hazards built
> into it. I think that you should (a) check that these values are
> integers during argument parsing and error out if they are not and
> then (b) use either a prepared query or PQescapeLiteral() anyway.

I've changed the logic to use strtol to parse these, and I'm storing them as long rather than as strings.

> +       stop = (on_error_stop) ? "true" : "false";
> +       toast = (check_toast) ? "true" : "false";
>
> The parens aren't really needed here.

True.  Removed.

> +
> printf("(relname=%s,blkno=%s,offnum=%s,attnum=%s)\n%s\n",
> +                                  PQgetvalue(res, i, 0),       /* relname */
> +                                  PQgetvalue(res, i, 1),       /* blkno */
> +                                  PQgetvalue(res, i, 2),       /* offnum */
> +                                  PQgetvalue(res, i, 3),       /* attnum */
> +                                  PQgetvalue(res, i, 4));      /* msg */
>
> I am not quite sure how to format the output, but this looks like
> something designed by an engineer who knows too much about the topic.
> I suspect users won't find the use of things like "relname" and
> "blkno" too easy to understand. At least I think we should say
> "relation, block, offset, attribute" instead of "relname, blkno,
> offnum, attnum". I would probably drop the parenthesis and add spaces,
> so that you end up with something like:
>
> relation "%s", block "%s", offset "%s", attribute "%s":
>
> I would also define variant strings so that we entirely omit things
> that are NULL. e.g. have four strings:
>
> relation "%s":
> relation "%s", block "%s":(
> relation "%s", block "%s", offset "%s":
> relation "%s", block "%s", offset "%s", attribute "%s":
>
> Would it make it more readable if we indented the continuation line by
> four spaces or something?

I tried it that way and agree it looks better, including having the msg line indented four spaces.  Changed.

> +               corruption_cnt++;
> +               printf("%s\n", error);
> +               pfree(error);
>
> Seems like we could still print the relation name in this case, and
> that it would be a good idea to do so, in case it's not in the message
> that the server returns.

We don't know the relation name in this case, only the oid, but I agree that would be useful to have, so I added that.

> The general logic in this part of the code looks a bit strange to me.
> If ExecuteSqlQuery() returns PGRES_TUPLES_OK, we print out the details
> for each returned row. Otherwise, if error = true, we print the error.
> But, what if neither of those things are the case? Then we'd just
> print nothing despite having gotten back some weird response from the
> server. That actually can't happen, because ExecuteSqlQuery() always
> sets *error when the return code is not PGRES_TUPLES_OK, but you
> wouldn't know that from looking at this code.
>
> Honestly, as written, ExecuteSqlQuery() seems like kind of a waste. The
> OrDie() version is useful as a notational shorthand, but this version
> seems to add more confusion than clarity. It has only three callers:
> the ones in check_table() and check_indexes() have the problem
> described above, and the one in get_toast_oid() could just as well be
> using the OrDie() version. And also we should probably get rid of it
> entirely by fetching the toast OIDs the first time around, as
> mentioned above.

These functions have been factored out of pg_dump into fe_utils, so this bit of code review doesn't refer to anything
now.

> check_indexes() lacks a function comment. It seems to have more or
> less the same problem as get_toast_oid() -- an extra query per table
> to get the list of indexes. I guess it has a better excuse: there
> could be lots of indexes per table, and we're fetching multiple
> columns of data for each one, whereas in the TOAST case we are issuing
> an extra query per table to fetch a single integer. But, couldn't we
> fetch information about all the indexes we want to check in one go,
> rather than fetching them separately for each table being checked? I'm
> not sure if that would create too much other complexity, but it seems
> like it would be quicker.

If the --skip-corrupt option is given, we need to only check the indexes associated with a table once the table has
been found to be non-corrupt.  Querying for all the indexes upfront, we'd need to keep information about which table the
index came from, and check that against lists of tables that have been checked, etc.  It seems pretty messy, even more
so when considering the limited list facilities available to frontend code.

I have made no changes in this version, though I'm not rejecting your idea here.  Maybe I'll think of a clean way to do
this for a later patch?

> +       if (settings.db == NULL)
> +               fatal("no connection on entry to check_index");
> +       if (idxname == NULL)
> +               fatal("no index name on entry to check_index");
> +       if (tblname == NULL)
> +               fatal("no table name on entry to check_index");
>
> Again, probably these should be asserts, or if they're not, the error
> should be reported better and maybe elsewhere.
>
> Similarly in some other places, like expand_schema_name_patterns().

I removed these checks entirely.

> +        * The loop below runs multiple SELECTs might sometimes result in
> +        * duplicate entries in the Oid list, but we don't care.
>
> This is missing a which, like the place you copied it from, but the
> version in pg_dumpall.c is better.
>
> expand_table_name_patterns() should be reformatted to not gratuitously
> exceed 80 columns.  Ditto for expand_index_name_patterns().

Refactoring into fe_utils, as mentioned above.

> I sort of expected that this patch might use threads to allow parallel
> checking - seems like it would be a useful feature.

Yes, I think that makes sense, but I'm going to work on that in the next patch.

> I originally intended to review the docs and regression tests in the
> same email as the patch itself, but this email has gotten rather long
> and taken rather longer to get together than I had hoped, so I'm going
> to stop here for now and come back to that stuff.



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jan 6, 2021, at 11:05 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> I have done that, factoring them into fe_utils, and am attaching a series of patches that accomplishes that
refactoring.

The previous set should have been named v30, not v3.  My apologies for any confusion.

The attached patches, v31, are mostly the same, but with "getopt_long.h" included from pg_amcheck.c per Thomas's
review, and a .gitignore file added in contrib/pg_amcheck/



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Nov 19, 2020, at 11:47 AM, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Nov 19, 2020 at 9:06 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm also not sure if these descriptions are clear enough, but it may
>> also be hard to do a good job in a brief space. Still, comparing this
>> to the documentation of heapallindexed makes me rather nervous. This
>> is only trying to verify that the index contains all the tuples in the
>> heap, not that the values in the heap and index tuples actually match.
>
> That's a good point. As things stand, heapallindexed verification does
> not notice when there are extra index tuples in the index that are in
> some way inconsistent with the heap. Hopefully this isn't too much of
> a problem in practice because the presence of extra spurious tuples
> gets detected by the index structure verification process. But in
> general that might not happen.
>
> Ideally heapallindexed verification would verify 1:1 correspondence. It
> doesn't do that right now, but it could.
>
> This could work by having two bloom filters -- one for the heap,
> another for the index. The implementation would look for the absence
> of index tuples that should be in the index initially, just like
> today. But at the end it would modify the index bloom filter by &= it
> with the complement of the heap bloom filter. If any bits are left set
> in the index bloom filter, we go back through the index once more and
> locate index tuples that have at least some matching bits in the index
> bloom filter (we cannot expect all of the bits from each of the hash
> functions used by the bloom filter to still be matches).
>
> From here we can do some kind of lookup for maybe-not-matching index
> tuples that we locate. Make sure that they point to an LP_DEAD line
> item in the heap or something. Make sure that they have the same
> values as the heap tuple if they're still retrievable (i.e. if we
> haven't pruned the heap tuple away already).

This approach sounds very good to me, but beyond the scope of what I'm planning for this release cycle.

>> This to me seems too conservative. The result is that by default we
>> check only tables, not indexes. I don't think that's going to be what
>> users want. I don't know whether they want the heapallindexed or
>> rootdescend behaviors for index checks, but I think they want their
>> indexes checked. Happy to hear opinions from actual users on what they
>> want; this is just me guessing that you've guessed wrong. :-)
>
> My thoughts on these two options:
>
> * I don't think that users will ever want rootdescend verification.
>
> That option exists now because I wanted to have something that relied
> on the uniqueness property of B-Tree indexes following the Postgres 12
> work. I didn't add retail index tuple deletion, so it seemed like a
> good idea to have something that makes the same assumptions that it
> would have to make. To validate the design.
>
> Another factor is that Alexander Korotkov made the basic
> bt_index_parent_check() tests a lot better for Postgres 13. This
> undermined the practical argument for using rootdescend verification.

The latest version of the patch has rootdescend off by default, but a switch to turn it on.  The documentation for that
switch in doc/src/sgml/pgamcheck.sgml summarizes your comments:

+       This form of verification was originally written to help in the
+       development of btree index features.  It may be of limited or even of no
+       use in helping detect the kinds of corruption that occur in practice.
+       In any event, it is known to be a rather expensive check to perform.

For my own self, I don't care if rootdescend is an option in pg_amcheck.  You and Robert expressed somewhat different
opinions, and I tried to split the difference.  I'm happy to go a different direction if that's what the consensus is.

> Finally, note that bt_index_parent_check() was always supposed to be
> something that was to be used only when you already knew that you had
> big problems, and wanted absolutely thorough verification without
> regard for the costs. This isn't the common case at all. It would be
> reasonable to not expose anything from bt_index_parent_check() at all,
> or to give it much less prominence. Not really sure of what the right
> balance is here myself, so I'm not insisting on anything. Just telling
> you what I know about it.

This still needs work.  Currently, there is a switch to turn off index checking, with the checks on by default.  But
there is no switch controlling which kind of check is performed (bt_index_check vs. bt_index_parent_check).  Making
matters more complicated, selecting both rootdescend and bt_index_check wouldn't make sense, as there is no rootdescend
option on that function.  So users would need multiple flags to turn on various options, with some flag combinations
drawing an error about the flags not being mutually compatible.  That's doable, but people may not like that interface.

> * heapallindexed is kind of expensive, but valuable. But the extra
> check is probably less likely to help on the second or subsequent
> index on a table.

There is a switch for enabling this.  It is off by default.

> It might be worth considering an option that only uses it with only
> one index: Preferably the primary key index, failing that some unique
> index, and failing that some other index.

It might make sense for somebody to submit this for a later release.  I don't have any plans to work on this during
this release cycle.

>> I'm not very convinced by the decision to
>> override the user's decision about heapallindexed either.
>
> I strongly agree.

I have removed the override.

>
>> Maybe I lack
>> imagination, but that seems pretty arbitrary. Suppose there's a giant
>> index which is missing entries for 5 million heap tuples and also
>> there's 1 entry in the table which has an xmin that is less than the
>> pg_class.relfrozenxid value by 1. You are proposing that because I have
>> the latter problem I don't want you to check for the former one. But
>> I, John Q. Smartuser, do not want you to second-guess what I told you
>> on the command line that I wanted. :-)
>
> Even if your user is just average, they still have one major advantage
> over the architects of pg_amcheck: actual knowledge of the problem in
> front of them.

There is a switch for skipping index checks on corrupt tables.  By default, the indexes will be checked.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Thomas Munro
Date:
On Fri, Jan 8, 2021 at 6:33 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> The attached patches, v31, are mostly the same, but with "getopt_long.h" included from pg_amcheck.c per Thomas's
> review, and a .gitignore file added in contrib/pg_amcheck/
 

A couple more little things from Windows CI:

    C:\projects\postgresql\src\include\fe_utils/option_utils.h(19):
fatal error C1083: Cannot open include file: 'libpq-fe.h': No such
file or directory [C:\projects\postgresql\pg_amcheck.vcxproj]

Does contrib/amcheck/Makefile need to say "SHLIB_PREREQS =
submake-libpq" like other contrib modules that use libpq?

    pg_backup_utils.obj : error LNK2001: unresolved external symbol
exit_nicely [C:\projects\postgresql\pg_dump.vcxproj]

I think this is probably because additions to src/fe_utils/Makefile's
OBJS list need to be manually replicated in
src/tools/msvc/Mkvcbuild.pm's @pgfeutilsfiles list.  (If I'm right
about that, perhaps it needs a comment to remind us Unix hackers of
that, or perhaps it should be automated...)



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jan 10, 2021, at 12:41 PM, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Fri, Jan 8, 2021 at 6:33 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>> The attached patches, v31, are mostly the same, but with "getopt_long.h" included from pg_amcheck.c per Thomas's
>> review, and a .gitignore file added in contrib/pg_amcheck/
>
> A couple more little things from Windows CI:
>
>    C:\projects\postgresql\src\include\fe_utils/option_utils.h(19):
> fatal error C1083: Cannot open include file: 'libpq-fe.h': No such
> file or directory [C:\projects\postgresql\pg_amcheck.vcxproj]
>
> Does contrib/amcheck/Makefile need to say "SHLIB_PREREQS =
> submake-libpq" like other contrib modules that use libpq?

Added in v32.

>    pg_backup_utils.obj : error LNK2001: unresolved external symbol
> exit_nicely [C:\projects\postgresql\pg_dump.vcxproj]
>
> I think this is probably because additions to src/fe_utils/Makefile's
> OBJS list need to be manually replicated in
> src/tools/msvc/Mkvcbuild.pm's @pgfeutilsfiles list.  (If I'm right
> about that, perhaps it needs a comment to remind us Unix hackers of
> that, or perhaps it should be automated...)

Added in v32, along with adding pg_amcheck to @contrib_uselibpq, @contrib_uselibpgport, and @contrib_uselibpgcommon

There are also a few additions in v32 to typedefs.list, and some whitespace changes due to running pgindent.




—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Robert Haas
Date:
On Mon, Jan 11, 2021 at 1:16 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Added in v32, along with adding pg_amcheck to @contrib_uselibpq, @contrib_uselibpgport, and @contrib_uselibpgcommon

exit_utils.c fails to achieve the goal of making this code independent
of pg_dump, because of:

#ifdef WIN32
        if (parallel_init_done && GetCurrentThreadId() != mainThreadId)
                _endthreadex(code);
#endif

parallel_init_done is a pg_dump-ism. Perhaps this chunk of code could
be a handler that gets registered using exit_nicely() rather than
hard-coded like this. Note that the function comments for
exit_nicely() are heavily implicated in this problem, since they also
apply to stuff that only happens in pg_dump and not other utilities.
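
For illustration, a rough sketch of the registered-handler shape being described; the function names, and the
assumption that pg_dump's parallel.h exposes parallel_init_done and mainThreadId, are mine, not from any posted patch:

#include "pg_backup_utils.h"	/* exit_nicely(), on_exit_nicely() */
#include "parallel.h"			/* parallel_init_done, mainThreadId (pg_dump) */

/* hypothetical handler keeping the pg_dump-specific logic out of exit_nicely() */
static void
win32_thread_exit_handler(int code, void *arg)
{
#ifdef WIN32
	if (parallel_init_done && GetCurrentThreadId() != mainThreadId)
		_endthreadex(code);
#endif
}

/* registered once during pg_dump startup */
static void
register_pg_dump_exit_handlers(void)
{
	on_exit_nicely(win32_thread_exit_handler, NULL);
}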

I'm skeptical about the idea of putting functions into string_utils.c
with names as generic as include_filter() and exclude_filter().
Existing cases like fmtId() and fmtQualifiedId() are not great either,
but I think this is worse and that we should do some renaming. On a
related note, it's not clear to me why these should be classified as
string_utils while stuff like expand_schema_name_patterns() gets
classified as option_utils. These are neither generic
string-processing functions nor are they generic options-parsing
functions. They are functions for expanding shell-glob style patterns
for database object names. And they seem like they ought to be
together, because they seem to do closely-related things. I'm open to
an argument that this is wrongheaded on my part, but it looks weird to
me the way it is.

I'm pretty unimpressed by query_utils.c. The CurrentResultHandler
stuff looks grotty, and you don't seem to really use it anywhere. And
it seems woefully overambitious to me anyway: this doesn't apply to
every kind of "result" we've got hanging around, absolutely nothing
even close to that, even though a name like CurrentResultHandler
sounds very broad. It also means more global variables, which is a
thing of which the PostgreSQL codebase already has a deplorable
oversupply. quiet_handler() and noop_handler() aren't used anywhere
either, AFAICS.

I wonder if it would be better to pass in callbacks rather than
relying on global variables. e.g.:

typedef void (*fatal_error_callback)(const char *fmt,...)
pg_attribute_printf(1, 2) pg_attribute_noreturn();

Then you could have a few helper functions that take an argument of
type fatal_error_callback and throw the right fatal error for (a)
wrong PQresultStatus() and (b) result is not one row. Do you need any
other cases? exiting_handler() seems to think that the caller might
want to allow any number of tuples, or any positive number, or any
particular count, but I'm not sure if all of those cases are really
needed.
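
For illustration, a minimal sketch of such a helper built on the typedef above; the function name, parameters, and
messages are hypothetical, not from the posted patches:

/* assumes #include "libpq-fe.h" and the fatal_error_callback typedef above */
static PGresult *
execute_query_or_die(PGconn *conn, const char *query,
					 ExecStatusType expected_status, int expected_ntups,
					 fatal_error_callback die)
{
	PGresult   *res = PQexec(conn, query);

	if (PQresultStatus(res) != expected_status)
		die("query failed: %s", PQerrorMessage(conn));

	/* a negative expected_ntups could mean "any number of rows is fine" */
	if (expected_ntups >= 0 && PQntuples(res) != expected_ntups)
		die("query returned %d rows, expected %d",
			PQntuples(res), expected_ntups);

	return res;
}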

This stuff is finicky and hard to get right. You don't really want to
create a situation where the same code keeps getting duplicated, or
the behavior's just a little bit inconsistent everywhere, but it also
isn't great to build layers upon layers of abstraction around
something like ExecuteSqlQuery which is, in the end, a four-line
function. I don't think there's any problem with something like
pg_dump having its own function to execute-a-query-or-die. Maybe that
function ends up doing something like
TheGenericFunctionToExecuteOrDie(my_die_fn, the_query), or maybe
pg_dump can just open-code it but have a my_die_fn to pass down to the
glob-expansion stuff, or, well, I don't know.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jan 14, 2021, at 1:13 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Jan 11, 2021 at 1:16 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> Added in v32, along with adding pg_amcheck to @contrib_uselibpq, @contrib_uselibpgport, and @contrib_uselibpgcommon
>
> exit_utils.c fails to achieve the goal of making this code independent
> of pg_dump, because of:
>
> #ifdef WIN32
>        if (parallel_init_done && GetCurrentThreadId() != mainThreadId)
>                _endthreadex(code);
> #endif
>
> parallel_init_done is a pg_dump-ism. Perhaps this chunk of code could
> be a handler that gets registered using exit_nicely() rather than
> hard-coded like this. Note that the function comments for
> exit_nicely() are heavily implicated in this problem, since they also
> apply to stuff that only happens in pg_dump and not other utilities.

The 0001 patch has been restructured to not have this problem.

> I'm skeptical about the idea of putting functions into string_utils.c
> with names as generic as include_filter() and exclude_filter().
> Existing cases like fmtId() and fmtQualifiedId() are not great either,
> but I think this is worse and that we should do some renaming. On a
> related note, it's not clear to me why these should be classified as
> string_utils while stuff like expand_schema_name_patterns() gets
> classified as option_utils. These are neither generic
> string-processing functions nor are they generic options-parsing
> functions. They are functions for expanding shell-glob style patterns
> for database object names. And they seem like they ought to be
> together, because they seem to do closely-related things. I'm open to
> an argument that this is wrongheaded on my part, but it looks weird to
> me the way it is.

The logic to filter which relations are checked is completely restructured and is kept in pg_amcheck.c

> I'm pretty unimpressed by query_utils.c. The CurrentResultHandler
> stuff looks grotty, and you don't seem to really use it anywhere. And
> it seems woefully overambitious to me anyway: this doesn't apply to
> every kind of "result" we've got hanging around, absolutely nothing
> even close to that, even though a name like CurrentResultHandler
> sounds very broad. It also means more global variables, which is a
> thing of which the PostgreSQL codebase already has a deplorable
> oversupply. quiet_handler() and noop_handler() aren't used anywhere
> either, AFAICS.
>
> I wonder if it would be better to pass in callbacks rather than
> relying on global variables. e.g.:
>
> typedef void (*fatal_error_callback)(const char *fmt,...)
> pg_attribute_printf(1, 2) pg_attribute_noreturn();
>
> Then you could have a few helper functions that take an argument of
> type fatal_error_callback and throw the right fatal error for (a)
> wrong PQresultStatus() and (b) result is not one row. Do you need any
> other cases? exiting_handler() seems to think that the caller might
> want to allow any number of tuples, or any positive number, or any
> particular count, but I'm not sure if all of those cases are really
> needed.

The error callback stuff has been refactored in this next patch set, and also now includes handlers for parallel slots,
as the src/bin/scripts/scripts_parallel.c stuff has been moved to fe_utils and made more general.  As it was, there were
hardcoded assumptions that are valid for reindexdb and vacuumdb, but not general enough for pg_amcheck to use.  The
refactoring in patches 0002 through 0005 makes it more generally usable.  Patch 0008 uses it in pg_amcheck.

> This stuff is finicky and hard to get right. You don't really want to
> create a situation where the same code keeps getting duplicated, or
> the behavior's just a little bit inconsistent everywhere, but it also
> isn't great to build layers upon layers of abstraction around
> something like ExecuteSqlQuery which is, in the end, a four-line
> function. I don't think there's any problem with something like
> pg_dump having its own function to execute-a-query-or-die. Maybe that
> function ends up doing something like
> TheGenericFunctionToExecuteOrDie(my_die_fn, the_query), or maybe
> pg_dump can just open-code it but have a my_die_fn to pass down to the
> glob-expansion stuff, or, well, I don't know.

There are some real improvements in this next patch set.

The number of queries issued to the database to determine the databases to use is much reduced.  I had been following
the pattern in pg_dump, but abandoned that for something new.

The parallel slots stuff is now used for parallelism, much like what is done in vacuumdb and reindexdb.

The pg_amcheck application can now be run over one database, multiple specified databases, or all databases.

Relations, schemas, and databases can be included and excluded by pattern, like
"(db1|db2|db3).myschema.(mytable|myindex)". The real-world use-cases for this that I have in mind are things like: 

    pg_amcheck --jobs=12 --all \
        --exclude-relation="db7.schema.known_corrupt_table" \
        --exclude-relation="db*.schema.known_big_table"

and

    pg_amcheck --jobs=20 \
        --include-relation="*.compliance.audited"

I might be missing something, but I think the interface is a superset of the interface from reindexdb and vacuumdb.
None of the new interface stuff (patterns, allowing multiple databases to be given on the command line, etc.) is
required.


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Robert Haas
Date:
I like 0007 quite a bit and am inclined to commit it soon, as it
doesn't depend on the earlier patches. But:

- I think the residual comment in processSQLNamePattern beginning with
"Note:" could use some wordsmithing to account for the new structure
of things -- maybe just "this pass" -> "this function".
- I suggest changing initializations like maxbuf = buf + 2 to maxbuf =
&buf[2] for clarity.

Regarding 0001:

- My preference would be to dump on_exit_nicely_final() and just rely
on order of registration.
- I'm not entirely sure it's a good idea to expose something named
fatal() like this, because that's a fairly short and general name. On
the other hand, it's pretty descriptive and it's not clear why someone
including exit_utils.h would want any other definition. I guess we
can always change it later if it proves to be problematic; it's got a
lot of callers and I guess there's no point in churning the code
without a clear reason.
- I don't quite see why we need this at all. Like, exit_nicely() is a
pg_dump-ism. It would make sense to centralize it if we were going to
use it for pg_amcheck, but you don't. If you were going to, you'd need
to adapt 0003 to use exit_nicely() instead of exit(), but you don't,
nor do you add any other new calls to exit_nicely() anywhere, except
for one in 0002. That makes the PGresultHandler stuff depend on
exit_nicely(), which might be important if you were going to refactor
pg_dump to use that abstraction, but you don't. I'm not opposed to the
idea of centralized exit processing for frontend utilities; indeed, it
seems like a good idea. But this doesn't seem to get us there. AFAICS
it just entangles pg_dump with pg_amcheck unnecessarily in a way that
doesn't really benefit either of them.

Regarding 0002:

- I don't think this is separately committable because it adds an
abstraction but not any uses of that abstraction to demonstrate that
it's actually any good. Perhaps it should just be merged into 0005,
and even into parallel_slot.h vs. having its own header. I'm not
really sure about that, though.
- Is this really much of an abstraction layer? Like, how generic can
this be when the argument list includes ExecStatusType expected_status
and int expected_ntups?
- The logic seems to be very similar to some of the stuff that you
move around in 0003, like executeQuery() and executeCommand(), but it
doesn't get unified. I'm not necessarily saying it should be, but it's
weird to do all this refactoring and end up with something that still
looks like this.

0003, 0004, and 0006 look pretty boring; they are just moving code
around. Is there any point in splitting the code from 0003 across two
files? Maybe it's fine.

If I run pg_amcheck --all -j4 do I get a serialization boundary across
databases? Like, I have to completely finish db1 before I can go onto
db2, even though maybe only one worker is still busy with it?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jan 28, 2021, at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> 
> If I run pg_amcheck --all -j4 do I get a serialization boundary across
> databases? Like, I have to completely finish db1 before I can go onto
> db2, even though maybe only one worker is still busy with it?

Yes, you do.  That's patterned on reindexdb and vacuumdb.

Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jan 28, 2021, at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> I like 0007 quite a bit and am inclined to commit it soon, as it
> doesn't depend on the earlier patches. But:
>
> - I think the residual comment in processSQLNamePattern beginning with
> "Note:" could use some wordsmithing to account for the new structure
> of things -- maybe just "this pass" -> "this function".
> - I suggest changing initializations like maxbuf = buf + 2 to maxbuf =
> &buf[2] for clarity.

Ok, I should be able to get you an updated version of 0007 with those changes here soon for you to commit.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Robert Haas
Date:
On Thu, Jan 28, 2021 at 12:40 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> > On Jan 28, 2021, at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > If I run pg_amcheck --all -j4 do I get a serialization boundary across
> > databases? Like, I have to completely finish db1 before I can go onto
> > db2, even though maybe only one worker is still busy with it?
>
> Yes, you do.  That's patterned on reindexdb and vacuumdb.

Sounds lame, but fair enough. We can leave that problem for another day.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jan 28, 2021, at 9:49 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jan 28, 2021 at 12:40 PM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>>> On Jan 28, 2021, at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>> If I run pg_amcheck --all -j4 do I get a serialization boundary across
>>> databases? Like, I have to completely finish db1 before I can go onto
>>> db2, even though maybe only one worker is still busy with it?
>>
>> Yes, you do.  That's patterned on reindexdb and vacuumdb.
>
> Sounds lame, but fair enough. We can leave that problem for another day.

Yeah, I agree that it's lame, and should eventually be addressed.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company






Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jan 28, 2021, at 9:41 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
>
>
>> On Jan 28, 2021, at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> I like 0007 quite a bit and am inclined to commit it soon, as it
>> doesn't depend on the earlier patches. But:
>>
>> - I think the residual comment in processSQLNamePattern beginning with
>> "Note:" could use some wordsmithing to account for the new structure
>> of things -- maybe just "this pass" -> "this function".
>> - I suggest changing initializations like maxbuf = buf + 2 to maxbuf =
>> &buf[2] for clarity.
>
> Ok, I should be able to get you an updated version of 0007 with those changes here soon for you to commit.

I made those changes, and fixed a bug that would impact the pg_amcheck callers.  I'll have to extend the regression
test coverage in 0008 since it obviously wasn't caught, but that's not part of this patch since there are no callers
that use the dbname.schema.relname format as yet.

This is the only patch for v34, since you want to commit it separately.  It's renamed as 0001 here....



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jan 28, 2021, at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Attached is patch set 35.  Per your review comments, I have restructured the patches in the following way:

v33's 0007 is now the first patch, v35's 0001

v33's 0001 is no more.  The frontend infrastructure for error handling and exiting may be resubmitted someday in
another patch, but it isn't necessary for pg_amcheck.

v33's 0002 is no more.  The PGresultHandler stuff that it defined inspires some of what comes later in v35's 0003, but
it isn't sufficiently similar to what v35 does to be thought of as moving from v33-0002 into v35-0003.

v33's 0003, 0004 and 0006 are combined into v35's 0002

v33's 0005 becomes v35's 0003

v33's 0007 becomes v35's 0004

Additionally, pg_amcheck testing is extended beyond what v33 had in v35's new 0005 patch, but pg_amcheck doesn't depend
on this new 0005 patch ever being committed, so if you don't like it, just throw it in the bit bucket.

>
> I like 0007 quite a bit and am inclined to commit it soon, as it
> doesn't depend on the earlier patches. But:
>
> - I think the residual comment in processSQLNamePattern beginning with
> "Note:" could use some wordsmithing to account for the new structure
> of things -- maybe just "this pass" -> "this function".
> - I suggest changing initializations like maxbuf = buf + 2 to maxbuf =
> &buf[2] for clarity

Already responded to this in the v34 development a few days ago.  Nothing meaningfully changes between 34 and 35.

> Regarding 0001:
>
> - My preference would be to dump on_exit_nicely_final() and just rely
> on order of registration.
> - I'm not entirely sure it's a good idea to expose something named
> fatal() like this, because that's a fairly short and general name. On
> the other hand, it's pretty descriptive and it's not clear why someone
> including exit_utils.h would want any other definition. I guess we
> can always change it later if it proves to be problematic; it's got a
> lot of callers and I guess there's no point in churning the code
> without a clear reason.
> - I don't quite see why we need this at all. Like, exit_nicely() is a
> pg_dump-ism. It would make sense to centralize it if we were going to
> use it for pg_amcheck, but you don't. If you were going to, you'd need
> to adapt 0003 to use exit_nicely() instead of exit(), but you don't,
> nor do you add any other new calls to exit_nicely() anywhere, except
> for one in 0002. That makes the PGresultHandler stuff depend on
> exit_nicely(), which might be important if you were going to refactor
> pg_dump to use that abstraction, but you don't. I'm not opposed to the
> idea of centralized exit processing for frontend utilities; indeed, it
> seems like a good idea. But this doesn't seem to get us there. AFAICS
> it just entangles pg_dump with pg_amcheck unnecessarily in a way that
> doesn't really benefit either of them.

Removed from v35.

> Regarding 0002:
>
> - I don't think this is separately committable because it adds an
> abstraction but not any uses of that abstraction to demonstrate that
> it's actually any good. Perhaps it should just be merged into 0005,
> and even into parallel_slot.h vs. having its own header. I'm not
> really sure about that, though

Yeah, this is gone from v35, with hints of it moved into 0003 as part of the parallel slots refactoring.

> - Is this really much of an abstraction layer? Like, how generic can
> this be when the argument list includes ExecStatusType expected_status
> and int expected_ntups?

The new format takes a void *context argument.

> - The logic seems to be very similar to some of the stuff that you
> move around in 0003, like executeQuery() and executeCommand(), but it
> doesn't get unified. I'm not necessarily saying it should be, but it's
> weird to do all this refactoring and end up with something that still
> looks this

Yeah, I agree with this.  The refactoring is a lot less ambitious in v35, to avoid these issues.

> 0003, 0004, and 0006 look pretty boring; they are just moving code
> around. Is there any point in splitting the code from 0003 across two
> files? Maybe it's fine.

Combined.

> If I run pg_amcheck --all -j4 do I get a serialization boundary across
> databases? Like, I have to completely finish db1 before I can go onto
> db2, even though maybe only one worker is still busy with it?

The command line interface and corresponding semantics for specifying which tables to check, which schemas to check,
and which databases to check should be the same as that for reindexdb and vacuumdb, and the behavior for handing off
those targets to be checked/reindexed/vacuumed through the parallel slots interface should be the same.  It seems a bit
much to refactor reindexdb and vacuumdb to match pg_amcheck when pg_amcheck hasn't been accepted for commit as yet.
If/when that happens, and if the project generally approves of going in this direction, I think the next step will be to
refactor some of this logic out of pg_amcheck into fe_utils and use it from all three utilities.  At that time, I'd like
to tackle the serialization choke point in all three, and handle it in the same way for them all.


For the new v35-0005 patch, I have extended PostgresNode.pm with some new corruption abilities.  In short, it can now
take a snapshot of the files that back a relation, and can corruptly roll back those files to prior versions, in full or
in part.  This allows creating kinds of corruption that are hard to create through mere bit twiddling.  For example, if
the relation backing an index is rolled back to a prior version, amcheck's btree checking sees the index as not corrupt,
but when asked to reconcile the entries in the heap with the index, it can see that not all of them are present.  This
gives test coverage of corruption checking functionality that is otherwise hard to achieve.

To check that the PostgresNode.pm changes themselves work, v35-0005 adds src/test/modules/corruption

To check pg_amcheck, and by implication amcheck, v35-0005 adds contrib/pg_amcheck/t/006_relfile_damage.pl

Once again, v35-0005 does not need to be committed -- pg_amcheck works just fine without it.


You and I have discussed this off-list, but for the record, amcheck and pg_amcheck currently only check heaps and btree
indexes.  Other object types, such as sequences and non-btree indexes, are not checked.  Some basic sanity checking of
other object types would be a good addition, and pg_amcheck has been structured in a way where it should be fairly
straightforward to add support for those.  The only such sanity checking that I thought could be done in a short
timeframe was to check that the relation files backing the objects were not missing, and we decided off-list such
checking wasn't worth much, so I didn't add it.





—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Jan 31, 2021, at 4:05 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>
> Attached is patch set 35.

I found some things to improve in the v35 patch set.  Please find attached the v36 patch set, which differs from v35 in
the following ways:

0001 -- no changes

0002 -- fixing omissions in @pgfeutilsfiles in file src/tools/msvc/Mkvcbuild.pm

0003 -- no changes

0004:
  -- Fixes handling of amcheck contrib module installed in non-default schema.
  -- Adds database name to corruption messages to make identifying the relation being complained about unambiguous in
multi-database checks
  -- Fixes an instance where pg_amcheck was querying pg_database without schema-qualifying it
  -- Simplifies some functions in pg_amcheck.c
  -- Updates a comment to reflect the renaming of a variable that the comment mentioned by name

0005 -- fixes =pod added in PostgresNode.pm.  The =pod was grammatically correct so far as I can tell, but rendered
strangely in perldoc.




—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Robert Haas
Date:
On Tue, Feb 2, 2021 at 6:10 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> 0001 -- no changes

Committed.

> 0002 -- fixing omissions in @pgfeutilsfiles in file src/tools/msvc/Mkvcbuild.pm

Here are a few minor cosmetic issues with this patch:

- connect_utils.c lacks a file header comment.
- Some or perhaps all of the other file header comments need an update for 2021.
- There are bogus hunks in the diff for string_utils.c.

I think the rest of this looks good. I spent a long time puzzling over
whether consumeQueryResult() and processQueryResult() needed to be
moved, but then I realized that this patch actually makes them into
static functions inside parallel_slot.c, rather than public functions
as they were before. I like that. The only reason those functions need
to be moved at all is so that the scripts_parallel/parallel_slot stuff
can continue to do its thing, so this is actually a better way of
grouping things together than what we have now.

> 0003 -- no changes

I think it would be better if there were no handler by default, and
failing to set one leads to an assertion failure when we get to the
point where one would be called.

I don't think I understand the point of renaming processQueryResult
and consumeQueryResult. Isn't that just code churn for its own sake?

PGresultHandler seems too generic. How about ParallelSlotHandler or
ParallelSlotResultHandler?

I'm somewhat inclined to propose s/ParallelSlot/ConnectionSlot/g but I
guess it's better not to get sucked into renaming things.

It's a little strange that we end up with mutators to set the slot's
handler and handler context when we elsewhere feel free to monkey with
a slot's connection directly, but it's not a perfect world and I can't
think of anything I'd like better.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Feb 3, 2021, at 2:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Tue, Feb 2, 2021 at 6:10 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
>> 0001 -- no changes
>
> Committed.

Thanks!

>> 0002 -- fixing omissions in @pgfeutilsfiles in file src/tools/msvc/Mkvcbuild.pm

Numbered 0001 in this next patch set.

> Here are a few minor cosmetic issues with this patch:
>
> - connect_utils.c lacks a file header comment.

Fixed

> - Some or perhaps all of the other file header comments need an update for 2021.

Fixed.

> - There's bogus hunks in the diff for string_utils.c.

Removed.

> I think the rest of this looks good. I spent a long time puzzling over
> whether consumeQueryResult() and processQueryResult() needed to be
> moved, but then I realized that this patch actually makes them into
> static functions inside parallel_slot.c, rather than public functions
> as they were before. I like that. The only reason those functions need
> to be moved at all is so that the scripts_parallel/parallel_slot stuff
> can continue to do its thing, so this is actually a better way of
> grouping things together than what we have now.


>> 0003 -- no changes

Numbered 0002 in this next patch set.

> I think it would be better if there were no handler by default, and
> failing to set one leads to an assertion failure when we get to the
> point where one would be called.

Changed to have no default handler, and to use Assert(PointerIsValid(handler)) as you suggest.
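
For reference, a rough sketch of what that dispatch point might look like; the ParallelSlot field names here
(handler, handler_context, connection) are assumptions for illustration, not a verbatim excerpt from the patch:

/* assumes #include "fe_utils/parallel_slot.h" */
static bool
dispatch_result(ParallelSlot *slot, PGresult *res)
{
	/* with no default handler, forgetting to install one trips the Assert */
	Assert(PointerIsValid(slot->handler));
	return slot->handler(res, slot->connection, slot->handler_context);
}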

> I don't think I understand the point of renaming processQueryResult
> and consumeQueryResult. Isn't that just code churn for its own sake?

I didn't like the names.  I had to constantly look back where they were defined to remember which of them
processed/consumed all the results and which only processed/consumed one of them.  Part of that problem was that their
names are both singular.  I have restored the names in this next patch set.

> PGresultHandler seems too generic. How about ParallelSlotHandler or
> ParallelSlotResultHandler?

ParallelSlotResultHandler works for me.  I'm using that, and renaming
s/TableCommandSlotHandler/TableCommandResultHandler/ to be consistent.

> I'm somewhat inclined to propose s/ParallelSlot/ConnectionSlot/g but I
> guess it's better not to get sucked into renaming things.

I admit that I lost a fair amount of time on this project because I thought "scripts_parallel.c" and "parallel_slot"
referred to some kind of threading, but only later looked closely enough to see that this is an event loop, not a
parallel threading system.  I don't think "slot" is terribly informative, and if we rename I don't think it needs to be
part of the name we choose.  ConnectionEventLoop would be more intuitive to me than either of
ParallelSlot/ConnectionSlot, but this seems like bikeshedding so I'm going to ignore it for now.

> It's a little strange that we end up with mutators to set the slot's
> handler and handler context when we elsewhere feel free to monkey with
> a slot's connection directly, but it's not a perfect world and I can't
> think of anything I'd like better.

I created those mutators in an earlier version of the patch where the slot had a few more fields to set, and it helped
to have a single function call set all the fields.  I agree it looks less nice now that there are only two fields to
set.


I also made changes to clean up 0003 (formerly numbered 0004)


—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Robert Haas
Date:
On Thu, Feb 4, 2021 at 11:10 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> I also made changes to clean up 0003 (formerly numbered 0004)

"deduplice" is a typo.

I'm not sure that I agree with check_each_database()'s commentary
about why it doesn't make sense to optimize the resolve-the-databases
step. Like, suppose I type 'pg_amcheck sasquatch'. I think the way you
have it coded it's going to tell me that there are no databases to
check, which might make me think I used the wrong syntax or something.
I want it to tell me that sasquatch does not exist. If I happen to be
a cryptid believer, I may reject that explanation as inaccurate, but
at least there's no question about what pg_amcheck thinks the problem
is.

Why does check_each_database() go out of its way to run the main query
without the always-secure search path? If there's a good reason, I
think it deserves a comment saying what the reason is. If there's not
a good reason, then I think it should use the always-secure search
path for 100% of everything. Same question applies to
check_one_database().

ParallelSlotSetHandler(free_slot, VerifyHeapamSlotHandler, sql.data)
could stand to be split over two lines, like you do for the nearby
run_command() call, so that it doesn't go past 80 columns.

I suggest having two variables instead of one for amcheck_schema.
Using the same variable to store the unescaped value and then later
the escaped value is, IMHO, confusing. Whatever you call the escaped
version, I'd rename the function parameters elsewhere to match.

"status = PQsendQuery(conn, sql) == 1" seems a bit uptight to me. Why
not just make status an int and then just "status = PQsendQuery(conn,
sql)" and then test for status != 0? I don't really care if you don't
change this, it's not actually important. But personally I'd rather
code it as if any non-zero value meant success.

I think the pg_log_error() in run_command() could be worded a bit
better. I don't think it's a good idea to try to include the type of
object in there like this, because of the translatability guidelines
around assembling messages from fragments. And I don't think it's good
to say that the check failed because the reality is that we weren't
able to ask for the check to be run in the first place. I would rather
log this as something like "unable to send query: %s". I would also
assume we need to bail out entirely if that happens. I'm not totally
sure what sorts of things can make PQsendQuery() fail but I bet it
boils down to having lost the server connection. Should that occur,
trying to send queries for all of the remaining objects is going to
result in repeating the same error many times, which isn't going to be
what anybody wants. It's unclear to me whether we should give up on
the whole operation but I think we have to at least give up on that
connection... unless I'm confused about what the failure mode is
likely to be here.
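
A small sketch of how those two suggestions could combine; the function name and the exact recovery behavior are
invented, not code from the patch:

/* assumes #include "common/logging.h" and "libpq-fe.h" */
static bool
send_check_query(PGconn *conn, const char *sql)
{
	/* any nonzero return from PQsendQuery() means the query was dispatched */
	if (PQsendQuery(conn, sql) == 0)
	{
		pg_log_error("unable to send query: %s", PQerrorMessage(conn));
		return false;			/* caller gives up on this connection */
	}
	return true;
}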

It looks to me like the user won't be able to tell by the exit code
what happened. What I did with pg_verifybackup, and what I suggest we
do here, is exit(1) if anything went wrong, either in terms of failing
to execute queries or in terms of those queries returning problem
reports. With pg_verifybackup, I thought about trying to make it like
0 => backup OK, 1 => backup not OK, 2 => trouble, but I found it too
hard to distinguish what should be exit(1) and what should be exit(2)
and the coding wasn't trivial either, so I went with the simpler
scheme.

The opening line of appendDatabaseSelect() could be adjusted to put
the regexps parameter on the next line, avoiding awkward wrapping.

If they are being run with a safe search path, the queries in
appendDatabaseSelect(), appendSchemaSelect(), etc. could be run
without all the paranoia. If not, maybe they should be. The casts to
text don't include the paranoia: with an unsafe search path, we need
pg_catalog.text here. Or no cast at all, which seems like it ought to
be fine too. Not quite sure why you are doing all that casting to
text; the datatype is presumably 'name' and ought to collate like
collate "C" which is probably fine.

It would probably be a better idea for appendSchemaSelect to declare a
PQExpBuffer and call initPQExpBuffer just once, and then
resetPQExpBuffer after each use, and finally termPQExpBuffer just
once. The way you have it is not expensive enough to really matter,
but avoiding repeated allocate/free cycles is probably best.
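
For example, the shape being suggested looks roughly like this; the function name and the WHERE-clause fragment being
built are invented for illustration, and escaping is omitted for brevity:

/* assumes #include "pqexpbuffer.h" */
static void
append_pattern_clauses(PQExpBuffer query, const char **patterns, int npatterns)
{
	PQExpBufferData buf;

	initPQExpBuffer(&buf);		/* allocate once */
	for (int i = 0; i < npatterns; i++)
	{
		resetPQExpBuffer(&buf); /* reuse the same allocation each time */
		appendPQExpBuffer(&buf, "n.nspname ~ '%s'", patterns[i]);
		if (i > 0)
			appendPQExpBufferStr(query, " OR ");
		appendPQExpBufferStr(query, buf.data);
	}
	termPQExpBuffer(&buf);		/* free once */
}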

I wonder if a pattern like .foo.bar ends up meaning the same thing as
a pattern like foo.bar, with the empty database name being treated the
same as if nothing were specified.

From the way appendTableCTE() is coded, it seems to me that if I ask
for tables named j* excluding tables named jam* I still might get
toast tables for my jam, which seems wrong.

There does not seem to be any clear benefit to defining CT_TABLE = 0
in this case, so I would let the compiler deal with it. We should not
be depending on that to have any particular numeric value.

Why does pg_amcheck.c have a header file pg_amcheck.h if there's only
one source file? If you had multiple source files then the header
would be a reasonable place to put stuff they all need, but you don't.

Copying the definitions of HEAP_TABLE_AM_OID and BTREE_AM_OID into
pg_amcheck.h or anywhere else seems bad. I think you should just be doing
#include "catalog/pg_am_d.h".

I think I'm out of steam for today but I'll try to look at this more
soon. In general I think this patch and the whole series are pretty
close to being ready to commit, even though there are still things I
think need fixing here and there.

Thanks,

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Thu, Feb 4, 2021 at 11:10 AM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Numbered 0001 in this next patch set.

Hi,

I committed 0001 as you had it and 0002 with some more cleanups. Things I did:

- Adjusted some comments.
- Changed processQueryResult so that it didn't do foo(bar) with foo
being a pointer. Generally we prefer (*foo)(bar) when it can be
confused with a direct function call, but wunk->foo(bar) is also
considered acceptable.
- Changed the return type of ParallelSlotResultHandler to be bool,
because having it return PGresult * seemed to offer no advantages.
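
To illustrate the two call styles mentioned in the second item, with entirely made-up names (assumes "libpq-fe.h"):

typedef bool (*result_cb) (PGresult *res);

typedef struct
{
	result_cb	handler;
} ToySlot;

static bool
invoke(result_cb cb, ToySlot *slot, PGresult *res)
{
	bool		ok;

	ok = (*cb) (res);			/* explicit dereference, so it can't be
								 * mistaken for a direct function call */
	ok = ok && slot->handler(res);	/* the struct access already makes the
									 * indirection obvious */
	return ok;
}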

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: new heapcheck contrib module

From
Mark Dilger
Date:

> On Feb 4, 2021, at 1:04 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Feb 4, 2021 at 11:10 AM Mark Dilger
> <mark.dilger@enterprisedb.com> wrote:
>> I also made changes to clean up 0003 (formerly numbered 0004)
>
> "deduplice" is a typo.

Fixed.

> I'm not sure that I agree with check_each_database()'s commentary
> about why it doesn't make sense to optimize the resolve-the-databases
> step. Like, suppose I type 'pg_amcheck sasquatch'. I think the way you
> have it coded it's going to tell me that there are no databases to
> check, which might make me think I used the wrong syntax or something.
> I want it to tell me that sasquatch does not exist. If I happen to be
> a cryptid believer, I may reject that explanation as inaccurate, but
> at least there's no question about what pg_amcheck thinks the problem
> is.

The way v38 is coded, "pg_amcheck sasquatch" will return a non-zero error code with an error message: database
"sasquatch" does not exist.

The problem only comes up if you run it like one of the following:

  pg_amcheck --maintenance-db postgres sasquatch
  pg_amcheck postgres sasquatch
  pg_amcheck "sasquatch.myschema.mytable"

In each of those, pg_amcheck first connects to the initial database ("postgres" or whatever) and tries to resolve all
databases to check matching patterns like '^(postgres)$' and '^(sasquatch)$' and doesn't find any sasquatch matches, but
also doesn't complain.

In v39, this is changed to complain when patterns do not match.  This can be turned off with --no-strict-names.

> Why does check_each_database() go out of its way to run the main query
> without the always-secure search path? If there's a good reason, I
> think it deserves a comment saying what the reason is. If there's not
> a good reason, then I think it should use the always-secure search
> path for 100% of everything. Same question applies to
> check_one_database().

That bit of code survived some refactoring, but it doesn't make sense to keep it, assuming it ever made sense at all.
Removed in v39.  The calls to connectDatabase will always secure the search_path, so pg_amcheck need not touch that
directly.

> ParallelSlotSetHandler(free_slot, VerifyHeapamSlotHandler, sql.data)
> could stand to be split over two lines, like you do for the nearly
> run_command() call, so that it doesn't go past 80 columns.

Fair enough.  The code has been treated to a pass through pgindent as well.

> I suggest having two variables instead of one for amcheck_schema.
> Using the same variable to store the unescaped value and then later
> the escaped value is, IMHO, confusing. Whatever you call the escaped
> version, I'd rename the function parameters elsewhere to match.

The escaped version is now part of a struct, so there shouldn't be any confusion about this.

> "status = PQsendQuery(conn, sql) == 1" seems a bit uptight to me. Why
> not just make status an int and then just "status = PQsendQuery(conn,
> sql)" and then test for status != 0? I don't really care if you don't
> change this, it's not actually important. But personally I'd rather
> code it as if any non-zero value meant success.

I couldn't remember why I coded it like that, since it doesn't look like my style, then noticed I copied that from
reindexdb.c, upon which this code is patterned.  I agree it looks strange, and I've changed it in v39.  Unlike the call
site in reindexdb, there isn't any reason for pg_amcheck to store the returned value in a variable, so in v39 it
doesn't.

> I think the pg_log_error() in run_command() could be worded a bit
> better. I don't think it's a good idea to try to include the type of
> object in there like this, because of the translatability guidelines
> around assembling messages from fragments. And I don't think it's good
> to say that the check failed because the reality is that we weren't
> able to ask for the check to be run in the first place. I would rather
> log this as something like "unable to send query: %s". I would also
> assume we need to bail out entirely if that happens. I'm not totally
> sure what sorts of things can make PQsendQuery() fail but I bet it
> boils down to having lost the server connection. Should that occur,
> trying to send queries for all of the remaining objects is going to
> result in repeating the same error many times, which isn't going to be
> what anybody wants. It's unclear to me whether we should give up on
> the whole operation but I think we have to at least give up on that
> connection... unless I'm confused about what the failure mode is
> likely to be here.

Changed in v39 to report the error as you suggest.

It will reconnect and retry a command one time on error.  That should cover the case that the connection to the
database was merely lost.  If the second attempt also fails, no further retry of the same command is attempted, though
commands for remaining relation targets will still be attempted, both for the database that had the error and for other
remaining databases in the list.

Assuming something is wrong with "db2", the command `pg_amcheck db1 db2 db3` could result in two failures per relation
in db2 before finally moving on to db3.  That seems pretty awful considering how many relations that could be, but
failing to soldier on in the face of errors seems a strange design for a corruption checking tool.

> It looks to me like the user won't be able to tell by the exit code
> what happened. What I did with pg_verifybackup, and what I suggest we
> do here, is exit(1) if anything went wrong, either in terms of failing
> to execute queries or in terms of those queries returning problem
> reports. With pg_verifybackup, I thought about trying to make it like
> 0 => backup OK, 1 => backup not OK, 2 => trouble, but I found it too
> hard to distinguish what should be exit(1) and what should be exit(2)
> and the coding wasn't trivial either, so I went with the simpler
> scheme.

In v39, exit(1) is used for all errors which are intended to stop the program.  It is important to recognize that
finding corruption is not an error in this sense.  A query to verify_heapam() can fail if the relation's checksums are
bad, and that happens beyond verify_heapam()'s control when the page is not allowed into the buffers.  There can be
errors if the file backing a relation is missing.  There may be other corruption error cases that I have not yet thought
about.  The connections' errors get reported to the user, but pg_amcheck does not exit as a consequence of them.  As
discussed above, failing to send the query to the server is not viewed as a reason to exit, either.  It would be hard to
quantify all the failure modes, but presumably the catalogs for a database could be messed up enough to cause such
failures, and I'm not sure that pg_amcheck should just abort.

>
> The opening line of appendDatabaseSelect() could be adjusted to put
> the regexps parameter on the next line, avoiding awkward wrapping.
>
> If they are being run with a safe search path, the queries in
> appendDatabaseSelect(), appendSchemaSelect(), etc. could be run
> without all the paranoia. If not, maybe they should be. The casts to
> text don't include the paranoia: with an unsafe search path, we need
> pg_catalog.text here. Or no cast at all, which seems like it ought to
> be fine too. Not quite sure why you are doing all that casting to
> text; the datatype is presumably 'name' and ought to collate like
> collate "C" which is probably fine.

In v39, everything is being run with a safe search path, and the paranoia and casts are largely gone.

> It would probably be a better idea for appendSchemaSelect to declare a
> PQExpBuffer and call initPQExpBuffer just once, and then
> resetPQExpBuffer after each use, and finally termPQExpBuffer just
> once. The way you have it is not expensive enough to really matter,
> but avoiding repeated allocate/free cycles is probably best.

I'm not sure what this comment refers to, but this function doesn't exist in v39.

> I wonder if a pattern like .foo.bar ends up meaning the same thing as
> a pattern like foo.bar, with the empty database name being treated the
> same as if nothing were specified.

That's really a question of how patternToSQLRegex parses that string.  In general, "a.b.c" => ("^(a)$", "^(b)$",
"^(c)$"), so I would expect your example to have a database pattern "^()$" which should only match databases with zero
length names, presumably none.  I've added a regression test for this, and indeed that's what it does.

> From the way appendTableCTE() is coded, it seems to me that if I ask
> for tables named j* excluding tables named jam* I still might get
> toast tables for my jam, which seems wrong.

In v39, the query is entirely reworked, so I can't respond directly to this, though I agree that excluding a table
should mean the toast table does not automatically get included.  There is an interaction, though, if you select both
"j*" and "pg_toast.*" and then exclude "jam".

> There does not seem to be any clear benefit to defining CT_TABLE = 0
> in this case, so I would let the compiler deal with it. We should not
> be depending on that to have any particular numeric value.

The enum is removed in v39.

> Why does pg_amcheck.c have a header file pg_amcheck.h if there's only
> one source file? If you had multiple source files then the header
> would be a reasonable place to put stuff they all need, but you don't.

Everything is in pg_amcheck.c now.

> Copying the definitions of HEAP_TABLE_AM_OID and BTREE_AM_OID into
> pg_amcheck.h or anywhere else seems bad. I think you should just be doing
> #include "catalog/pg_am_d.h".

Good point.  Done.

> I think I'm out of steam for today but I'll try to look at this more
> soon. In general I think this patch and the whole series are pretty
> close to being ready to commit, even though there are still things I
> think need fixing here and there.

Reworking the code took a while.  Version 39 patches attached.



—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Attachment

Re: new heapcheck contrib module

From
Robert Haas
Date:
On Wed, Feb 17, 2021 at 1:46 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> It will reconnect and retry a command one time on error.  That should cover the case that the connection to the
> database was merely lost.  If the second attempt also fails, no further retry of the same command is attempted, though
> commands for remaining relation targets will still be attempted, both for the database that had the error and for other
> remaining databases in the list.
>
> Assuming something is wrong with "db2", the command `pg_amcheck db1 db2 db3` could result in two failures per
> relation in db2 before finally moving on to db3.  That seems pretty awful considering how many relations that could be,
> but failing to soldier on in the face of errors seems a strange design for a corruption checking tool.

That doesn't seem right at all. I think a PQsendQuery() failure is so
remote that it's probably justification for giving up on the entire
operation. If it's caused by a problem with some object, it probably
means that accessing that object caused the whole database to go down,
and retrying the object will take the database down again. Retrying
the object is betting that the user interrupted connectivity between
pg_amcheck and the database but the interruption is only momentary and
the user actually wants to complete the operation. That seems unlikely
to me. I think it's far more probable that the database crashed or got
shut down and continuing is futile.

My proposal is: if we get an ERROR trying to *run* a query, give up on
that object but still try the other ones after reconnecting. If we get
a FATAL or PANIC trying to *run* a query, give up on the entire
operation. If even sending a query fails, also give up.
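
A sketch of one way that policy could be implemented, using the error severity reported by libpq; the function name is
invented and this is not code from the patch (assumes <string.h>, <stdlib.h>, and "libpq-fe.h"):

static void
handle_query_error(PGresult *res)
{
	const char *sev = PQresultErrorField(res, PG_DIAG_SEVERITY_NONLOCALIZED);

	if (sev != NULL &&
		(strcmp(sev, "FATAL") == 0 || strcmp(sev, "PANIC") == 0))
		exit(1);				/* give up on the entire operation */

	/*
	 * Otherwise it was an ERROR: skip this object, reconnect if the
	 * connection was lost, and move on to the remaining objects.
	 */
}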

> In v39, exit(1) is used for all errors which are intended to stop the program.  It is important to recognize that
> finding corruption is not an error in this sense.  A query to verify_heapam() can fail if the relation's checksums are
> bad, and that happens beyond verify_heapam()'s control when the page is not allowed into the buffers.  There can be
> errors if the file backing a relation is missing.  There may be other corruption error cases that I have not yet thought
> about.  The connections' errors get reported to the user, but pg_amcheck does not exit as a consequence of them.  As
> discussed above, failing to send the query to the server is not viewed as a reason to exit, either.  It would be hard to
> quantify all the failure modes, but presumably the catalogs for a database could be messed up enough to cause such
> failures, and I'm not sure that pg_amcheck should just abort.

I agree that exit(1) should happen after any error intended to stop
the program. But I think it should also happen at the end of the run
if we hit any problems for which we did not stop, so that exit(0)
means your database is healthy.
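
A tiny sketch of that exit-status convention, with invented flag names:

/* exit(0) only if every check could be run and nothing was reported */
static void
finish_and_exit(bool any_errors, bool any_corruption)
{
	exit((any_errors || any_corruption) ? 1 : 0);
}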

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: new heapcheck contrib module

From
Robert Haas
Date:
On Wed, Feb 17, 2021 at 1:46 PM Mark Dilger
<mark.dilger@enterprisedb.com> wrote:
> Reworking the code took a while.  Version 39 patches attached.

Regarding the documentation, I think the Usage section at the top is
far too extensive and duplicates the option description section to far
too great an extent. You have 21 usage examples for a command with 34
options. Even if we think it's a good idea to give a brief summary of
usage, it's got to be brief; we certainly don't need examples of
obscure special-purpose options like --maintenance-db here. Looking
through the commands in "PostgreSQL Client Applications" and
"Additional Supplied Programs," most of them just have a synopsis
section and nothing like this Usage section. Those that do have a
Usage section typically use it for a narrative description of what to
do with the tool (e.g. see pg_test_timing), not a long list of
examples. I'm inclined to think you should nuke all the examples and
incorporate the descriptive text, to the extent that it's needed,
either into the descriptions of the individual options or, if the
behavior spans many options, into the Description section.

A few of these examples could move down into an Examples section at
the bottom, perhaps, but I think 21 is still too many. I'd try to
limit it to 5-7. Just hit the highlights.

I also think that perhaps it's not best to break up the list of
options into so many different categories the way you have. Notice
that for example pg_dump and psql don't do this, instead putting
everything into one ordered list, despite also having a lot of
options. This is arguably worse if you want to understand which
options are related to each other, but it's better if you are just
looking for something based on alphabetical order.

-- 
Robert Haas
EDB: http://www.enterprisedb.com