Thread: new heapcheck contrib module
Hackers, I have been talking with Robert about table corruption that occurs from time to time. The page checksum feature seems sufficient to detect most random corruption problems, but it can't detect "logical" corruption, where the page is valid but inconsistent with the rest of the database cluster. This can happen due to faulty or ill-conceived backup and restore tools, or bad storage, or user error, or bugs in the server itself. (Also, not everyone enables checksums.) The attached module provides the means to scan a relation and sanity check it. Currently, it checks xmin and xmax values against relfrozenxid and relminmxid, and also validates TOAST pointers. If people like this, it could be expanded to perform additional checks. There was a prior v1 patch, discussed offlist with Robert, not posted. Here is v2: — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
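For illustration, here is a minimal sketch of one kind of "logical" TOAST pointer check of the sort described above. The helper name toast_pointer_looks_sane and the specific conditions tested are assumptions for this example, not code from the attached patch, which additionally walks the TOAST chunks behind each pointer.

#include "postgres.h"

#include "utils/rel.h"

/*
 * Illustrative only, not code from the attached patch: one kind of "logical"
 * TOAST pointer check.  The attribute is assumed to have been fetched from a
 * heap tuple as a varlena datum.
 */
static bool
toast_pointer_looks_sane(Relation rel, struct varlena *attr)
{
    struct varatt_external toast_pointer;

    /* Only on-disk external values carry a TOAST pointer to validate. */
    if (!VARATT_IS_EXTERNAL_ONDISK(attr))
        return true;

    /* Copy out the (possibly unaligned) TOAST pointer. */
    memcpy(&toast_pointer, VARDATA_EXTERNAL(attr), sizeof(toast_pointer));

    /* The pointer must reference this relation's own TOAST table ... */
    if (toast_pointer.va_toastrelid != rel->rd_rel->reltoastrelid)
        return false;

    /* ... and its stored sizes must be self-consistent. */
    if (toast_pointer.va_extsize > toast_pointer.va_rawsize - VARHDRSZ)
        return false;

    return true;
}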
On Mon, Apr 20, 2020 at 10:59 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > The attached module provides the means to scan a relation and sanity check it. Currently, it checks xmin and xmax values against relfrozenxid and relminmxid, and also validates TOAST pointers. If people like this, it could be expanded to perform additional checks. Cool. Why not make it part of contrib/amcheck? We talked about the kinds of checks that we'd like to have for a tool like this before: https://postgr.es/m/20161017014605.GA1220186@tornado.leadboat.com -- Peter Geoghegan
On Mon, Apr 20, 2020 at 2:09 PM Peter Geoghegan <pg@bowt.ie> wrote: > Cool. Why not make it part of contrib/amcheck? I wondered if people would suggest that. Didn't take long. The documentation would need some updating, but that's doable. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 20, 2020 at 11:19 AM Robert Haas <robertmhaas@gmail.com> wrote: > I wondered if people would suggest that. Didn't take long. You were the one who pointed out that my first version of contrib/amcheck, which was called "contrib/btreecheck", should have a more general name. And rightly so! The basic interface used for the heap checker functions seems very similar to what amcheck already offers for B-Tree indexes, so it seems very natural to distribute them together. IMV, the problem that we have with amcheck is that it's too hard to use in a top down kind of way. Perhaps there is an opportunity to provide a more top-down interface to an expanded version of amcheck that does heap checking. Something with a high level practical focus, in addition to the low level functions. I'm not saying that Mark should be required to solve that problem, but it certainly seems worth considering now. > The documentation would need some updating, but that's doable. It would also probably need a bit of renaming, so that analogous function names are used. -- Peter Geoghegan
> On Apr 20, 2020, at 11:31 AM, Peter Geoghegan <pg@bowt.ie> wrote: > > IMV, the problem that we have with amcheck is that it's too hard to > use in a top down kind of way. Perhaps there is an opportunity to > provide a more top-down interface to an expanded version of amcheck > that does heap checking. Something with a high level practical focus, > in addition to the low level functions. I'm not saying that Mark > should be required to solve that problem, but it certainly seems worth > considering now. Thanks for your quick response and interest in this submission! Can you elaborate on "top-down"? I'm not sure what that means in this context. I don't mind going further with this project if I understand what you are suggesting. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
I mean an interface that's friendly to DBAs, that verifies an entire database. No custom sql query required. Something that provides a reasonable mix of verification options based on high level directives. All verification methods can be combined in a granular, possibly randomized fashion. Maybe we can make this run in parallel.
For example, maybe your heap checker code sometimes does index probes for a subset of indexes and heap tuples. It's not hard to combine it with the rootdescend stuff from amcheck. It should be composable.
The interface you've chosen is a good starting point. But let's not miss an opportunity to make everything work together.
Peter Geoghegan
(Sent from my phone)
> On Apr 20, 2020, at 12:37 PM, Peter Geoghegan <pg@bowt.ie> wrote: > > I mean an interface that's friendly to DBAs, that verifies an entire database. No custom sql query required. Something that provides a reasonable mix of verification options based on high level directives. All verification methods can be combined in a granular, possibly randomized fashion. Maybe we can make this run in parallel. > > For example, maybe your heap checker code sometimes does index probes for a subset of indexes and heap tuples. It's not hard to combine it with the rootdescend stuff from amcheck. It should be composable. > > The interface you've chosen is a good starting point. But let's not miss an opportunity to make everything work together. Ok, I'll work in that direction and repost when I have something along those lines. Thanks again for your input. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-04-20 10:59:28 -0700, Mark Dilger wrote: > I have been talking with Robert about table corruption that occurs > from time to time. The page checksum feature seems sufficient to > detect most random corruption problems, but it can't detect "logical" > corruption, where the page is valid but inconsistent with the rest of > the database cluster. This can happen due to faulty or ill-conceived > backup and restore tools, or bad storage, or user error, or bugs in > the server itself. (Also, not everyone enables checksums.) This is something we really really really need. I'm very excited to see progress! > From 2a1bc0bb9fa94bd929adc1a408900cb925ebcdd5 Mon Sep 17 00:00:00 2001 > From: Mark Dilger <mark.dilger@enterprisedb.com> > Date: Mon, 20 Apr 2020 08:05:58 -0700 > Subject: [PATCH v2] Adding heapcheck contrib module. > > The heapcheck module introduces a new function for checking a heap > relation and associated toast relation, if any, for corruption. Why not add it to amcheck? I wonder if a mode where heapcheck optionally would only checks non-frozen (perhaps also non-all-visible) regions of a table would be a good idea? Would make it a lot more viable to run this regularly on bigger databases. Even if there's a window to not check some data (because it's frozen before the next heapcheck run). > The attached module provides the means to scan a relation and sanity > check it. Currently, it checks xmin and xmax values against > relfrozenxid and relminmxid, and also validates TOAST pointers. If > people like this, it could be expanded to perform additional checks. > The postgres backend already defends against certain forms of > corruption, by checking the page header of each page before allowing > it into the page cache, and by checking the page checksum, if enabled. > Experience shows that broken or ill-conceived backup and restore > mechanisms can result in a page, or an entire file, being overwritten > with an earlier version of itself, restored from backup. Pages thus > overwritten will appear to have valid page headers and checksums, > while potentially containing xmin, xmax, and toast pointers that are > invalid. We also had a *lot* of bugs that we'd have found a lot earlier, possibly even during development, if we had a way to easily perform these checks. > contrib/heapcheck introduces a function, heapcheck_relation, that > takes a regclass argument, scans the given heap relation, and returns > rows containing information about corruption found within the table. > The main focus of the scan is to find invalid xmin, xmax, and toast > pointer values. It also checks for structural corruption within the > page (such as invalid t_hoff values) that could lead to the backend > aborting should the function blindly trust the data as it finds it. > +typedef struct CorruptionInfo > +{ > + BlockNumber blkno; > + OffsetNumber offnum; > + int16 lp_off; > + int16 lp_flags; > + int16 lp_len; > + int32 attnum; > + int32 chunk; > + char *msg; > +} CorruptionInfo; Adding a short comment explaining what this is for would be good. 
> +/* Internal implementation */ > +void record_corruption(HeapCheckContext * ctx, char *msg); > +TupleDesc heapcheck_relation_tupdesc(void); > + > +void beginRelBlockIteration(HeapCheckContext * ctx); > +bool relBlockIteration_next(HeapCheckContext * ctx); > +void endRelBlockIteration(HeapCheckContext * ctx); > + > +void beginPageTupleIteration(HeapCheckContext * ctx); > +bool pageTupleIteration_next(HeapCheckContext * ctx); > +void endPageTupleIteration(HeapCheckContext * ctx); > + > +void beginTupleAttributeIteration(HeapCheckContext * ctx); > +bool tupleAttributeIteration_next(HeapCheckContext * ctx); > +void endTupleAttributeIteration(HeapCheckContext * ctx); > + > +void beginToastTupleIteration(HeapCheckContext * ctx, > + struct varatt_external *toast_pointer); > +void endToastTupleIteration(HeapCheckContext * ctx); > +bool toastTupleIteration_next(HeapCheckContext * ctx); > + > +bool TransactionIdStillValid(TransactionId xid, FullTransactionId *fxid); > +bool HeapTupleIsVisible(HeapTupleHeader tuphdr, HeapCheckContext * ctx); > +void check_toast_tuple(HeapCheckContext * ctx); > +bool check_tuple_attribute(HeapCheckContext * ctx); > +void check_tuple(HeapCheckContext * ctx); > + > +List *check_relation(Oid relid); > +void check_relation_relkind(Relation rel); Why aren't these static? > +/* > + * record_corruption > + * > + * Record a message about corruption, including information > + * about where in the relation the corruption was found. > + */ > +void > +record_corruption(HeapCheckContext * ctx, char *msg) > +{ Given that you went through the trouble of adding prototypes for all of these, I'd start with the most important functions, not the unimportant details. > +/* > + * Helper function to construct the TupleDesc needed by heapcheck_relation. > + */ > +TupleDesc > +heapcheck_relation_tupdesc() Missing (void) (it's our style, even though you could theoretically not have it as long as you have a prototype). > +{ > + TupleDesc tupdesc; > + AttrNumber maxattr = 8; This 8 is in multiple places, I'd add a define for it. > + AttrNumber a = 0; > + > + tupdesc = CreateTemplateTupleDesc(maxattr); > + TupleDescInitEntry(tupdesc, ++a, "blkno", INT8OID, -1, 0); > + TupleDescInitEntry(tupdesc, ++a, "offnum", INT4OID, -1, 0); > + TupleDescInitEntry(tupdesc, ++a, "lp_off", INT2OID, -1, 0); > + TupleDescInitEntry(tupdesc, ++a, "lp_flags", INT2OID, -1, 0); > + TupleDescInitEntry(tupdesc, ++a, "lp_len", INT2OID, -1, 0); > + TupleDescInitEntry(tupdesc, ++a, "attnum", INT4OID, -1, 0); > + TupleDescInitEntry(tupdesc, ++a, "chunk", INT4OID, -1, 0); > + TupleDescInitEntry(tupdesc, ++a, "msg", TEXTOID, -1, 0); > + Assert(a == maxattr); > + > + return BlessTupleDesc(tupdesc); > +} > +/* > + * heapcheck_relation > + * > + * Scan and report corruption in heap pages or in associated toast relation. > + */ > +Datum > +heapcheck_relation(PG_FUNCTION_ARGS) > +{ > + FuncCallContext *funcctx; > + CheckRelCtx *ctx; > + > + if (SRF_IS_FIRSTCALL()) > + { I think it'd be good to have a version that just returned a boolean. For one, in many cases that's all we care about when scripting things. But also, on a large relation, there could be a lot of errors. > + Oid relid = PG_GETARG_OID(0); > + MemoryContext oldcontext; > + > + /* > + * Scan the entire relation, building up a list of corruption found in > + * ctx->corruption, for returning later. The scan must be performed > + * in a memory context that will survive until after all rows are > + * returned. 
> + */ > + funcctx = SRF_FIRSTCALL_INIT(); > + oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx); > + funcctx->tuple_desc = heapcheck_relation_tupdesc(); > + ctx = (CheckRelCtx *) palloc0(sizeof(CheckRelCtx)); > + ctx->corruption = check_relation(relid); > + ctx->idx = 0; /* start the iterator at the beginning */ > + funcctx->user_fctx = (void *) ctx; > + MemoryContextSwitchTo(oldcontext); Hm. This builds up all the errors in memory. Is that a good idea? I mean for a large relation having one returned value for each tuple could be a heck of a lot of data. I think it'd be better to use the spilling SRF protocol here. It's not like you're benefitting from deferring the tuple construction to the return currently. > +/* > + * beginRelBlockIteration > + * > + * For the given heap relation being checked, as recorded in ctx, sets up > + * variables for iterating over the heap's pages. > + * > + * The caller should have already opened the heap relation, ctx->rel > + */ > +void > +beginRelBlockIteration(HeapCheckContext * ctx) > +{ > + ctx->nblocks = RelationGetNumberOfBlocks(ctx->rel); > + ctx->blkno = InvalidBlockNumber; > + ctx->bstrategy = GetAccessStrategy(BAS_BULKREAD); > + ctx->buffer = InvalidBuffer; > + ctx->page = NULL; > +} > + > +/* > + * endRelBlockIteration > + * > + * Releases resources that were reserved by either beginRelBlockIteration or > + * relBlockIteration_next. > + */ > +void > +endRelBlockIteration(HeapCheckContext * ctx) > +{ > + /* > + * Clean up. If the caller iterated to the end, the final call to > + * relBlockIteration_next will already have released the buffer, but if > + * the caller is bailing out early, we have to release it ourselves. > + */ > + if (InvalidBuffer != ctx->buffer) > + UnlockReleaseBuffer(ctx->buffer); > +} These seem mighty granular and generically named to me. > + * pageTupleIteration_next > + * > + * Advances the state tracked in ctx to the next tuple on the page. > + * > + * Caller should have already set up the iteration via > + * beginPageTupleIteration, and should stop calling when this function > + * returns false. > + */ > +bool > +pageTupleIteration_next(HeapCheckContext * ctx) I don't think this is a naming scheme we use anywhere in postgres. I don't think it's a good idea to add yet more of those. > +{ > + /* > + * Iterate to the next interesting line pointer, if any. Unused, dead and > + * redirect line pointers are of no interest. > + */ > + do > + { > + ctx->offnum = OffsetNumberNext(ctx->offnum); > + if (ctx->offnum > ctx->maxoff) > + return false; > + ctx->itemid = PageGetItemId(ctx->page, ctx->offnum); > + } while (!ItemIdIsUsed(ctx->itemid) || > + ItemIdIsDead(ctx->itemid) || > + ItemIdIsRedirected(ctx->itemid)); This is an odd loop. Part of the test is in the body, part of in the loop header. > +/* > + * Given a TransactionId, attempt to interpret it as a valid > + * FullTransactionId, neither in the future nor overlong in > + * the past. Stores the inferred FullTransactionId in *fxid. > + * > + * Returns whether the xid is newer than the oldest clog xid. > + */ > +bool > +TransactionIdStillValid(TransactionId xid, FullTransactionId *fxid) I don't at all like the naming of this function. This isn't a reliable check. As before, it obviously also shouldn't be static. 
> +{ > + FullTransactionId fnow; > + uint32 epoch; > + > + /* Initialize fxid; we'll overwrite this later if needed */ > + *fxid = FullTransactionIdFromEpochAndXid(0, xid); > + /* Special xids can quickly be turned into invalid fxids */ > + if (!TransactionIdIsValid(xid)) > + return false; > + if (!TransactionIdIsNormal(xid)) > + return true; > + > + /* > + * Charitably infer the full transaction id as being within one epoch ago > + */ > + fnow = ReadNextFullTransactionId(); > + epoch = EpochFromFullTransactionId(fnow); > + *fxid = FullTransactionIdFromEpochAndXid(epoch, xid); So now you're overwriting the fxid value from above unconditionally? > + if (!FullTransactionIdPrecedes(*fxid, fnow)) > + *fxid = FullTransactionIdFromEpochAndXid(epoch - 1, xid); I think it'd be better to do the conversion the following way: *fxid = FullTransactionIdFromU64(U64FromFullTransactionId(fnow) + (int32) (XidFromFullTransactionId(fnow) - xid)); > + if (!FullTransactionIdPrecedes(*fxid, fnow)) > + return false; > + /* The oldestClogXid is protected by CLogTruncationLock */ > + Assert(LWLockHeldByMe(CLogTruncationLock)); > + if (TransactionIdPrecedes(xid, ShmemVariableCache->oldestClogXid)) > + return false; > + return true; > +} Why is this testing oldestClogXid instead of oldestXid? > +/* > + * HeapTupleIsVisible > + * > + * Determine whether tuples are visible for heapcheck. Similar to > + * HeapTupleSatisfiesVacuum, but with critical differences. > + * > + * 1) Does not touch hint bits. It seems imprudent to write hint bits > + * to a table during a corruption check. > + * 2) Gracefully handles xids that are too old by calling > + * TransactionIdStillValid before TransactionLogFetch, thus avoiding > + * a backend abort. I think it'd be better to protect against this by avoiding checks for xids that are older than relfrozenxid. And ones that are newer than ReadNextTransactionId(). But all of those cases should be errors anyway, so it doesn't seem like that should be handled within the visibility routine. > + * 3) Only makes a boolean determination of whether heapcheck should > + * see the tuple, rather than doing extra work for vacuum-related > + * categorization. > + */ > +bool > +HeapTupleIsVisible(HeapTupleHeader tuphdr, HeapCheckContext * ctx) > +{ > + FullTransactionId fxmin, > + fxmax; > + uint16 infomask = tuphdr->t_infomask; > + TransactionId xmin = HeapTupleHeaderGetXmin(tuphdr); > + > + if (!HeapTupleHeaderXminCommitted(tuphdr)) > + { Hm. I wonder if it'd be good to crosscheck the xid committed hint bits with clog? > + else if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuphdr))) > + { > + LWLockRelease(CLogTruncationLock); > + return false; /* HEAPTUPLE_DEAD */ > + } Note that this actually can error out, if xmin is a subtransaction xid, because pg_subtrans is truncated a lot more aggressively than anything else. I think you'd need to filter against subtransactions older than RecentXmin before here, and treat that as an error. > + if (!(infomask & HEAP_XMAX_INVALID) && !HEAP_XMAX_IS_LOCKED_ONLY(infomask)) > + { > + if (infomask & HEAP_XMAX_IS_MULTI) > + { > + TransactionId xmax = HeapTupleGetUpdateXid(tuphdr); > + > + /* not LOCKED_ONLY, so it has to have an xmax */ > + if (!TransactionIdIsValid(xmax)) > + { > + record_corruption(ctx, _("heap tuple with XMAX_IS_MULTI is " > + "neither LOCKED_ONLY nor has a " > + "valid xmax")); > + return false; > + } I think it's bad to have code like this in a routine that's named like a generic visibility check routine. 
> + if (TransactionIdIsInProgress(xmax)) > + return false; /* HEAPTUPLE_DELETE_IN_PROGRESS */ > + > + LWLockAcquire(CLogTruncationLock, LW_SHARED); > + if (!TransactionIdStillValid(xmax, &fxmax)) > + { > + LWLockRelease(CLogTruncationLock); > + record_corruption(ctx, psprintf("tuple xmax = %u (interpreted " > + "as " UINT64_FORMAT > + ") not or no longer valid", > + xmax, fxmax.value)); > + return false; > + } > + else if (TransactionIdDidCommit(xmax)) > + { > + LWLockRelease(CLogTruncationLock); > + return false; /* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */ > + } > + LWLockRelease(CLogTruncationLock); > + /* Ok, the tuple is live */ I don't think random interspersed uses of CLogTruncationLock are a good idea. If you move to only checking visibility after tuple fits into [relfrozenxid, nextXid), then you don't need to take any locks here, as long as a lock against vacuum is taken (which I think this should do anyway). > +/* > + * check_tuple > + * > + * Checks the current tuple as tracked in ctx for corruption. Records any > + * corruption found in ctx->corruption. > + * > + * The caller should have iterated to a tuple via pageTupleIteration_next. > + */ > +void > +check_tuple(HeapCheckContext * ctx) > +{ > + bool fatal = false; Wait, aren't some checks here duplicate with ones in HeapTupleIsVisible()? > + /* Check relminmxid against mxid, if any */ > + if (ctx->infomask & HEAP_XMAX_IS_MULTI && > + MultiXactIdPrecedes(ctx->xmax, ctx->relminmxid)) > + { > + record_corruption(ctx, psprintf("tuple xmax = %u precedes relation " > + "relminmxid = %u", > + ctx->xmax, ctx->relminmxid)); > + } It's pretty weird that the routines here access xmin/xmax/... via HeapCheckContext, but HeapTupleIsVisible() doesn't. > + /* Check xmin against relfrozenxid */ > + if (TransactionIdIsNormal(ctx->relfrozenxid) && > + TransactionIdIsNormal(ctx->xmin) && > + TransactionIdPrecedes(ctx->xmin, ctx->relfrozenxid)) > + { > + record_corruption(ctx, psprintf("tuple xmin = %u precedes relation " > + "relfrozenxid = %u", > + ctx->xmin, ctx->relfrozenxid)); > + } > + > + /* Check xmax against relfrozenxid */ > + if (TransactionIdIsNormal(ctx->relfrozenxid) && > + TransactionIdIsNormal(ctx->xmax) && > + TransactionIdPrecedes(ctx->xmax, ctx->relfrozenxid)) > + { > + record_corruption(ctx, psprintf("tuple xmax = %u precedes relation " > + "relfrozenxid = %u", > + ctx->xmax, ctx->relfrozenxid)); > + } these all should be fatal. You definitely cannot just continue afterwards given the justification below: > + /* > + * Iterate over the attributes looking for broken toast values. This > + * roughly follows the logic of heap_deform_tuple, except that it doesn't > + * bother building up isnull[] and values[] arrays, since nobody wants > + * them, and it unrolls anything that might trip over an Assert when > + * processing corrupt data. > + */ > + beginTupleAttributeIteration(ctx); > + while (tupleAttributeIteration_next(ctx) && > + check_tuple_attribute(ctx)) > + ; > + endTupleAttributeIteration(ctx); > +} I really don't find these helpers helpful. > +/* > + * check_relation > + * > + * Checks the relation given by relid for corruption, returning a list of all > + * it finds. > + * > + * The caller should set up the memory context as desired before calling. > + * The returned list belongs to the caller. 
> + */ > +List * > +check_relation(Oid relid) > +{ > + HeapCheckContext ctx; > + > + memset(&ctx, 0, sizeof(HeapCheckContext)); > + > + /* Open the relation */ > + ctx.relid = relid; > + ctx.corruption = NIL; > + ctx.rel = relation_open(relid, AccessShareLock); I think you need to protect at least against concurrent schema changes given some of your checks. But I think it'd be better to also conflict with vacuum here. > + check_relation_relkind(ctx.rel); I think you also need to ensure that the table is actually using heap AM, not another tableam. Oh - you're doing that inside the check. But that's confusing, because that's not 'relkind'. > + ctx.relDesc = RelationGetDescr(ctx.rel); > + ctx.rel_natts = RelationGetDescr(ctx.rel)->natts; > + ctx.relfrozenxid = ctx.rel->rd_rel->relfrozenxid; > + ctx.relminmxid = ctx.rel->rd_rel->relminmxid; three naming schemes in three lines... > + /* check all blocks of the relation */ > + beginRelBlockIteration(&ctx); > + while (relBlockIteration_next(&ctx)) > + { > + /* Perform tuple checks */ > + beginPageTupleIteration(&ctx); > + while (pageTupleIteration_next(&ctx)) > + check_tuple(&ctx); > + endPageTupleIteration(&ctx); > + } > + endRelBlockIteration(&ctx); I again do not find this helper stuff helpful. > + /* Close the associated toast table and indexes, if any. */ > + if (ctx.has_toastrel) > + { > + toast_close_indexes(ctx.toast_indexes, ctx.num_toast_indexes, > + AccessShareLock); > + table_close(ctx.toastrel, AccessShareLock); > + } > + > + /* Close the main relation */ > + relation_close(ctx.rel, AccessShareLock); Why the closing here? > +# This regression test demonstrates that the heapcheck_relation() function > +# supplied with this contrib module correctly identifies specific kinds of > +# corruption within pages. To test this, we need a mechanism to create corrupt > +# pages with predictable, repeatable corruption. The postgres backend cannot be > +# expected to help us with this, as its design is not consistent with the goal > +# of intentionally corrupting pages. > +# > +# Instead, we create a table to corrupt, and with careful consideration of how > +# postgresql lays out heap pages, we seek to offsets within the page and > +# overwrite deliberately chosen bytes with specific values calculated to > +# corrupt the page in expected ways. We then verify that heapcheck_relation > +# reports the corruption, and that it runs without crashing. Note that the > +# backend cannot simply be started to run queries against the corrupt table, as > +# the backend will crash, at least for some of the corruption types we > +# generate. > +# > +# Autovacuum potentially touching the table in the background makes the exact > +# behavior of this test harder to reason about. We turn it off to keep things > +# simpler. We use a "belt and suspenders" approach, turning it off for the > +# system generally in postgresql.conf, and turning it off specifically for the > +# test table. > +# > +# This test depends on the table being written to the heap file exactly as we > +# expect it to be, so we take care to arrange the columns of the table, and > +# insert rows of the table, that give predictable sizes and locations within > +# the table page. I have a hard time believing this is going to be really reliable. E.g. the alignment requirements will vary between platforms, leading to different layouts. In particular, MAXALIGN differs between platforms. Also, it's supported to compile postgres with a different pagesize. Greetings, Andres Freund
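For reference, the "spilling SRF protocol" mentioned in the review is materialize mode: each row is pushed into a tuplestore, which can spill to disk, as soon as it is produced, rather than being accumulated in a list. Below is a minimal, self-contained sketch of that pattern; the function name and the single int4 result column are placeholders, not part of the patch.

#include "postgres.h"

#include "fmgr.h"
#include "funcapi.h"
#include "miscadmin.h"
#include "utils/tuplestore.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(spilling_srf_example);

Datum
spilling_srf_example(PG_FUNCTION_ARGS)
{
    ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
    TupleDesc   tupdesc;
    Tuplestorestate *tupstore;
    MemoryContext oldcontext;
    int         i;

    /* The caller must be able to accept a materialized result set. */
    if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo) ||
        !(rsinfo->allowedModes & SFRM_Materialize))
        ereport(ERROR,
                (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
                 errmsg("materialize mode required, but it is not allowed in this context")));

    /* The tuplestore and its descriptor must live in the per-query context. */
    oldcontext = MemoryContextSwitchTo(rsinfo->econtext->ecxt_per_query_memory);
    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
        elog(ERROR, "return type must be a row type");
    tupstore = tuplestore_begin_heap(true, false, work_mem);
    rsinfo->returnMode = SFRM_Materialize;
    rsinfo->setResult = tupstore;
    rsinfo->setDesc = tupdesc;
    MemoryContextSwitchTo(oldcontext);

    /* Emit rows as they are produced; nothing accumulates in a List. */
    for (i = 0; i < 3; i++)
    {
        Datum       values[1];
        bool        nulls[1] = {false};

        values[0] = Int32GetDatum(i);
        tuplestore_putvalues(tupstore, tupdesc, values, nulls);
    }

    return (Datum) 0;
}

At the SQL level such a function would be declared with OUT parameters or RETURNS TABLE so that get_call_result_type() can supply the row type.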
[ retrying from the email address I intended to use ] On Mon, Apr 20, 2020 at 3:42 PM Andres Freund <andres@anarazel.de> wrote: > I don't think random interspersed uses of CLogTruncationLock are a good > idea. If you move to only checking visibility after tuple fits into > [relfrozenxid, nextXid), then you don't need to take any locks here, as > long as a lock against vacuum is taken (which I think this should do > anyway). I think it would be *really* good to avoid ShareUpdateExclusiveLock here. Running with only AccessShareLock would be a big advantage. I agree that any use of CLogTruncationLock should not be "random", but I don't see why the same method we use to make txid_status() safe to expose to SQL shouldn't also be used here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-04-20 15:59:49 -0400, Robert Haas wrote: > On Mon, Apr 20, 2020 at 3:42 PM Andres Freund <andres@anarazel.de> wrote: > > I don't think random interspersed uses of CLogTruncationLock are a good > > idea. If you move to only checking visibility after tuple fits into > > [relfrozenxid, nextXid), then you don't need to take any locks here, as > > long as a lock against vacuum is taken (which I think this should do > > anyway). > > I think it would be *really* good to avoid ShareUpdateExclusiveLock > here. Running with only AccessShareLock would be a big advantage. I > agree that any use of CLogTruncationLock should not be "random", but I > don't see why the same method we use to make txid_status() safe to > expose to SQL shouldn't also be used here. A few billion CLogTruncationLock acquisitions in short order will likely have at least as big an impact as ShareUpdateExclusiveLock held for the duration of the check. That's not really a relevant concern for txid_status(). Per-tuple lock acquisitions aren't great. I think it might be doable to not need either. E.g. we could set the checking backend's xmin to relfrozenxid, and set something like PROC_IN_VACUUM. That should, I think, prevent clog from being truncated in a problematic way (clog truncations look at PROC_IN_VACUUM backends), while not blocking vacuum. The similar concern for ReadNewTransactionId() can probably more easily be addressed, by only calling ReadNewTransactionId() when encountering an xid that's newer than the last value read. I think it'd be good to set PROC_IN_VACUUM (or maybe a separate version of it) while checking anyway. Reading the full relation can take quite a while, and we shouldn't prevent hot pruning while doing so. There are some things we'd need to figure out to be able to use PROC_IN_VACUUM, as that's really only safe in some circumstances. Possibly it'd be easiest to address that if we'd make the check a procedure... Greetings, Andres Freund
On Mon, Apr 20, 2020 at 12:42 PM Andres Freund <andres@anarazel.de> wrote: > This is something we really really really need. I'm very excited to see > progress! +1 My experience with amcheck was that the requirement that we document and verify pretty much every invariant (the details of which differ slightly based on the B-Tree version in use) has had intangible benefits. It helped me come up with a simpler, better design in the first place. Also, many of the benchmarks that I perform get to be a stress-test of the feature itself. It saves quite a lot of testing work in the long run. > I wonder if a mode where heapcheck optionally would only checks > non-frozen (perhaps also non-all-visible) regions of a table would be a > good idea? Would make it a lot more viable to run this regularly on > bigger databases. Even if there's a window to not check some data > (because it's frozen before the next heapcheck run). That's a great idea. It could also make it practical to use the rootdescend verification option to verify indexes selectively -- if you don't have too many blocks to check on average, the overhead is tolerable. This is the kind of thing that naturally belongs in the higher level interface that I sketched already. > We also had a *lot* of bugs that we'd have found a lot earlier, possibly > even during development, if we had a way to easily perform these checks. I can think of a case where it was quite unclear what the invariants for the heap even were, at least temporarily. And this was in the context of fixing a bug that was really quite nasty. Formally defining the invariants in one place, and taking a position on exactly what correct looks like seems like a very valuable exercise. Even without the tool catching a single bug. > I have a hard time believing this is going to be really > reliable. E.g. the alignment requirements will vary between platforms, > leading to different layouts. In particular, MAXALIGN differs between > platforms. Over on another thread, I suggested that Mark might want to have a corruption test framework that exposes some of the bufpage.c routines. The idea is that you can destructively manipulate a page using the logical page interface. Something that works one level below the access method, but one level above the raw page image. It probably wouldn't test everything that Mark wants to test, but it would test some things in a way that seems maintainable to me. -- Peter Geoghegan
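As a concrete illustration of the bufpage.c idea, a hypothetical helper of this sort might look as follows; the name and scope are assumptions, not an existing API.

#include "postgres.h"

#include "access/htup_details.h"
#include "storage/bufpage.h"

/*
 * Hypothetical test helper, not an existing API: corrupt the xmin of one
 * tuple on a raw page image using the line pointer interface, so the test
 * does not depend on MAXALIGN or on hand-computed byte offsets.
 */
static void
corrupt_tuple_xmin(Page page, OffsetNumber offnum, TransactionId bogus_xmin)
{
    ItemId          itemid = PageGetItemId(page, offnum);
    HeapTupleHeader tuphdr;

    if (!ItemIdIsNormal(itemid))
        elog(ERROR, "line pointer %u does not point at a heap tuple",
             (unsigned) offnum);

    tuphdr = (HeapTupleHeader) PageGetItem(page, itemid);
    HeapTupleHeaderSetXmin(tuphdr, bogus_xmin);
}

Because the tuple is located through its line pointer rather than a fixed byte offset, such a helper would behave the same regardless of MAXALIGN or the configured page size.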
On Mon, Apr 20, 2020 at 4:30 PM Andres Freund <andres@anarazel.de> wrote: > A few billion CLogTruncationLock acquisitions in short order will likely > have at least as big an impact as ShareUpdateExclusiveLock held for the > duration of the check. That's not really a relevant concern or > txid_status(). Per-tuple lock acquisitions aren't great. Yeah, that's true. Doing it for every tuple is going to be too much, I think. I was hoping we could avoid that. > I think it might be doable to not need either. E.g. we could set the > checking backend's xmin to relfrozenxid, and set somethign like > PROC_IN_VACUUM. That should, I think, prevent clog from being truncated > in a problematic way (clog truncations look at PROC_IN_VACUUM backends), > while not blocking vacuum. Hmm, OK, I don't know if that would be OK or not. > The similar concern for ReadNewTransactionId() can probably more easily > be addressed, by only calling ReadNewTransactionId() when encountering > an xid that's newer than the last value read. Yeah, if we can cache some things to avoid repetitive calls, that would be good. > I think it'd be good to set PROC_IN_VACUUM (or maybe a separate version > of it) while checking anyway. Reading the full relation can take quite a > while, and we shouldn't prevent hot pruning while doing so. > > There's some things we'd need to figure out to be able to use > PROC_IN_VACUUM, as that's really only safe in some > circumstances. Possibly it'd be easiest to address that if we'd make the > check a procedure... I think we sure want to set things up so that we do this check without holding a snapshot, if we can. Not sure exactly how to get there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Apr 20, 2020 at 12:40 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > Ok, I'll work in that direction and repost when I have something along those lines. Great, thanks! It also occurs to me that the B-Tree checks that amcheck already has have one remaining blindspot: While the heapallindexed verification option has the ability to detect the absence of an index tuple that the dummy CREATE INDEX that we perform under the hood says should be in the index, it cannot do the opposite: It cannot detect the presence of a malformed tuple that shouldn't be there at all, unless the index tuple itself is corrupt. That could miss an inconsistent page image when a few tuples have been VACUUMed away, but still appear in the index. In order to do that, we'd have to have something a bit like the validate_index() heap scan that CREATE INDEX CONCURRENTLY uses. We'd have to get a list of heap TIDs that any index tuple might be pointing to, and then make sure that there were no TIDs in the index that were not in that list -- tuples that were pointing to nothing in the heap at all. This could use the index_bulk_delete() interface. This is the kind of verification option that I might work on for debugging purposes, but not the kind of thing I could really recommend to ordinary users outside of exceptional cases. This is the kind of thing that argues for more or less providing all of the verification functionality we have through both high level and low level interfaces. This isn't likely to be all that valuable most of the time, and users shouldn't have to figure that out for themselves the hard way. (BTW, I think that this could be implemented in an index-AM-agnostic way, I think, so perhaps you can consider adding it too, if you have time.) One last thing for now: take a look at amcheck's bt_tuple_present_callback() function. It has comments about HOT chain corruption that you may find interesting. Note that this check played a role in the "freeze the dead" corruption bug [1] -- it detected that our initial fix for that was broken. It seems like it would be a good idea to go back through the reproducers we've seen for some of the more memorable corruption bugs, and actually make sure that your tool detects them where that isn't clear. History doesn't repeat itself, but it often rhymes. [1] https://postgr.es/m/CAH2-Wznm4rCrhFAiwKPWTpEw2bXDtgROZK7jWWGucXeH3D1fmA@mail.gmail.com -- Peter Geoghegan
On Mon, Apr 20, 2020 at 1:40 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Mon, Apr 20, 2020 at 4:30 PM Andres Freund <andres@anarazel.de> wrote: > > A few billion CLogTruncationLock acquisitions in short order will likely > > have at least as big an impact as ShareUpdateExclusiveLock held for the > > duration of the check. That's not really a relevant concern or > > txid_status(). Per-tuple lock acquisitions aren't great. > > Yeah, that's true. Doing it for every tuple is going to be too much, I > think. I was hoping we could avoid that. What about the visibility map? It would be nice if pg_visibility was merged into amcheck, since it mostly provides integrity checking for the visibility map. Maybe we could just merge the functions that perform verification, and leave other functions (like pg_truncate_visibility_map()) where they are. We could keep the current interface for functions like pg_check_visible(), but also allow the same verification to occur in passing, as part of a higher level check. It wouldn't be so bad if pg_visibility was an expert-only tool. But ISTM that the verification performed by code like collect_corrupt_items() could easily take place at the same time as the new checks that Mark proposes. Possibly only some of the time. It can work in a totally additive way. (Though like Andres I don't really like the current "helper" functions used to iterate through a heap relation; they seem like they'd make this harder.) -- Peter Geoghegan
> On Apr 20, 2020, at 12:42 PM, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2020-04-20 10:59:28 -0700, Mark Dilger wrote: >> I have been talking with Robert about table corruption that occurs >> from time to time. The page checksum feature seems sufficient to >> detect most random corruption problems, but it can't detect "logical" >> corruption, where the page is valid but inconsistent with the rest of >> the database cluster. This can happen due to faulty or ill-conceived >> backup and restore tools, or bad storage, or user error, or bugs in >> the server itself. (Also, not everyone enables checksums.) > > This is something we really really really need. I'm very excited to see > progress! Thanks for the review! >> From 2a1bc0bb9fa94bd929adc1a408900cb925ebcdd5 Mon Sep 17 00:00:00 2001 >> From: Mark Dilger <mark.dilger@enterprisedb.com> >> Date: Mon, 20 Apr 2020 08:05:58 -0700 >> Subject: [PATCH v2] Adding heapcheck contrib module. >> >> The heapcheck module introduces a new function for checking a heap >> relation and associated toast relation, if any, for corruption. > > Why not add it to amcheck? That seems to be the general consensus. The functionality has been moved there, renamed as "verify_heapam", as that seems more in line with the "verify_nbtree" name already present in that module. The docs have also been moved there, although not very gracefully. It seems premature to polish the documentation given that the interface will likely change at least one more time, to incorporate more of Peter's suggestions. There are still design differences between the two implementations that need to be harmonized. The verify_heapam function returns rows detailing the corruption found, which is inconsistent with how verify_nbtree does things. > I wonder if a mode where heapcheck optionally would only checks > non-frozen (perhaps also non-all-visible) regions of a table would be a > good idea? Would make it a lot more viable to run this regularly on > bigger databases. Even if there's a window to not check some data > (because it's frozen before the next heapcheck run). Perhaps we should come back to that. Version 3 of this patch addresses concerns about the v2 patch without adding too many new features. >> The attached module provides the means to scan a relation and sanity >> check it. Currently, it checks xmin and xmax values against >> relfrozenxid and relminmxid, and also validates TOAST pointers. If >> people like this, it could be expanded to perform additional checks. > > >> The postgres backend already defends against certain forms of >> corruption, by checking the page header of each page before allowing >> it into the page cache, and by checking the page checksum, if enabled. >> Experience shows that broken or ill-conceived backup and restore >> mechanisms can result in a page, or an entire file, being overwritten >> with an earlier version of itself, restored from backup. Pages thus >> overwritten will appear to have valid page headers and checksums, >> while potentially containing xmin, xmax, and toast pointers that are >> invalid. > > We also had a *lot* of bugs that we'd have found a lot earlier, possibly > even during development, if we had a way to easily perform these checks. I certainly hope this is useful for testing. >> contrib/heapcheck introduces a function, heapcheck_relation, that >> takes a regclass argument, scans the given heap relation, and returns >> rows containing information about corruption found within the table. 
>> The main focus of the scan is to find invalid xmin, xmax, and toast >> pointer values. It also checks for structural corruption within the >> page (such as invalid t_hoff values) that could lead to the backend >> aborting should the function blindly trust the data as it finds it. > > >> +typedef struct CorruptionInfo >> +{ >> + BlockNumber blkno; >> + OffsetNumber offnum; >> + int16 lp_off; >> + int16 lp_flags; >> + int16 lp_len; >> + int32 attnum; >> + int32 chunk; >> + char *msg; >> +} CorruptionInfo; > > Adding a short comment explaining what this is for would be good. This struct has been removed. >> +/* Internal implementation */ >> +void record_corruption(HeapCheckContext * ctx, char *msg); >> +TupleDesc heapcheck_relation_tupdesc(void); >> + >> +void beginRelBlockIteration(HeapCheckContext * ctx); >> +bool relBlockIteration_next(HeapCheckContext * ctx); >> +void endRelBlockIteration(HeapCheckContext * ctx); >> + >> +void beginPageTupleIteration(HeapCheckContext * ctx); >> +bool pageTupleIteration_next(HeapCheckContext * ctx); >> +void endPageTupleIteration(HeapCheckContext * ctx); >> + >> +void beginTupleAttributeIteration(HeapCheckContext * ctx); >> +bool tupleAttributeIteration_next(HeapCheckContext * ctx); >> +void endTupleAttributeIteration(HeapCheckContext * ctx); >> + >> +void beginToastTupleIteration(HeapCheckContext * ctx, >> + struct varatt_external *toast_pointer); >> +void endToastTupleIteration(HeapCheckContext * ctx); >> +bool toastTupleIteration_next(HeapCheckContext * ctx); >> + >> +bool TransactionIdStillValid(TransactionId xid, FullTransactionId *fxid); >> +bool HeapTupleIsVisible(HeapTupleHeader tuphdr, HeapCheckContext * ctx); >> +void check_toast_tuple(HeapCheckContext * ctx); >> +bool check_tuple_attribute(HeapCheckContext * ctx); >> +void check_tuple(HeapCheckContext * ctx); >> + >> +List *check_relation(Oid relid); >> +void check_relation_relkind(Relation rel); > > Why aren't these static? They are now, except for the iterator style functions, which are gone. >> +/* >> + * record_corruption >> + * >> + * Record a message about corruption, including information >> + * about where in the relation the corruption was found. >> + */ >> +void >> +record_corruption(HeapCheckContext * ctx, char *msg) >> +{ > > Given that you went through the trouble of adding prototypes for all of > these, I'd start with the most important functions, not the unimportant > details. Yeah, good idea. The most important functions are now at the top. >> +/* >> + * Helper function to construct the TupleDesc needed by heapcheck_relation. >> + */ >> +TupleDesc >> +heapcheck_relation_tupdesc() > > Missing (void) (it's our style, even though you could theoretically not > have it as long as you have a prototype). That was unintentional, and is now fixed. >> +{ >> + TupleDesc tupdesc; >> + AttrNumber maxattr = 8; > > This 8 is in multiple places, I'd add a define for it. Done. 
>> + AttrNumber a = 0; >> + >> + tupdesc = CreateTemplateTupleDesc(maxattr); >> + TupleDescInitEntry(tupdesc, ++a, "blkno", INT8OID, -1, 0); >> + TupleDescInitEntry(tupdesc, ++a, "offnum", INT4OID, -1, 0); >> + TupleDescInitEntry(tupdesc, ++a, "lp_off", INT2OID, -1, 0); >> + TupleDescInitEntry(tupdesc, ++a, "lp_flags", INT2OID, -1, 0); >> + TupleDescInitEntry(tupdesc, ++a, "lp_len", INT2OID, -1, 0); >> + TupleDescInitEntry(tupdesc, ++a, "attnum", INT4OID, -1, 0); >> + TupleDescInitEntry(tupdesc, ++a, "chunk", INT4OID, -1, 0); >> + TupleDescInitEntry(tupdesc, ++a, "msg", TEXTOID, -1, 0); >> + Assert(a == maxattr); >> + >> + return BlessTupleDesc(tupdesc); >> +} > > >> +/* >> + * heapcheck_relation >> + * >> + * Scan and report corruption in heap pages or in associated toast relation. >> + */ >> +Datum >> +heapcheck_relation(PG_FUNCTION_ARGS) >> +{ >> + FuncCallContext *funcctx; >> + CheckRelCtx *ctx; >> + >> + if (SRF_IS_FIRSTCALL()) >> + { > > I think it'd be good to have a version that just returned a boolean. For > one, in many cases that's all we care about when scripting things. But > also, on a large relation, there could be a lot of errors. There is now a second parameter to the function, "stop_on_error". The function performs exactly the same checks, but returns after the first page that contains corruption. >> + Oid relid = PG_GETARG_OID(0); >> + MemoryContext oldcontext; >> + >> + /* >> + * Scan the entire relation, building up a list of corruption found in >> + * ctx->corruption, for returning later. The scan must be performed >> + * in a memory context that will survive until after all rows are >> + * returned. >> + */ >> + funcctx = SRF_FIRSTCALL_INIT(); >> + oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx); >> + funcctx->tuple_desc = heapcheck_relation_tupdesc(); >> + ctx = (CheckRelCtx *) palloc0(sizeof(CheckRelCtx)); >> + ctx->corruption = check_relation(relid); >> + ctx->idx = 0; /* start the iterator at the beginning */ >> + funcctx->user_fctx = (void *) ctx; >> + MemoryContextSwitchTo(oldcontext); > > Hm. This builds up all the errors in memory. Is that a good idea? I mean > for a large relation having one returned value for each tuple could be a > heck of a lot of data. > > I think it'd be better to use the spilling SRF protocol here. It's not > like you're benefitting from deferring the tuple construction to the > return currently. Done. >> +/* >> + * beginRelBlockIteration >> + * >> + * For the given heap relation being checked, as recorded in ctx, sets up >> + * variables for iterating over the heap's pages. >> + * >> + * The caller should have already opened the heap relation, ctx->rel >> + */ >> +void >> +beginRelBlockIteration(HeapCheckContext * ctx) >> +{ >> + ctx->nblocks = RelationGetNumberOfBlocks(ctx->rel); >> + ctx->blkno = InvalidBlockNumber; >> + ctx->bstrategy = GetAccessStrategy(BAS_BULKREAD); >> + ctx->buffer = InvalidBuffer; >> + ctx->page = NULL; >> +} >> + >> +/* >> + * endRelBlockIteration >> + * >> + * Releases resources that were reserved by either beginRelBlockIteration or >> + * relBlockIteration_next. >> + */ >> +void >> +endRelBlockIteration(HeapCheckContext * ctx) >> +{ >> + /* >> + * Clean up. If the caller iterated to the end, the final call to >> + * relBlockIteration_next will already have released the buffer, but if >> + * the caller is bailing out early, we have to release it ourselves. 
>> + */ >> + if (InvalidBuffer != ctx->buffer) >> + UnlockReleaseBuffer(ctx->buffer); >> +} > > These seem mighty granular and generically named to me. Removed. >> + * pageTupleIteration_next >> + * >> + * Advances the state tracked in ctx to the next tuple on the page. >> + * >> + * Caller should have already set up the iteration via >> + * beginPageTupleIteration, and should stop calling when this function >> + * returns false. >> + */ >> +bool >> +pageTupleIteration_next(HeapCheckContext * ctx) > > I don't think this is a naming scheme we use anywhere in postgres. I > don't think it's a good idea to add yet more of those. Removed. >> +{ >> + /* >> + * Iterate to the next interesting line pointer, if any. Unused, dead and >> + * redirect line pointers are of no interest. >> + */ >> + do >> + { >> + ctx->offnum = OffsetNumberNext(ctx->offnum); >> + if (ctx->offnum > ctx->maxoff) >> + return false; >> + ctx->itemid = PageGetItemId(ctx->page, ctx->offnum); >> + } while (!ItemIdIsUsed(ctx->itemid) || >> + ItemIdIsDead(ctx->itemid) || >> + ItemIdIsRedirected(ctx->itemid)); > > This is an odd loop. Part of the test is in the body, part of in the > loop header. Refactored. >> +/* >> + * Given a TransactionId, attempt to interpret it as a valid >> + * FullTransactionId, neither in the future nor overlong in >> + * the past. Stores the inferred FullTransactionId in *fxid. >> + * >> + * Returns whether the xid is newer than the oldest clog xid. >> + */ >> +bool >> +TransactionIdStillValid(TransactionId xid, FullTransactionId *fxid) > > I don't at all like the naming of this function. This isn't a reliable > check. As before, it obviously also shouldn't be static. Renamed and refactored. >> +{ >> + FullTransactionId fnow; >> + uint32 epoch; >> + >> + /* Initialize fxid; we'll overwrite this later if needed */ >> + *fxid = FullTransactionIdFromEpochAndXid(0, xid); > >> + /* Special xids can quickly be turned into invalid fxids */ >> + if (!TransactionIdIsValid(xid)) >> + return false; >> + if (!TransactionIdIsNormal(xid)) >> + return true; >> + >> + /* >> + * Charitably infer the full transaction id as being within one epoch ago >> + */ >> + fnow = ReadNextFullTransactionId(); >> + epoch = EpochFromFullTransactionId(fnow); >> + *fxid = FullTransactionIdFromEpochAndXid(epoch, xid); > > So now you're overwriting the fxid value from above unconditionally? > > >> + if (!FullTransactionIdPrecedes(*fxid, fnow)) >> + *fxid = FullTransactionIdFromEpochAndXid(epoch - 1, xid); > > > I think it'd be better to do the conversion the following way: > > *fxid = FullTransactionIdFromU64(U64FromFullTransactionId(fnow) > + (int32) (XidFromFullTransactionId(fnow) - xid)); This has been refactored to the point that these review comments cannot be directly replied to. >> + if (!FullTransactionIdPrecedes(*fxid, fnow)) >> + return false; >> + /* The oldestClogXid is protected by CLogTruncationLock */ >> + Assert(LWLockHeldByMe(CLogTruncationLock)); >> + if (TransactionIdPrecedes(xid, ShmemVariableCache->oldestClogXid)) >> + return false; >> + return true; >> +} > > Why is this testing oldestClogXid instead of oldestXid? References to clog have been refactored out of this module. >> +/* >> + * HeapTupleIsVisible >> + * >> + * Determine whether tuples are visible for heapcheck. Similar to >> + * HeapTupleSatisfiesVacuum, but with critical differences. >> + * >> + * 1) Does not touch hint bits. It seems imprudent to write hint bits >> + * to a table during a corruption check. 
>> + * 2) Gracefully handles xids that are too old by calling >> + * TransactionIdStillValid before TransactionLogFetch, thus avoiding >> + * a backend abort. > > I think it'd be better to protect against this by avoiding checks for > xids that are older than relfrozenxid. And ones that are newer than > ReadNextTransactionId(). But all of those cases should be errors > anyway, so it doesn't seem like that should be handled within the > visibility routine. The new implementation caches a range of expected xids. With the relation locked against concurrent vacuum runs, it can trust that the old end of the range won't move during the course of the scan. The newest end may move, but it only has to check for that when it encounters a newer than expected xid, and it updates the cache with the new maximum. > >> + * 3) Only makes a boolean determination of whether heapcheck should >> + * see the tuple, rather than doing extra work for vacuum-related >> + * categorization. >> + */ >> +bool >> +HeapTupleIsVisible(HeapTupleHeader tuphdr, HeapCheckContext * ctx) >> +{ > >> + FullTransactionId fxmin, >> + fxmax; >> + uint16 infomask = tuphdr->t_infomask; >> + TransactionId xmin = HeapTupleHeaderGetXmin(tuphdr); >> + >> + if (!HeapTupleHeaderXminCommitted(tuphdr)) >> + { > > Hm. I wonder if it'd be good to crosscheck the xid committed hint bits > with clog? This is not done in v3, as it no longer checks clog. >> + else if (!TransactionIdDidCommit(HeapTupleHeaderGetRawXmin(tuphdr))) >> + { >> + LWLockRelease(CLogTruncationLock); >> + return false; /* HEAPTUPLE_DEAD */ >> + } > > Note that this actually can error out, if xmin is a subtransaction xid, > because pg_subtrans is truncated a lot more aggressively than anything > else. I think you'd need to filter against subtransactions older than > RecentXmin before here, and treat that as an error. Calls to TransactionIdDidCommit are now preceded by checks that the xid argument is not too old. >> + if (!(infomask & HEAP_XMAX_INVALID) && !HEAP_XMAX_IS_LOCKED_ONLY(infomask)) >> + { >> + if (infomask & HEAP_XMAX_IS_MULTI) >> + { >> + TransactionId xmax = HeapTupleGetUpdateXid(tuphdr); >> + >> + /* not LOCKED_ONLY, so it has to have an xmax */ >> + if (!TransactionIdIsValid(xmax)) >> + { >> + record_corruption(ctx, _("heap tuple with XMAX_IS_MULTI is " >> + "neither LOCKED_ONLY nor has a " >> + "valid xmax")); >> + return false; >> + } > > I think it's bad to have code like this in a routine that's named like a > generic visibility check routine. Renamed. >> + if (TransactionIdIsInProgress(xmax)) >> + return false; /* HEAPTUPLE_DELETE_IN_PROGRESS */ >> + >> + LWLockAcquire(CLogTruncationLock, LW_SHARED); >> + if (!TransactionIdStillValid(xmax, &fxmax)) >> + { >> + LWLockRelease(CLogTruncationLock); >> + record_corruption(ctx, psprintf("tuple xmax = %u (interpreted " >> + "as " UINT64_FORMAT >> + ") not or no longer valid", >> + xmax, fxmax.value)); >> + return false; >> + } >> + else if (TransactionIdDidCommit(xmax)) >> + { >> + LWLockRelease(CLogTruncationLock); >> + return false; /* HEAPTUPLE_RECENTLY_DEAD or HEAPTUPLE_DEAD */ >> + } >> + LWLockRelease(CLogTruncationLock); >> + /* Ok, the tuple is live */ > > I don't think random interspersed uses of CLogTruncationLock are a good > idea. If you move to only checking visibility after tuple fits into > [relfrozenxid, nextXid), then you don't need to take any locks here, as > long as a lock against vacuum is taken (which I think this should do > anyway). Done. 
>> +/* >> + * check_tuple >> + * >> + * Checks the current tuple as tracked in ctx for corruption. Records any >> + * corruption found in ctx->corruption. >> + * >> + * The caller should have iterated to a tuple via pageTupleIteration_next. >> + */ >> +void >> +check_tuple(HeapCheckContext * ctx) >> +{ >> + bool fatal = false; > > Wait, aren't some checks here duplicate with ones in > HeapTupleIsVisible()? Yeah, there was some overlap. That should be better now. >> + /* Check relminmxid against mxid, if any */ >> + if (ctx->infomask & HEAP_XMAX_IS_MULTI && >> + MultiXactIdPrecedes(ctx->xmax, ctx->relminmxid)) >> + { >> + record_corruption(ctx, psprintf("tuple xmax = %u precedes relation " >> + "relminmxid = %u", >> + ctx->xmax, ctx->relminmxid)); >> + } > > It's pretty weird that the routines here access xmin/xmax/... via > HeapCheckContext, but HeapTupleIsVisible() doesn't. Fair point. HeapCheckContext no longer has fields for xmin/xmax after the refactoring. >> + /* Check xmin against relfrozenxid */ >> + if (TransactionIdIsNormal(ctx->relfrozenxid) && >> + TransactionIdIsNormal(ctx->xmin) && >> + TransactionIdPrecedes(ctx->xmin, ctx->relfrozenxid)) >> + { >> + record_corruption(ctx, psprintf("tuple xmin = %u precedes relation " >> + "relfrozenxid = %u", >> + ctx->xmin, ctx->relfrozenxid)); >> + } >> + >> + /* Check xmax against relfrozenxid */ >> + if (TransactionIdIsNormal(ctx->relfrozenxid) && >> + TransactionIdIsNormal(ctx->xmax) && >> + TransactionIdPrecedes(ctx->xmax, ctx->relfrozenxid)) >> + { >> + record_corruption(ctx, psprintf("tuple xmax = %u precedes relation " >> + "relfrozenxid = %u", >> + ctx->xmax, ctx->relfrozenxid)); >> + } > > these all should be fatal. You definitely cannot just continue > afterwards given the justification below: They are now fatal. >> + /* >> + * Iterate over the attributes looking for broken toast values. This >> + * roughly follows the logic of heap_deform_tuple, except that it doesn't >> + * bother building up isnull[] and values[] arrays, since nobody wants >> + * them, and it unrolls anything that might trip over an Assert when >> + * processing corrupt data. >> + */ >> + beginTupleAttributeIteration(ctx); >> + while (tupleAttributeIteration_next(ctx) && >> + check_tuple_attribute(ctx)) >> + ; >> + endTupleAttributeIteration(ctx); >> +} > > I really don't find these helpers helpful. Removed. >> +/* >> + * check_relation >> + * >> + * Checks the relation given by relid for corruption, returning a list of all >> + * it finds. >> + * >> + * The caller should set up the memory context as desired before calling. >> + * The returned list belongs to the caller. >> + */ >> +List * >> +check_relation(Oid relid) >> +{ >> + HeapCheckContext ctx; >> + >> + memset(&ctx, 0, sizeof(HeapCheckContext)); >> + >> + /* Open the relation */ >> + ctx.relid = relid; >> + ctx.corruption = NIL; >> + ctx.rel = relation_open(relid, AccessShareLock); > > I think you need to protect at least against concurrent schema changes > given some of your checks. But I think it'd be better to also conflict > with vacuum here. The relation is now opened with ShareUpdateExclusiveLock. > >> + check_relation_relkind(ctx.rel); > > I think you also need to ensure that the table is actually using heap > AM, not another tableam. Oh - you're doing that inside the check. But > that's confusing, because that's not 'relkind'. It is checking both relkind and relam. The function has been renamed to reflect that. 
>> + ctx.relDesc = RelationGetDescr(ctx.rel); >> + ctx.rel_natts = RelationGetDescr(ctx.rel)->natts; >> + ctx.relfrozenxid = ctx.rel->rd_rel->relfrozenxid; >> + ctx.relminmxid = ctx.rel->rd_rel->relminmxid; > > three naming schemes in three lines... Fixed. >> + /* check all blocks of the relation */ >> + beginRelBlockIteration(&ctx); >> + while (relBlockIteration_next(&ctx)) >> + { >> + /* Perform tuple checks */ >> + beginPageTupleIteration(&ctx); >> + while (pageTupleIteration_next(&ctx)) >> + check_tuple(&ctx); >> + endPageTupleIteration(&ctx); >> + } >> + endRelBlockIteration(&ctx); > > I again do not find this helper stuff helpful. Removed. >> + /* Close the associated toast table and indexes, if any. */ >> + if (ctx.has_toastrel) >> + { >> + toast_close_indexes(ctx.toast_indexes, ctx.num_toast_indexes, >> + AccessShareLock); >> + table_close(ctx.toastrel, AccessShareLock); >> + } >> + >> + /* Close the main relation */ >> + relation_close(ctx.rel, AccessShareLock); > > Why the closing here? As opposed to where...? It seems fairly standard to close the relation in the function where it was opened. Do you prefer that the relation not be closed? Or that it be closed but the lock retained? >> +# This regression test demonstrates that the heapcheck_relation() function >> +# supplied with this contrib module correctly identifies specific kinds of >> +# corruption within pages. To test this, we need a mechanism to create corrupt >> +# pages with predictable, repeatable corruption. The postgres backend cannot be >> +# expected to help us with this, as its design is not consistent with the goal >> +# of intentionally corrupting pages. >> +# >> +# Instead, we create a table to corrupt, and with careful consideration of how >> +# postgresql lays out heap pages, we seek to offsets within the page and >> +# overwrite deliberately chosen bytes with specific values calculated to >> +# corrupt the page in expected ways. We then verify that heapcheck_relation >> +# reports the corruption, and that it runs without crashing. Note that the >> +# backend cannot simply be started to run queries against the corrupt table, as >> +# the backend will crash, at least for some of the corruption types we >> +# generate. >> +# >> +# Autovacuum potentially touching the table in the background makes the exact >> +# behavior of this test harder to reason about. We turn it off to keep things >> +# simpler. We use a "belt and suspenders" approach, turning it off for the >> +# system generally in postgresql.conf, and turning it off specifically for the >> +# test table. >> +# >> +# This test depends on the table being written to the heap file exactly as we >> +# expect it to be, so we take care to arrange the columns of the table, and >> +# insert rows of the table, that give predictable sizes and locations within >> +# the table page. > > I have a hard time believing this is going to be really > reliable. E.g. the alignment requirements will vary between platforms, > leading to different layouts. In particular, MAXALIGN differs between > platforms. > > Also, it's supported to compile postgres with a different pagesize. It's simple enough to extend the tap test a little to check for those things. In v3, the tap test skips tests if the page size is not 8k, and also if the tuples do not fall on the page where expected (which would happen due to alignment issues, gremlins, or whatever). There are other approaches, though. 
The HeapFile/HeapPage/HeapTuple perl modules recently submitted on another thread *could* be used here, but only if those modules are likely to be committed. This test *could* be extended to autodetect the page size and alignment issues and calculate at runtime where tuples will be on the page, but only if folks don't mind the test having that extra complexity in it. (There is a school of thought that regression tests should avoid excess complexity.) Do you have a recommendation about which way to go with this? Here is the work thus far: — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
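The xid-range caching described in the message above amounts to a cheap bounds test performed before any clog lookup. The following is a minimal sketch of that idea, not the patch's code: XidBoundsCache and xid_in_expected_range are hypothetical names, and the sketch assumes the relation is already locked against concurrent VACUUM so that relfrozenxid cannot move during the scan.

#include "postgres.h"
#include "access/transam.h"

/*
 * Sketch of the xid bounds check discussed above.  Names here are
 * illustrative only, not the identifiers used in the patch.
 */
typedef struct XidBoundsCache
{
	TransactionId oldest_xid;	/* relfrozenxid; stable while vacuum is locked out */
	TransactionId next_xid;	/* cached copy of the next xid to be assigned */
} XidBoundsCache;

static bool
xid_in_expected_range(XidBoundsCache *cache, TransactionId xid)
{
	if (!TransactionIdIsNormal(xid))
		return false;			/* bootstrap/frozen/invalid handled by caller */

	if (TransactionIdPrecedes(xid, cache->oldest_xid))
		return false;			/* precedes relfrozenxid: report as corruption */

	if (TransactionIdFollowsOrEquals(xid, cache->next_xid))
	{
		/* Newer than the cached ceiling: refresh it once and re-test. */
		cache->next_xid = XidFromFullTransactionId(ReadNextFullTransactionId());
		if (TransactionIdFollowsOrEquals(xid, cache->next_xid))
			return false;		/* still in the future: report as corruption */
	}

	return true;				/* safe to pass to TransactionIdDidCommit() etc. */
}

Only xids that pass such a test are handed to clog or pg_subtrans lookups, which is what keeps the checker from erroring or aborting on truncated SLRU data.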
>> I wonder if a mode where heapcheck optionally would only check >> non-frozen (perhaps also non-all-visible) regions of a table would be a >> good idea? Version 4 of this patch now includes boolean options skip_all_frozen and skip_all_visible. >> Would make it a lot more viable to run this regularly on >> bigger databases. Even if there's a window to not check some data >> (because it's frozen before the next heapcheck run). Do you think it would make sense to have the amcheck contrib module have, in addition to the SQL queryable functions, a bgworker-based mode that periodically checks your database? The work along those lines is not included in v4, but if it were part of v5, would you have specific design preferences? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
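For readers unfamiliar with how cheaply such skip options can be honored: the visibility map already records an all-visible and an all-frozen bit for each heap page, so the per-block loop only needs a test like the sketch below. The helper function and its signature are made up for illustration, though VM_ALL_FROZEN and VM_ALL_VISIBLE are the existing macros from visibilitymap.h.

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Hypothetical helper: decide whether a block needs checking, given the
 * skip_all_frozen / skip_all_visible options discussed above.
 */
static bool
should_check_block(Relation rel, BlockNumber blkno, Buffer *vmbuffer,
				   bool skip_all_frozen, bool skip_all_visible)
{
	if (skip_all_frozen && VM_ALL_FROZEN(rel, blkno, vmbuffer))
		return false;		/* page is marked all-frozen in the VM */
	if (skip_all_visible && VM_ALL_VISIBLE(rel, blkno, vmbuffer))
		return false;		/* page is marked all-visible in the VM */
	return true;
}

Since a page marked all-frozen is normally also marked all-visible, skipping all-visible pages is the more aggressive of the two settings.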
On Wed, Apr 29, 2020 at 12:30 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > Do you think it would make sense to have the amcheck contrib module have, in addition to the SQL queryable functions, a bgworker-based mode that periodically checks your database? The work along those lines is not included in v4, but if it were part of v5, would you have specific design preferences? -1 on that idea from me. That sounds like it's basically building "cron" into PostgreSQL, but in a way that can only be used by amcheck. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 22, 2020 at 10:43 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > It's simple enough to extend the tap test a little to check for those things. In v3, the tap test skips tests if the page size is not 8k, and also if the tuples do not fall on the page where expected (which would happen due to alignment issues, gremlins, or whatever). Skipping the test if the tuple isn't in the expected location sounds really bad. That will just lead to the tests passing without actually doing anything. If the tuple isn't in the expected location, the tests should fail. > There are other approaches, though. The HeapFile/HeapPage/HeapTuple perl modules recently submitted on another thread *could* be used here, but only if those modules are likely to be committed. Yeah, I don't know if we want that stuff or not. > This test *could* be extended to autodetect the page size and alignment issues and calculate at runtime where tuples will be on the page, but only if folks don't mind the test having that extra complexity in it. (There is a school of thought that regression tests should avoid excess complexity.) Do you have a recommendation about which way to go with this? How much extra complexity are we talking about? It feels to me like for a heap page, the only things that are going to affect the position of the tuples on the page -- supposing we know the tuple size -- are the page size and, I think, MAXALIGN, and that doesn't sound too bad. Another possibility is to use pageinspect's heap_page_items() to determine the position within the page (lp_off), which seems like it might simplify things considerably. Then, we're entirely relying on the backend to tell us where the tuples are, and we only need to worry about the offsets relative to the start of the tuple. I kind of like that approach, because it doesn't involve having Perl code that knows how heap pages are laid out; we rely entirely on the C code for that. I'm not sure if it'd be a problem to have a TAP test for one contrib module that uses another contrib module, but maybe there's some way to figure that problem out. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Apr 29, 2020 at 12:30 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > Version 4 of this patch now includes boolean options skip_all_frozen and skip_all_visible. I'm not sure, but maybe there should just be one argument with three possible values, because skip_all_frozen = true and skip_all_visible = false seems nonsensical. On the other hand, if we used a text argument with three possible values, I'm not sure what we'd call the argument or what strings we'd use as the values. Also, what do people -- either those who have already responded, or others -- think about the idea of putting a command-line tool around this? I know that there were some rumblings about this in respect to pg_verifybackup, but I think a pg_amcheck binary would be well-received. It could do some interesting things, too. For instance, it could query pg_class for a list of relations that amcheck would know how to check, and then issue a separate query for each relation, which would avoid holding a snapshot or heavyweight locks across the whole operation. It could do parallelism across relations by opening multiple connections, or even within a single relation if -- as I think would be a good idea -- we extended heapcheck to take a range of block numbers after the style of pg_prewarm. Apart from allowing for client-driven parallelism, accepting block number ranges would have the advantage -- IMHO pretty significant -- of making it far easier to use this on a relation where some blocks are entirely unreadable. You could specify ranges to check out the remaining blocks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
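As a rough illustration of the driving loop such a binary could use, written against libpq: this is not the actual pg_amcheck implementation; the catalog query, the verify_heapam() call shape, and all names here are placeholders.

#include <stdio.h>
#include <libpq-fe.h>

/*
 * Sketch of a pg_amcheck-style driver: enumerate candidate heap relations,
 * then run one short query per relation so no snapshot or heavyweight lock
 * is held across the whole run.  Illustrative only.
 */
int
main(void)
{
	PGconn	   *conn = PQconnectdb("");		/* rely on libpq defaults/env */
	PGresult   *rels;
	int			i;

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	/* Hypothetical selection query: ordinary tables using the heap AM. */
	rels = PQexec(conn,
				  "SELECT c.oid::regclass::text FROM pg_class c "
				  "WHERE c.relkind = 'r' AND c.relam = 2");
	if (PQresultStatus(rels) != PGRES_TUPLES_OK)
	{
		fprintf(stderr, "catalog query failed: %s", PQerrorMessage(conn));
		return 1;
	}

	for (i = 0; i < PQntuples(rels); i++)
	{
		const char *relname = PQgetvalue(rels, i, 0);
		const char *params[1] = {relname};
		PGresult   *res;

		/* One separate query per relation; placeholder function call. */
		res = PQexecParams(conn,
						   "SELECT * FROM verify_heapam($1::regclass)",
						   1, NULL, params, NULL, NULL, 0);
		if (PQresultStatus(res) == PGRES_TUPLES_OK)
			printf("%s: %d corruption report(s)\n", relname, PQntuples(res));
		else
			fprintf(stderr, "%s: %s", relname, PQerrorMessage(conn));
		PQclear(res);
	}

	PQclear(rels);
	PQfinish(conn);
	return 0;
}

Parallelism across relations would simply mean opening several such connections and handing each one a slice of the relation list; a block-range argument to the checking function would allow the same division of labor within a single large relation.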
> On Apr 29, 2020, at 11:41 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Wed, Apr 22, 2020 at 10:43 PM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >> It's simple enough to extend the tap test a little to check for those things. In v3, the tap test skips tests if the page size is not 8k, and also if the tuples do not fall on the page where expected (which would happen due to alignment issues, gremlins, or whatever). > > Skipping the test if the tuple isn't in the expected location sounds > really bad. That will just lead to the tests passing without actually > doing anything. If the tuple isn't in the expected location, the tests > should fail. > >> There are other approaches, though. The HeapFile/HeapPage/HeapTuple perl modules recently submitted on another thread *could* be used here, but only if those modules are likely to be committed. > > Yeah, I don't know if we want that stuff or not. > >> This test *could* be extended to autodetect the page size and alignment issues and calculate at runtime where tuples will be on the page, but only if folks don't mind the test having that extra complexity in it. (There is a school of thought that regression tests should avoid excess complexity.) Do you have a recommendation about which way to go with this? > > How much extra complexity are we talking about? The page size is easy to query, and the test already does so, skipping if the answer isn't 8k. The test could recalculate offsets based on the page size rather than skipping the test easily enough, but the MAXALIGN stuff is a little harder. I don't know (perhaps someone would share?) how to easily query that from within a perl test. So the test could guess all possible alignments that occur in the real world, read from the page at the offset that alignment would create, and check if the expected datum is there. The test would have to be careful to avoid false positives, by placing data before and after the datum being checked with bit patterns that cannot be misinterpreted as a match. That level of complexity seems unappealing, at least to me. It's not hard to write, but maintaining stuff like that is an unwelcome burden. > It feels to me like > for a heap page, the only things that are going to affect the position > of the tuples on the page -- supposing we know the tuple size -- are > the page size and, I think, MAXALIGN, and that doesn't sound too bad. > Another possibility is to use pageinspect's heap_page_items() to > determine the position within the page (lp_off), which seems like it > might simplify things considerably. Then, we're entirely relying on > the backend to tell us where the tuples are, and we only need to worry > about the offsets relative to the start of the tuple. > > I kind of like that approach, because it doesn't involve having Perl > code that knows how heap pages are laid out; we rely entirely on the C > code for that. I'm not sure if it'd be a problem to have a TAP test > for one contrib module that uses another contrib module, but maybe > there's some way to figure that problem out. Yeah, I'll give this a try. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Here is v5 of the patch. Major changes in this version include:

1) A new module, pg_amcheck, which includes a command line client for checking a database or subset of a database. Internally it functions by querying the database for a list of tables which are appropriate given the command line switches, and then calls amcheck's functions to validate each table and/or index. The options for selecting/excluding tables and schemas are patterned on pg_dump, on the assumption that interface is already familiar to users.

2) amcheck's btree checking functions have been refactored to be able to operate in two modes; the original mode in which all errors are reported via ereport, and a new mode for returning errors as rows from a set returning function. The new mode is used by a new function verify_btreeam(), analogous to verify_heapam(), both of which are used by the pg_amcheck command line tool.

3) The regression test which generates corruption within a table uses the pageinspect module to determine the location of each tuple on disk for corrupting. This was suggested upthread.

Testing on the command line shows that the pre-existing btree checking code could use some hardening, as it currently crashes the backend on certain corruptions. When I corrupt relation files for tables and indexes in the backend and then use pg_amcheck to check all objects in the database, I keep getting assertions from the btree checking code. I think I need to harden this code, but wanted to post an updated patch and solicit opinions before doing so. Here are some example problems I'm seeing. Note the stack trace when calling from the command line tool includes the new verify_btreeam function, but you can get the same crashes using the old interface via psql:

From psql, first error:

test=# select bt_index_parent_check('corrupted_idx', true, true);
TRAP: FailedAssertion("_bt_check_natts(rel, key->heapkeyspace, page, offnum)", File: "nbtsearch.c", Line: 663)
0 postgres 0x0000000106872977 ExceptionalCondition + 103
1 postgres 0x00000001063a33e2 _bt_compare + 1090
2 amcheck.so 0x0000000106d62921 bt_target_page_check + 6033
3 amcheck.so 0x0000000106d5fd2f bt_index_check_internal + 2847
4 amcheck.so 0x0000000106d60433 bt_index_parent_check + 67
5 postgres 0x00000001064d6762 ExecInterpExpr + 1634
6 postgres 0x000000010650d071 ExecResult + 321
7 postgres 0x00000001064ddc3d standard_ExecutorRun + 301
8 postgres 0x00000001066600c5 PortalRunSelect + 389
9 postgres 0x000000010665fc7f PortalRun + 527
10 postgres 0x000000010665ed59 exec_simple_query + 1641
11 postgres 0x000000010665c99d PostgresMain + 3661
12 postgres 0x00000001065d6a8a BackendRun + 410
13 postgres 0x00000001065d61c4 ServerLoop + 3044
14 postgres 0x00000001065d2fe9 PostmasterMain + 3769
15 postgres 0x000000010652e3b0 help + 0
16 libdyld.dylib 0x00007fff6725fcc9 start + 1
server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
The connection to the server was lost. 
Attempting reset: 2020-05-11 10:11:47.394 PDT [41091] LOG: server process (PID 41309) was terminated by signal 6: Abort trap: 6

From command line, second error:

pgtest % pg_amcheck -i test
(relname=corrupted,blkno=0,offnum=16,lp_off=7680,lp_flags=1,lp_len=31,attnum=,chunk=) tuple xmin = 3289393 is in the future
(relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=) tuple xmax = 0 precedes relation relminmxid = 1
(relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=) tuple xmin = 12593 is in the future
(relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=) <snip>
(relname=corrupted,blkno=107,offnum=20,lp_off=7392,lp_flags=1,lp_len=34,attnum=,chunk=) tuple xmin = 306 precedes relation relfrozenxid = 487
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=) tuple xmax = 0 precedes relation relminmxid = 1
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=) tuple xmin = 305 precedes relation relfrozenxid = 487
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=) t_hoff > lp_len (54 > 34)
(relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=) t_hoff not max-aligned (54)
TRAP: FailedAssertion("TransactionIdIsValid(xmax)", File: "heapam_visibility.c", Line: 1319)
0 postgres 0x0000000105b22977 ExceptionalCondition + 103
1 postgres 0x0000000105636e86 HeapTupleSatisfiesVacuum + 1158
2 postgres 0x0000000105634aa1 heapam_index_build_range_scan + 1089
3 amcheck.so 0x00000001060100f3 bt_index_check_internal + 3811
4 amcheck.so 0x000000010601057c verify_btreeam + 316
5 postgres 0x0000000105796266 ExecMakeTableFunctionResult + 422
6 postgres 0x00000001057a8c35 FunctionNext + 101
7 postgres 0x00000001057bbf3e ExecNestLoop + 478
8 postgres 0x000000010578dc3d standard_ExecutorRun + 301
9 postgres 0x00000001059100c5 PortalRunSelect + 389
10 postgres 0x000000010590fc7f PortalRun + 527
11 postgres 0x000000010590ed59 exec_simple_query + 1641
12 postgres 0x000000010590c99d PostgresMain + 3661
13 postgres 0x0000000105886a8a BackendRun + 410
14 postgres 0x00000001058861c4 ServerLoop + 3044
15 postgres 0x0000000105882fe9 PostmasterMain + 3769
16 postgres 0x00000001057de3b0 help + 0
17 libdyld.dylib 0x00007fff6725fcc9 start + 1
pg_amcheck: error: query failed: server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.

— Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Mon, May 11, 2020 at 10:21 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > 2) amcheck's btree checking functions have been refactored to be able to operate in two modes; the original mode in which all errors are reported via ereport, and a new mode for returning errors as rows from a set returning function. Somebody suggested that I make amcheck work in this way during its initial development. I rejected that idea at the time, though. It seems hard to make it work because the B-Tree index scan is a logical order index scan. It's quite possible that a corrupt index will have circular sibling links, and things like that. Making everything an error removes that concern. There are clearly some failures that we could just soldier on from, but the distinction gets rather blurred. I understand why you want to do it this way. It makes sense that the heap stuff would report all inconsistencies together, at the end. I don't think that that's really workable (or even desirable) in the case of B-Tree indexes, though. When an index is corrupt, the solution is always to do root cause analysis, to make sure that the issue does not recur, and then to REINDEX. There isn't really a question about doing data recovery of the index structure. Would it be possible to log the first B-Tree inconsistency, and then move on to the next high-level phase of verification? You don't have to throw an error, but it seems like a good idea for amcheck to still give up on further verification of the index. The assertion failure that you reported happens because of a generic assertion made from _bt_compare(). It doesn't have anything to do with amcheck (you'll see the same thing from regular index scans), really. I think that removing that assertion would be the opposite of hardening. Even if you removed it, the backend will still crash once you come up with a slightly more evil index tuple. Maybe *that* could be mostly avoided with widespread hardening; we could in principle perform cross-checks of varlena headers against the tuple or page layout at any point reachable from _bt_compare(). That seems like something that would have unacceptable overhead, because the cost would be imposed on everything. And even then you've only ameliorated the problem. Code like amcheck's PageGetItemIdCareful() goes further than the equivalent backend macro (PageGetItemId()) to avoid assertion failures and crashes with corrupt data. I doubt that it is practical to take it much further than that, though. It's subject to diminishing returns. In general, _bt_compare() calls user-defined code that is usually written in C. This C code could in principle feel entitled to do any number of scary things when you corrupt the input data. The amcheck module's dependency on user-defined operator code is totally unavoidable -- it is the single source of truth for the nbtree checks. It boils down to this: I think that regression tests that run on the buildfarm and actually corrupt data are not practical, at least in the case of the index checks -- though probably in all cases. Look at the pageinspect "btree.out" test output file -- it's very limited, because we have to work around a bunch of implementation details. It's no accident that the bt_page_items() test shows a palindrome value in the data column (the value is "01 00 00 00 00 00 00 01"). That's an endianness workaround. -- Peter Geoghegan
> On May 12, 2020, at 5:34 PM, Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, May 11, 2020 at 10:21 AM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >> 2) amcheck's btree checking functions have been refactored to be able to operate in two modes; the original mode in which all errors are reported via ereport, and a new mode for returning errors as rows from a set returning function. Thank you yet again for reviewing. I really appreciate the feedback! > Somebody suggested that I make amcheck work in this way during its > initial development. I rejected that idea at the time, though. It > seems hard to make it work because the B-Tree index scan is a logical > order index scan. It's quite possible that a corrupt index will have > circular sibling links, and things like that. Making everything an > error removes that concern. There are clearly some failures that we > could just soldier on from, but the distinction gets rather blurred. Ok, I take your point that the code cannot soldier on after the first error is returned. I'll change that for v6 of the patch, moving on to the next relation after hitting the first corruption in any particular index. Do you mind that I refactored the code to return the error rather than ereporting? If it offends your sensibilities, I could rip that back out, at the expense of having to use try/catch logic in some other places. I prefer to avoid the try/catch stuff, but I'm not going to put up a huge fuss. > I understand why you want to do it this way. It makes sense that the > heap stuff would report all inconsistencies together, at the end. I > don't think that that's really workable (or even desirable) in the > case of B-Tree indexes, though. When an index is corrupt, the solution > is always to do root cause analysis, to make sure that the issue does > not recur, and then to REINDEX. There isn't really a question about > doing data recovery of the index structure. Yes, I agree that reindexing is the most sensible remedy. I certainly have no plans to implement some pg_fsck_index type tool. Even for tables, I'm not interested in creating such a tool. I just want a good tool for finding out what the nature of the corruption is, as that might make it easier to debug what went wrong. It's not just for debugging production systems, but also for chasing down problems in half-baked code prior to release. > Would it be possible to log the first B-Tree inconsistency, and then > move on to the next high-level phase of verification? You don't have > to throw an error, but it seems like a good idea for amcheck to still > give up on further verification of the index. Ok, good, it sounds like we're converging on the same idea. I'm happy to do so. > The assertion failure that you reported happens because of a generic > assertion made from _bt_compare(). It doesn't have anything to do with > amcheck (you'll see the same thing from regular index scans), really. Oh, I know that already. I could see that easily enough in the backtrace. But if you look at the way I implemented verify_heapam, you might notice this: /* * check_tuphdr_xids * * Determine whether tuples are visible for verification. Similar to * HeapTupleSatisfiesVacuum, but with critical differences. * * 1) Does not touch hint bits. It seems imprudent to write hint bits * to a table during a corruption check. * 2) Only makes a boolean determination of whether verification should * see the tuple, rather than doing extra work for vacuum-related * categorization. 
* * The caller should already have checked that xmin and xmax are not out of * bounds for the relation. */ The point is that when checking the table for corruption I avoid calling anything that might assert (or segfault, or whatever). I was talking about refactoring the btree checking code to be similarly careful. > I think that removing that assertion would be the opposite of > hardening. Even if you removed it, the backend will still crash once > you come up with a slightly more evil index tuple. Maybe *that* could > be mostly avoided with widespread hardening; we could in principle > perform cross-checks of varlena headers against the tuple or page > layout at any point reachable from _bt_compare(). That seems like > something that would have unacceptable overhead, because the cost > would be imposed on everything. And even then you've only ameliorated > the problem. I think we may have different mental models of how this all works in practice. I am (or was) envisioning that the backend, during regular table and index scans, cannot afford to check for corruption at all steps along the way, and therefore does not, but that a corruption checking tool has a fundamentally different purpose, and can and should choose to operate in a way that won't blow up when checking a corrupt relation. It's the difference between a car designed to drive down the highway at high speed vs. a military vehicle designed to drive over a minefield with a guy on the front bumper scanning for landmines, the whole while going half a mile an hour. I'm starting to infer from your comments that you see the landmine detection vehicle as also driving at high speed, detecting landmines on occasion by seeing them first, but frequently by failing to see them and just blowing up. > Code like amcheck's PageGetItemIdCareful() goes further than the > equivalent backend macro (PageGetItemId()) to avoid assertion failures > and crashes with corrupt data. I doubt that it is practical to take it > much further than that, though. It's subject to diminishing returns. Ok. > In general, _bt_compare() calls user-defined code that is usually > written in C. This C code could in principle feel entitled to do any > number of scary things when you corrupt the input data. The amcheck > module's dependency on user-defined operator code is totally > unavoidable -- it is the single source of truth for the nbtree checks. I don't really understand this argument, since users with buggy user defined operators are not the target audience, but I also don't think there is any point in arguing it, since I'm already resolved to take your advice about not hardening the btree stuff any further. > It boils down to this: I think that regression tests that run on the > buildfarm and actually corrupt data are not practical, at least in the > case of the index checks -- though probably in all cases. Look at the > pageinspect "btree.out" test output file -- it's very limited, because > we have to work around a bunch of implementation details. It's no > accident that the bt_page_items() test shows a palindrome value in the > data column (the value is "01 00 00 00 00 00 00 01"). That's an > endianness workaround. One of the delays in submitting the most recent version of the patch is that I was having trouble creating a reliable, portable btree corrupting regression test. 
Ultimately, I submitted v5 without any btree corrupting regression test, as it proved pretty difficult to write one good enough for submission, and I had already put a couple more days into developing v5 than I had intended. So I can't argue too much with your point here. I did however address (some?) issues that you and others mentioned about the table corrupting regression test. Perhaps there are remaining issues that will show up on machines with different endianness than I have thus far tested, but I don't see that they will be insurmountable. Are you fundamentally opposed to that test framework? If you're going to vote against committing the patch with that test, I'll back down and just remove it from the patch, but it doesn't seem like a bad regression test to me. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, May 12, 2020 at 7:07 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > Thank you yet again for reviewing. I really appreciate the feedback! Happy to help. It's important work. > Ok, I take your point that the code cannot soldier on after the first error is returned. I'll change that for v6 of the patch, moving on to the next relation after hitting the first corruption in any particular index. Do you mind that I refactored the code to return the error rather than ereporting? try/catch seems like the way to do it. Not all amcheck errors come from amcheck -- some are things that the backend code does, that are known to appear in amcheck from time to time. I'm thinking in particular of the table_index_build_scan()/heapam_index_build_range_scan() errors, as well as the errors from _bt_checkpage(). > Yes, I agree that reindexing is the most sensible remedy. I certainly have no plans to implement some pg_fsck_index type tool. Even for tables, I'm not interested in creating such a tool. I just want a good tool for finding out what the nature of the corruption is, as that might make it easier to debug what went wrong. It's not just for debugging production systems, but also for chasing down problems in half-baked code prior to release. All good goals. > * check_tuphdr_xids > The point is that when checking the table for corruption I avoid calling anything that might assert (or segfault, or whatever). I don't think that you can expect to avoid assertion failures in general. I'll stick with your example. You're calling TransactionIdDidCommit() from check_tuphdr_xids(), which will interrogate the commit log and pg_subtrans. It's just not under your control. I'm sure that you could get an assertion failure somewhere in there, and even if you couldn't that could change at any time. You've quasi-duplicated some sensitive code to do that much, which seems excessive. But it's also not enough. > I'm starting to infer from your comments that you see the landmine detection vehicle as also driving at high speed, detecting landmines on occasion by seeing them first, but frequently by failing to see them and just blowing up. That's not it. I would certainly prefer if the landmine detector didn't blow up. Not having that happen is certainly a goal I share -- that's why PageGetItemIdCareful() exists. But not at any cost, especially not when "blow up" means an assertion failure that users won't actually see in production. Avoiding assertion failures like the one you showed is likely to have a high cost (removing defensive asserts in low level access method code) for a low benefit. Any attempt to avoid having the checker itself blow up rather than throw an error message needs to be assessed pragmatically, on a case-by-case basis. > One of the delays in submitting the most recent version of the patch is that I was having trouble creating a reliable, portable btree corrupting regression test. To be clear, I think that corrupting data is very helpful with ad-hoc testing during development. > I did however address (some?) issues that you and others mentioned about the table corrupting regression test. Perhaps there are remaining issues that will show up on machines with different endianness than I have thus far tested, but I don't see that they will be insurmountable. Are you fundamentally opposed to that test framework? I haven't thought about it enough just yet, but I am certainly suspicious of it. -- Peter Geoghegan
On Tue, May 12, 2020 at 11:06 PM Peter Geoghegan <pg@bowt.ie> wrote: > try/catch seems like the way to do it. Not all amcheck errors come > from amcheck -- some are things that the backend code does, that are > known to appear in amcheck from time to time. I'm thinking in > particular of the > table_index_build_scan()/heapam_index_build_range_scan() errors, as > well as the errors from _bt_checkpage(). That would require the use of a subtransaction. > You've quasi-duplicated some sensitive code to do that much, which > seems excessive. But it's also not enough. I think this is a good summary of the problems in this area. On the one hand, I think it's hideous that we sanity check user input to death, but blindly trust the bytes on disk to the point of seg faulting if they're wrong. The idea that int4 + int4 has to have overflow checking because otherwise a user might be sad when they get a negative result from adding two negative numbers, while at the same time supposing that the same user will be unwilling to accept the performance hit to avoid crashing if they have a bad tuple, is quite suspect in my mind. The overflow checking is also expensive, but we do it because it's the right thing to do, and then we try to minimize the overhead. It is unclear to me why we shouldn't also take that approach with bytes that come from disk. In particular, using Assert() checks for such things instead of elog() is basically Assert(there is no such thing as a corrupted database). On the other hand, that problem is clearly way above this patch's pay grade. There's a lot of stuff all over the code base that would have to be changed to fix it. It can't be done as an incidental thing as part of this patch or any other. It's a massive effort unto itself. We need to somehow draw a clean line between what this patch does and what it does not do, such that the scope of this patch remains something achievable. Otherwise, we'll end up with nothing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
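For readers following the try/catch exchange: in backend code, catching an ereport()ed ERROR and then continuing requires rolling back a subtransaction, not just wrapping the call in PG_TRY/PG_CATCH, which is the cost Robert alludes to above. A hypothetical sketch of that pattern follows; check_one_thing() and report_failure() are placeholders, and this is not code from the patch.

#include "postgres.h"
#include "access/xact.h"
#include "utils/memutils.h"
#include "utils/resowner.h"

extern void check_one_thing(void);				/* placeholder */
extern void report_failure(const char *msg);	/* placeholder */

/*
 * Run one verification step so that an ERROR thrown by backend code
 * (e.g. _bt_checkpage()) is converted into a reported failure instead of
 * aborting the caller.  Illustrative sketch only.
 */
static void
run_check_step_safely(void)
{
	MemoryContext oldcontext = CurrentMemoryContext;
	ResourceOwner oldowner = CurrentResourceOwner;

	BeginInternalSubTransaction(NULL);
	MemoryContextSwitchTo(oldcontext);

	PG_TRY();
	{
		check_one_thing();
		ReleaseCurrentSubTransaction();
		MemoryContextSwitchTo(oldcontext);
		CurrentResourceOwner = oldowner;
	}
	PG_CATCH();
	{
		ErrorData  *edata;

		/* Copy the error before cleaning up the failed subtransaction. */
		MemoryContextSwitchTo(oldcontext);
		edata = CopyErrorData();
		FlushErrorState();

		RollbackAndReleaseCurrentSubTransaction();
		MemoryContextSwitchTo(oldcontext);
		CurrentResourceOwner = oldowner;

		report_failure(edata->message);
		FreeErrorData(edata);
	}
	PG_END_TRY();
}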
On Wed, May 13, 2020 at 12:22 PM Robert Haas <robertmhaas@gmail.com> wrote: > I think this is a good summary of the problems in this area. On the > one hand, I think it's hideous that we sanity check user input to > death, but blindly trust the bytes on disk to the point of seg > faulting if they're wrong. The idea that int4 + int4 has to have > overflow checking because otherwise a user might be sad when they get > a negative result from adding two negative numbers, while at the same > time supposing that the same user will be unwilling to accept the > performance hit to avoid crashing if they have a bad tuple, is quite > suspect in my mind. The overflow checking is also expensive, but we do > it because it's the right thing to do, and then we try to minimize the > overhead. It is unclear to me why we shouldn't also take that approach > with bytes that come from disk. In particular, using Assert() checks > for such things instead of elog() is basically Assert(there is no such > thing as a corrupted database). I think that it depends. It's nice to be able to add an Assert() without really having to worry about the overhead at all. I sometimes call relatively expensive functions in assertions. For example, there is an assert that calls _bt_compare() within _bt_check_unique() that I added at one point -- it caught a real bug a few weeks later. You could always be doing more. In general we don't exactly trust the bytes blindly. I've found that corrupting tuples in a creative way with pg_hexedit doesn't usually result in a segfault. Sometimes we'll do things like display NULL values when heap line pointers are corrupt, which isn't as good as an error but is still okay. We ought to protect against Murphy, not Machiavelli. ISTM that access method code naturally evolves towards avoiding the most disruptive errors in the event of real world corruption, in particular avoiding segfaulting. It's very hard to prove that, though. Do you recall seeing corruption resulting in segfaults in production? I personally don't recall seeing that. If it happened, the segfaults themselves probably wouldn't be the main concern. > On the other hand, that problem is clearly way above this patch's pay > grade. There's a lot of stuff all over the code base that would have > to be changed to fix it. It can't be done as an incidental thing as > part of this patch or any other. It's a massive effort unto itself. We > need to somehow draw a clean line between what this patch does and > what it does not do, such that the scope of this patch remains > something achievable. Otherwise, we'll end up with nothing. I can easily come up with an adversarial input that will segfault a backend, even amcheck, but it'll be somewhat contrived. It's hard to fool amcheck currently because it doesn't exactly trust line pointers. But I'm sure I could get the backend to segfault amcheck if I tried. I'd probably try to play around with varlena headers. It would require a certain amount of craftiness. It's not exactly clear where you draw the line here. And I don't think that the line will be very clearly defined, in the end. It'll be something that is subject to change over time, as new information comes to light. I think that it's necessary to accept a certain amount of ambiguity here. -- Peter Geoghegan
On 2020-May-12, Peter Geoghegan wrote: > > The point is that when checking the table for corruption I avoid > > calling anything that might assert (or segfault, or whatever). > > I don't think that you can expect to avoid assertion failures in > general. Hmm. I think we should (try to?) write code that avoids all crashes with production builds, but not extend that to assertion failures. Sticking again with the provided example, > I'll stick with your example. You're calling > TransactionIdDidCommit() from check_tuphdr_xids(), which will > interrogate the commit log and pg_subtrans. It's just not under your > control. in a production build this would just fail with an error that the pg_xact file cannot be found, which is fine -- if this happens in a production system, you're not disturbing any other sessions. Or maybe the file is there and the byte can be read, in which case you would get the correct response; but that's fine too. I don't know to what extent this is possible. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, May 13, 2020 at 3:10 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > Hmm. I think we should (try to?) write code that avoids all crashes > with production builds, but not extend that to assertion failures. Assertions are only a problem at all because Mark would like to write tests that involve a selection of truly corrupt data. That's a new requirement, and one that I have my doubts about. > > I'll stick with your example. You're calling > > TransactionIdDidCommit() from check_tuphdr_xids(), which will > > interrogate the commit log and pg_subtrans. It's just not under your > > control. > > in a production build this would just fail with an error that the > pg_xact file cannot be found, which is fine -- if this happens in a > production system, you're not disturbing any other sessions. Or maybe > the file is there and the byte can be read, in which case you would get > the correct response; but that's fine too. I think that this is fine, too, since I don't consider assertion failures with corrupt data all that important. I'd make some effort to avoid it, but not too much, and not at the expense of a useful general purpose assertion that could catch bugs in many different contexts. I would be willing to make a larger effort to avoid crashing a backend, since that affects production. I might go to some effort to not crash with downright adversarial inputs, for example. But it seems inappropriate to take extreme measures just to avoid a crash with extremely contrived inputs that will probably never occur. My sense is that this is subject to sharply diminishing returns. Completely nailing down hard crashes from corrupt data seems like the wrong priority, at the very least. Pursuing that objective over other objectives sounds like zero-risk bias. -- Peter Geoghegan
On 2020-May-13, Peter Geoghegan wrote: > On Wed, May 13, 2020 at 3:10 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > Hmm. I think we should (try to?) write code that avoids all crashes > > with production builds, but not extend that to assertion failures. > > Assertions are only a problem at all because Mark would like to write > tests that involve a selection of truly corrupt data. That's a new > requirement, and one that I have my doubts about. I agree that this (a test tool that exercises our code against arbitrarily corrupted data pages) is not going to work as a test that all buildfarm members run -- it seems something for specialized buildfarm members to run, or even something that's run outside of the buildfarm, like sqlsmith. Obviously such a tool would not be able to run against an assertion-enabled build, and we shouldn't even try. > I would be willing to make a larger effort to avoid crashing a > backend, since that affects production. I might go to some effort to > not crash with downright adversarial inputs, for example. But it seems > inappropriate to take extreme measures just to avoid a crash with > extremely contrived inputs that will probably never occur. My sense is > that this is subject to sharply diminishing returns. Completely > nailing down hard crashes from corrupt data seems like the wrong > priority, at the very least. Pursuing that objective over other > objectives sounds like zero-risk bias. I think my initial approach for this would be to use a fuzzing tool that generates data blocks semi-randomly, then uses them as Postgres data pages somehow, and see what happens -- examine any resulting crashes and make individual judgement calls about the fix(es) necessary to prevent each of them. I expect that many such pages would be rejected as corrupt by page header checks. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, May 13, 2020 at 4:32 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I think my initial approach for this would be to use a fuzzing tool that > generates data blocks semi-randomly, then uses them as Postgres data > pages somehow, and see what happens -- examine any resulting crashes and > make individual judgement calls about the fix(es) necessary to prevent > each of them. I expect that many such pages would be rejected as > corrupt by page header checks. As I mentioned in my response to Robert earlier, that's more or less been my experience with adversarial corruption generated using pg_hexedit. Within nbtree, as well as heapam. I put a lot of work into that tool, and have used it to simulate all kinds of weird scenarios. I've done things like corrupt individual tuple header fields, swap line pointers, create circular sibling links in indexes, corrupt varlena headers, and corrupt line pointer flags/status bits. Postgres itself rarely segfaults, and amcheck will only segfault with a truly contrived input. -- Peter Geoghegan
> On May 13, 2020, at 3:29 PM, Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, May 13, 2020 at 3:10 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> Hmm. I think we should (try to?) write code that avoids all crashes >> with production builds, but not extend that to assertion failures. > > Assertions are only a problem at all because Mark would like to write > tests that involve a selection of truly corrupt data. That's a new > requirement, and one that I have my doubts about. > >>> I'll stick with your example. You're calling >>> TransactionIdDidCommit() from check_tuphdr_xids(), which will >>> interrogate the commit log and pg_subtrans. It's just not under your >>> control. >> >> in a production build this would just fail with an error that the >> pg_xact file cannot be found, which is fine -- if this happens in a >> production system, you're not disturbing any other sessions. Or maybe >> the file is there and the byte can be read, in which case you would get >> the correct response; but that's fine too. > > I think that this is fine, too, since I don't consider assertion > failures with corrupt data all that important. I'd make some effort to > avoid it, but not too much, and not at the expense of a useful general > purpose assertion that could catch bugs in many different contexts. I am not removing any assertions. I do not propose to remove any assertions. When I talk about "hardening against assertions", that is not in any way a proposal to remove assertions from the code. What I'm talking about is writing the amcheck contrib module code in such a way that it only calls a function that could assert on bad data after checking that the data is not bad. I don't know that hardening against assertions in this manner is worth doing, but this is none the less what I'm talking about. You have made decent arguments that it probably isn't worth doing for the btree checking code. And in any event, it is probably something that could be addressed in a future patch after getting this patch committed. There is a separate but related question in the offing about whether the backend code, independently of any amcheck contrib stuff, should be more paranoid in how it processes tuples to check for corruption. The heap deform tuple code in question is on a pretty hot code path, and I don't know that folks would accept the performance hit of more checks being done in that part of the system, but that's pretty far from relevant to this patch. That should be hashed out, or not, at some other time on some other thread. > I would be willing to make a larger effort to avoid crashing a > backend, since that affects production. I might go to some effort to > not crash with downright adversarial inputs, for example. But it seems > inappropriate to take extreme measures just to avoid a crash with > extremely contrived inputs that will probably never occur. I think this is a misrepresentation of the tests that I've been running. There are two kinds of tests that I have done: First, there is the regression test, t/004_verify_heapam.pl, which is obviously contrived. That was included in the regression test suite because it needed to be something other developers could read, verify, "yeah, I can see why that would be corruption, and would give an error message of the sort the test expects", and then could be run to verify that indeed that expected error message was generated.
The second kind of corruption test I have been running is nothing more than writing random nonsense into randomly chosen locations within heap files and then running verify_heapam against those heap relations. It's much more Murphy than Machiavelli when it's just generated by calling random(). When I initially did this kind of testing, the heapam checking code had lots of problems. Now it doesn't. There's very little contrived about that which I can see. It's the kind of corruption you'd expect from any number of faulty storage systems. The one "contrived" aspect of my testing in this regard is that the script I use to write random nonsense to random locations in heap files is smart enough not to write random junk to the page headers. This is because if I corrupt the page headers, the backend never even gets as far as running the verify_heapam functions, as the page cache rejects loading the page. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, May 13, 2020 at 5:18 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > I am not removing any assertions. I do not propose to remove any assertions. When I talk about "hardening against assertions", that is not in any way a proposal to remove assertions from the code. I'm sorry if I seemed to suggest that you wanted to remove assertions, rather than test more things earlier. I recognize that that could be a useful thing to do, both in general, and maybe even in the specific example you gave -- on general robustness grounds. At the same time, it's something that can only be taken so far. It's probably not going to make it practical to corrupt data in a regression test or tap test. > There is a separate but related question in the offing about whether the backend code, independently of any amcheck contrib stuff, should be more paranoid in how it processes tuples to check for corruption. I bet that there is something that we could do to be a bit more defensive. Of course, we do a certain amount of that on general robustness grounds already. A systematic review of that could be quite useful. But as you point out, it's not really in scope here. > > I would be willing to make a larger effort to avoid crashing a > > backend, since that affects production. I might go to some effort to > > not crash with downright adversarial inputs, for example. But it seems > > inappropriate to take extreme measures just to avoid a crash with > > extremely contrived inputs that will probably never occur. > > I think this is a misrepresentation of the tests that I've been running. I didn't actually mean it that way, but I can see how my words could reasonably be interpreted that way. I apologize. > There are two kinds of tests that I have done: > > First, there is the regression test, t/004_verify_heapam.pl, which is obviously contrived. That was included in the regression test suite because it needed to be something other developers could read, verify, "yeah, I can see why that would be corruption, and would give an error message of the sort the test expects", and then could be run to verify that indeed that expected error message was generated. I still don't think that this is necessary. It could work for one type of corruption, that happens to not have any of the problems, but just testing that one type of corruption seems rather arbitrary to me. > The second kind of corruption test I have been running is nothing more than writing random nonsense into randomly chosen locations within heap files and then running verify_heapam against those heap relations. It's much more Murphy than Machiavelli when it's just generated by calling random(). That sounds like a good initial test case, to guide your intuitions about how to make the feature robust. -- Peter Geoghegan
> On May 13, 2020, at 5:36 PM, Peter Geoghegan <pg@bowt.ie> wrote: > > On Wed, May 13, 2020 at 5:18 PM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >> I am not removing any assertions. I do not propose to remove any assertions. When I talk about "hardening against assertions", that is not in any way a proposal to remove assertions from the code. > > I'm sorry if I seemed to suggest that you wanted to remove assertions Not a problem at all. As always, I appreciate your involvement in this code and design review. >> I think this is a misrepresentation of the tests that I've been running. > > I didn't actually mean it that way, but I can see how my words could > reasonably be interpreted that way. I apologize. Again, no worries. >> There are two kinds of tests that I have done: >> >> First, there is the regression test, t/004_verify_heapam.pl, which is obviously contrived. That was included in the regression test suite because it needed to be something other developers could read, verify, "yeah, I can see why that would be corruption, and would give an error message of the sort the test expects", and then could be run to verify that indeed that expected error message was generated. > > I still don't think that this is necessary. It could work for one type > of corruption, that happens to not have any of the problems, but just > testing that one type of corruption seems rather arbitrary to me. As discussed with Robert off list, this probably doesn't matter. The patch can be committed with or without this particular TAP test. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, May 13, 2020 at 5:33 PM Peter Geoghegan <pg@bowt.ie> wrote: > Do you recall seeing corruption resulting in segfaults in production? I have seen that, I believe. I think it's more common to fail with errors about not being able to palloc > 1GB, not being able to look up an xid or mxid, etc., but I am pretty sure I've seen multiple cases involving seg faults, too. Unfortunately for my credibility, I can't remember the details right now. > I personally don't recall seeing that. If it happened, the segfaults > themselves probably wouldn't be the main concern. I don't really agree. Hypothetically speaking, suppose you corrupt your only copy of a critical table in such a way that every time you select from it, the system seg faults. A user in this situation might ask questions like: 1. How did my table get corrupted? 2. Why do I only have one copy of it? 3. How do I retrieve the non-corrupted portion of my data from that table and get back up and running? In the grand scheme of things, #1 and #2 are the most important questions, but when something like this actually happens, #3 tends to be the most urgent question, and it's a lot harder to get the uncorrupted data out if the system keeps crashing. Also, a seg fault tends to lead customers to think that the database has a bug, rather than that the database is corrupted. Slightly off-topic here, but I think our error reporting in this area is pretty lame. I've learned over the years that when a customer reports that they get a complaint about a too-large memory allocation every time they access a table, they've probably got a corrupted varlena header. However, that's extremely non-obvious to a typical user. We should try to report errors indicative of corruption in a way that gives the user some clue that corruption has happened. Peter made a stab at improving things there by adding errcode(ERRCODE_DATA_CORRUPTED) in a bunch of places, but a lot of users will never see the error code, only the message, and a lot of corruption still produces errors that weren't changed by that commit. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
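As a reference point for the reporting convention under discussion, the errcode/errmsg shape being asked for looks roughly like the example below. The function and message text are made up for illustration; this is not code from the patch or from the earlier commit.

#include "postgres.h"
#include "utils/rel.h"

/* Example only: reporting detected corruption with a corruption errcode. */
static void
report_corrupt_varlena(Relation rel, BlockNumber blkno, OffsetNumber offnum)
{
	ereport(ERROR,
			(errcode(ERRCODE_DATA_CORRUPTED),
			 errmsg("invalid varlena header in tuple (%u,%u) of relation \"%s\"",
					blkno, offnum, RelationGetRelationName(rel))));
}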
On Wed, May 13, 2020 at 7:32 PM Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > I agree that this (a test tool that exercises our code against > arbitrarily corrupted data pages) is not going to work as a test that > all buildfarm members run -- it seems something for specialized > buildfarm members to run, or even something that's run outside of the > buildfarm, like sqlsmith. Obviously such a tool would not be able to > run against an assertion-enabled build, and we shouldn't even try. I have a question about what you mean here by "arbitrarily." If you mean that we shouldn't have the buildfarm run the proposed heap corruption checker against heap pages full of randomly-generated garbage, I tend to agree. Such a test wouldn't be very stable and might fail in lots of low-probability ways that could require unreasonable effort to find and fix. If you mean that we shouldn't have the buildfarm run the proposed heap corruption checker against any corrupted heap pages at all, I tend to disagree. If we did that, then we'd basically be releasing a heap corruption checker with very limited test coverage. Like, we shouldn't only have negative test cases, where the absence of corruption produces no results. We should also have positive test cases, where the thing finds some problem... At least, that's what I think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, May 14, 2020 at 11:33 AM Robert Haas <robertmhaas@gmail.com> wrote: > I have seen that, I believe. I think it's more common to fail with > errors about not being able to palloc > 1GB, not being able to look up > an xid or mxid, etc., but I am pretty sure I've seen multiple cases > involving seg faults, too. Unfortunately for my credibility, I can't > remember the details right now. I believe you, both in general, and also because what you're saying here is plausible, even if it doesn't fit my own experience. Corruption is by its very nature exceptional. At least, if that isn't true then something must be seriously wrong, so the idea that it will be different in some way each time seems like a good working assumption. Your exceptional cases are not necessarily the same as mine, especially where hardware problems are concerned. On the other hand, it's also possible for corruption that originates from very different sources to exhibit the same basic inconsistencies and symptoms. I've noticed that SLRU corruption is often a leading indicator of general storage problems. The inconsistencies between certain SLRU state and the heap happen to be far easier to notice in practice, particularly when VACUUM runs. But it's not fundamentally different to inconsistencies from pages within one single main fork of some heap relation. > > I personally don't recall seeing that. If it happened, the segfaults > > themselves probably wouldn't be the main concern. > > I don't really agree. Hypothetically speaking, suppose you corrupt > your only copy of a critical table in such a way that every time you > select from it, the system seg faults. A user in this situation might > ask questions like: I agree that that could be a problem. But that's not what I've seen happen in production systems myself. Maybe there is some low hanging fruit here. Perhaps we can make the real PageGetItemId() a little closer to PageGetItemIdCareful() without noticeable overhead, as I suggested already. Are there any real generalizations that we can make about why backends segfault with corrupt data? Maybe there are. That seems important. > Slightly off-topic here, but I think our error reporting in this area > is pretty lame. I've learned over the years that when a customer > reports that they get a complaint about a too-large memory allocation > every time they access a table, they've probably got a corrupted > varlena header. I certainly learned the same lesson in the same way. > However, that's extremely non-obvious to a typical > user. We should try to report errors indicative of corruption in a way > that gives the user some clue that corruption has happened. Peter made > a stab at improving things there by adding > errcode(ERRCODE_DATA_CORRUPTED) in a bunch of places, but a lot of > users will never see the error code, only the message, and a lot of > corruption still produces errors that weren't changed by that > commit. The theory is that "can't happen" errors have an errcode that should be considered similar to or equivalent to ERRCODE_DATA_CORRUPTED. I doubt that it works out that way in practice, though. -- Peter Geoghegan
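To make "a little closer to PageGetItemIdCareful()" concrete: the careful variant's essential move is to validate a line pointer's offset and length against the page bounds before anything dereferences it. The sketch below is a simplified illustration of that kind of check, not the actual code in contrib/amcheck.

#include "postgres.h"
#include "storage/bufpage.h"
#include "storage/itemid.h"

/* Simplified sketch of a "careful" line pointer fetch.  Illustrative only. */
static ItemId
get_item_id_carefully(Page page, OffsetNumber offnum)
{
	ItemId		itemid;

	if (offnum > PageGetMaxOffsetNumber(page))
		ereport(ERROR,
				(errcode(ERRCODE_DATA_CORRUPTED),
				 errmsg("line pointer %u past end of line pointer array",
						offnum)));

	itemid = PageGetItemId(page, offnum);

	if (ItemIdHasStorage(itemid) &&
		(ItemIdGetOffset(itemid) < SizeOfPageHeaderData ||
		 ItemIdGetOffset(itemid) + ItemIdGetLength(itemid) > BLCKSZ))
		ereport(ERROR,
				(errcode(ERRCODE_DATA_CORRUPTED),
				 errmsg("line pointer %u points outside the page", offnum)));

	return itemid;
}

Whether checks of this sort could be pushed into the regular PageGetItemId() path without measurable overhead is exactly the open question raised above.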
On 2020-May-14, Robert Haas wrote: > I have a question about what you mean here by "arbitrarily." > > If you mean that we shouldn't have the buildfarm run the proposed heap > corruption checker against heap pages full of randomly-generated > garbage, I tend to agree. Such a test wouldn't be very stable and > might fail in lots of low-probability ways that could require > unreasonable effort to find and fix. This is what I meant. I was thinking of blocks generated randomly. > If you mean that we shouldn't have the buildfarm run the proposed heap > corruption checker against any corrupted heap pages at all, I tend to > disagree. Yeah, IMV those would not be arbitrarily corrupted -- instead they're crafted to be corrupted in some specific way. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes: > On 2020-May-14, Robert Haas wrote: >> If you mean that we shouldn't have the buildfarm run the proposed heap >> corruption checker against heap pages full of randomly-generated >> garbage, I tend to agree. Such a test wouldn't be very stable and >> might fail in lots of low-probability ways that could require >> unreasonable effort to find and fix. > This is what I meant. I was thinking of blocks generated randomly. Yeah, -1 for using random data --- when it fails, how you gonna reproduce the problem? >> If you mean that we shouldn't have the buildfarm run the proposed heap >> corruption checker against any corrupted heap pages at all, I tend to >> disagree. > Yeah, IMV those would not be arbitrarily corrupted -- instead they're > crafted to be corrupted in some specific way. I think there's definitely value in corrupting data in some predictable (reproducible) way and verifying that the check code catches it and responds as expected. Sure, this will not be 100% coverage, but it'll be a lot better than 0% coverage. regards, tom lane
On 2020-05-11 19:21, Mark Dilger wrote: > 1) A new module, pg_amcheck, which includes a command line client for checking a database or subset of a database. Internally it functions by querying the database for a list of tables which are appropriate given the command line switches, and then calls amcheck's functions to validate each table and/or index. The options for selecting/excluding tables and schemas is patterned on pg_dump, on the assumption that interface is already familiar to users. Why is this useful over just using the extension's functions via psql? I suppose you could make an argument for a command-line wrapper around almost every admin-focused contrib module (pageinspect, pg_prewarm, pgstattuple, ...), but that doesn't seem very sensible. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On May 14, 2020, at 1:02 PM, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote: > > On 2020-05-11 19:21, Mark Dilger wrote: >> 1) A new module, pg_amcheck, which includes a command line client for checking a database or subset of a database. Internally it functions by querying the database for a list of tables which are appropriate given the command line switches, and then calls amcheck's functions to validate each table and/or index. The options for selecting/excluding tables and schemas is patterned on pg_dump, on the assumption that interface is already familiar to users. > > Why is this useful over just using the extension's functions via psql? The tool doesn't hold a single snapshot or transaction for the lifetime of checking the entire database. A future improvement to the tool might add parallelism. Users could do all of this in scripts, but having a single tool with the most commonly useful options avoids duplication of effort. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, May 11, 2020 at 10:51 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > Here is v5 of the patch. Major changes in this version include: > > 1) A new module, pg_amcheck, which includes a command line client for checking a database or subset of a database. Internallyit functions by querying the database for a list of tables which are appropriate given the command line switches,and then calls amcheck's functions to validate each table and/or index. The options for selecting/excluding tablesand schemas is patterned on pg_dump, on the assumption that interface is already familiar to users. > > 2) amcheck's btree checking functions have been refactored to be able to operate in two modes; the original mode in whichall errors are reported via ereport, and a new mode for returning errors as rows from a set returning function. Thenew mode is used by a new function verify_btreeam(), analogous to verify_heapam(), both of which are used by the pg_amcheckcommand line tool. > > 3) The regression test which generates corruption within a table uses the pageinspect module to determine the locationof each tuple on disk for corrupting. This was suggested upthread. > > Testing on the command line shows that the pre-existing btree checking code could use some hardening, as it currently crashesthe backend on certain corruptions. When I corrupt relation files for tables and indexes in the backend and thenuse pg_amcheck to check all objects in the database, I keep getting assertions from the btree checking code. I thinkI need to harden this code, but wanted to post an updated patch and solicit opinions before doing so. Here are someexample problems I'm seeing. Note the stack trace when calling from the command line tool includes the new verify_btreeamfunction, but you can get the same crashes using the old interface via psql: > > From psql, first error: > > test=# select bt_index_parent_check('corrupted_idx', true, true); > TRAP: FailedAssertion("_bt_check_natts(rel, key->heapkeyspace, page, offnum)", File: "nbtsearch.c", Line: 663) > 0 postgres 0x0000000106872977 ExceptionalCondition + 103 > 1 postgres 0x00000001063a33e2 _bt_compare + 1090 > 2 amcheck.so 0x0000000106d62921 bt_target_page_check + 6033 > 3 amcheck.so 0x0000000106d5fd2f bt_index_check_internal + 2847 > 4 amcheck.so 0x0000000106d60433 bt_index_parent_check + 67 > 5 postgres 0x00000001064d6762 ExecInterpExpr + 1634 > 6 postgres 0x000000010650d071 ExecResult + 321 > 7 postgres 0x00000001064ddc3d standard_ExecutorRun + 301 > 8 postgres 0x00000001066600c5 PortalRunSelect + 389 > 9 postgres 0x000000010665fc7f PortalRun + 527 > 10 postgres 0x000000010665ed59 exec_simple_query + 1641 > 11 postgres 0x000000010665c99d PostgresMain + 3661 > 12 postgres 0x00000001065d6a8a BackendRun + 410 > 13 postgres 0x00000001065d61c4 ServerLoop + 3044 > 14 postgres 0x00000001065d2fe9 PostmasterMain + 3769 > 15 postgres 0x000000010652e3b0 help + 0 > 16 libdyld.dylib 0x00007fff6725fcc9 start + 1 > server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. > The connection to the server was lost. 
Attempting reset: 2020-05-11 10:11:47.394 PDT [41091] LOG: server process (PID41309) was terminated by signal 6: Abort trap: 6 > > > > From commandline, second error: > > pgtest % pg_amcheck -i test > (relname=corrupted,blkno=0,offnum=16,lp_off=7680,lp_flags=1,lp_len=31,attnum=,chunk=) > tuple xmin = 3289393 is in the future > (relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=) > tuple xmax = 0 precedes relation relminmxid = 1 > (relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=) > tuple xmin = 12593 is in the future > (relname=corrupted,blkno=0,offnum=17,lp_off=7648,lp_flags=1,lp_len=31,attnum=,chunk=) > > <snip> > > (relname=corrupted,blkno=107,offnum=20,lp_off=7392,lp_flags=1,lp_len=34,attnum=,chunk=) > tuple xmin = 306 precedes relation relfrozenxid = 487 > (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=) > tuple xmax = 0 precedes relation relminmxid = 1 > (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=) > tuple xmin = 305 precedes relation relfrozenxid = 487 > (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=) > t_hoff > lp_len (54 > 34) > (relname=corrupted,blkno=107,offnum=22,lp_off=7312,lp_flags=1,lp_len=34,attnum=,chunk=) > t_hoff not max-aligned (54) > TRAP: FailedAssertion("TransactionIdIsValid(xmax)", File: "heapam_visibility.c", Line: 1319) > 0 postgres 0x0000000105b22977 ExceptionalCondition + 103 > 1 postgres 0x0000000105636e86 HeapTupleSatisfiesVacuum + 1158 > 2 postgres 0x0000000105634aa1 heapam_index_build_range_scan + 1089 > 3 amcheck.so 0x00000001060100f3 bt_index_check_internal + 3811 > 4 amcheck.so 0x000000010601057c verify_btreeam + 316 > 5 postgres 0x0000000105796266 ExecMakeTableFunctionResult + 422 > 6 postgres 0x00000001057a8c35 FunctionNext + 101 > 7 postgres 0x00000001057bbf3e ExecNestLoop + 478 > 8 postgres 0x000000010578dc3d standard_ExecutorRun + 301 > 9 postgres 0x00000001059100c5 PortalRunSelect + 389 > 10 postgres 0x000000010590fc7f PortalRun + 527 > 11 postgres 0x000000010590ed59 exec_simple_query + 1641 > 12 postgres 0x000000010590c99d PostgresMain + 3661 > 13 postgres 0x0000000105886a8a BackendRun + 410 > 14 postgres 0x00000001058861c4 ServerLoop + 3044 > 15 postgres 0x0000000105882fe9 PostmasterMain + 3769 > 16 postgres 0x00000001057de3b0 help + 0 > 17 libdyld.dylib 0x00007fff6725fcc9 start + 1 > pg_amcheck: error: query failed: server closed the connection unexpectedly > This probably means the server terminated abnormally > before or while processing the request. I have just browsed through the patch and the idea is quite interesting. I think we can expand it to check that whether the flags set in the infomask are sane or not w.r.t other flags and xid status. Some examples are - If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED should not be set in new_infomask2. - If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we actually cross verify the transaction status from the CLOG and check whether is matching the hint bit or not. While browsing through the code I could not find that we are doing this kind of check, ignore if we are already checking this. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
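To make the first of those suggestions concrete, here is a rough sketch of how such a cross-check could be reported in verify_heapam's style. The confess() reporting helper and the ctx->tuphdr field are from the patch; the function name and the HeapCheckContext struct name are assumed here for illustration:

/* Illustrative sketch of the proposed infomask cross-check. */
static void
check_lock_only_vs_keys_updated(HeapCheckContext *ctx)
{
    uint16      infomask = ctx->tuphdr->t_infomask;
    uint16      infomask2 = ctx->tuphdr->t_infomask2;

    /* A locker-only xmax should not also claim that key columns changed. */
    if ((infomask & HEAP_XMAX_LOCK_ONLY) &&
        (infomask2 & HEAP_KEYS_UPDATED))
        confess(ctx,
                pstrdup("HEAP_XMAX_LOCK_ONLY and HEAP_KEYS_UPDATED both set"));
}

The clog cross-check in the second suggestion is a different animal, since it requires looking up transaction status rather than just testing bits; that comes up again below.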
> On May 11, 2020, at 10:21 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > <v5-0001-Adding-verify_heapam-and-pg_amcheck.patch> Rebased with some whitespace fixes, but otherwise unmodified from v5. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
> On Jun 11, 2020, at 9:14 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have just browsed through the patch and the idea is quite > interesting. I think we can expand it to check that whether the flags > set in the infomask are sane or not w.r.t other flags and xid status. > Some examples are > > - If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED > should not be set in new_infomask2. > - If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we > actually cross verify the transaction status from the CLOG and check > whether is matching the hint bit or not. > > While browsing through the code I could not find that we are doing > this kind of check, ignore if we are already checking this. Thanks for taking a look! Having both of those bits set simultaneously appears to fall into a different category than what I wrote verify_heapam.c to detect. It doesn't violate any assertion in the backend, nor does it cause the code to crash. (At least, I don't immediately see how it does either of those things.) At first glance it appears invalid to have those bits both set simultaneously, but I'm hesitant to enforce that without good reason. If it is a good thing to enforce, should we also change the backend code to Assert? I integrated your idea into one of the regression tests. It now sets these two bits in the header of one of the rows in a table. The verify_heapam check output (which includes all detected corruptions) does not change, which verifies your observation that verify_heapam is not checking for this. I've attached that as a patch to this email. Note that this patch should be applied atop the v6 patch recently posted in another email. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Fri, Jun 12, 2020 at 12:40 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > > > > On Jun 11, 2020, at 9:14 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have just browsed through the patch and the idea is quite > > interesting. I think we can expand it to check that whether the flags > > set in the infomask are sane or not w.r.t other flags and xid status. > > Some examples are > > > > - If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED > > should not be set in new_infomask2. > > - If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we > > actually cross verify the transaction status from the CLOG and check > > whether is matching the hint bit or not. > > > > While browsing through the code I could not find that we are doing > > this kind of check, ignore if we are already checking this. > > Thanks for taking a look! > > Having both of those bits set simultaneously appears to fall into a different category than what I wrote verify_heapam.cto detect. Ok It doesn't violate any assertion in the backend, nor does it cause the code to crash. (At least, I don't immediately see how it does either of those things.) At first glance it appears invalid to have those bits both set simultaneously, but I'm hesitant to enforce that without good reason. If it is a good thing to enforce, should we also change the backend code to Assert? Yeah, it may not hit assert or crash but it could lead to a wrong result. But I agree that it could be an assertion in the backend code. What about the other check, like hint bit is saying the transaction is committed but actually as per the clog the status is something else. I think in general processing it is hard to check such things in backend no? because if the hint bit is set saying that the transaction is committed then we will directly check its visibility with the snapshot. I think a corruption checker may be a good tool for catching such anomalies. > I integrated your idea into one of the regression tests. It now sets these two bits in the header of one of the rows ina table. The verify_heapam check output (which includes all detected corruptions) does not change, which verifies yourobservation that verifies _heapam is not checking for this. I've attached that as a patch to this email. Note thatthis patch should be applied atop the v6 patch recently posted in another email. Ok. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
> On Jun 11, 2020, at 11:35 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Fri, Jun 12, 2020 at 12:40 AM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >> >> >> >>> On Jun 11, 2020, at 9:14 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: >>> >>> I have just browsed through the patch and the idea is quite >>> interesting. I think we can expand it to check that whether the flags >>> set in the infomask are sane or not w.r.t other flags and xid status. >>> Some examples are >>> >>> - If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED >>> should not be set in new_infomask2. >>> - If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we >>> actually cross verify the transaction status from the CLOG and check >>> whether is matching the hint bit or not. >>> >>> While browsing through the code I could not find that we are doing >>> this kind of check, ignore if we are already checking this. >> >> Thanks for taking a look! >> >> Having both of those bits set simultaneously appears to fall into a different category than what I wrote verify_heapam.c to detect. > > Ok > > >> It doesn't violate any assertion in the backend, nor does it cause >> the code to crash. (At least, I don't immediately see how it does >> either of those things.) At first glance it appears invalid to have >> those bits both set simultaneously, but I'm hesitant to enforce that >> without good reason. If it is a good thing to enforce, should we also >> change the backend code to Assert? > > Yeah, it may not hit assert or crash but it could lead to a wrong > result. But I agree that it could be an assertion in the backend > code. For v7, I've added an assertion for this. Per heap/README.tuplock, "We currently never set the HEAP_XMAX_COMMITTED when the HEAP_XMAX_IS_MULTI bit is set." I added an assertion for that, too. Both new assertions are in RelationPutHeapTuple(). I'm not sure if that is the best place to put the assertion, but I am confident that the assertion needs to only check tuples destined for disk, as in-memory tuples can and do violate the assertion. Also for v7, I've updated contrib/amcheck to report these two conditions as corruption. > What about the other check, like hint bit is saying the > transaction is committed but actually as per the clog the status is > something else. I think in general processing it is hard to check > such things in backend no? because if the hint bit is set saying that > the transaction is committed then we will directly check its > visibility with the snapshot. I think a corruption checker may be a > good tool for catching such anomalies. I already made some design changes to this patch to avoid taking the CLogTruncationLock too often. I'm happy to incorporate this idea, but perhaps you could provide a design on how to do it without all the extra locking? If not, I can try to get this into v8 as an optional check, so users can turn it on at their discretion. Having the check enabled by default is probably a non-starter. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
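For what it's worth, a sketch of the shape such an optional clog cross-check might take, ignoring for the moment the locking problem described above. TransactionIdDidCommit() throws an error if the xid has already been truncated away, so the real thing would have to validate the xid against the known-valid range first and guard against concurrent clog truncation; the helper name, struct name, and message wording are illustrative only:

/* Illustrative only: cross-check the XMIN_COMMITTED hint against clog. */
static void
check_xmin_committed_hint(HeapCheckContext *ctx)
{
    TransactionId xmin = HeapTupleHeaderGetXmin(ctx->tuphdr);

    if ((ctx->tuphdr->t_infomask & HEAP_XMIN_COMMITTED) &&
        TransactionIdIsNormal(xmin) &&
        !TransactionIdDidCommit(xmin))
        confess(ctx,
                psprintf("tuple has HEAP_XMIN_COMMITTED set, but xmin %u "
                         "did not commit according to clog", xmin));
}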
On 2020-06-12 23:06, Mark Dilger wrote: > [v7-0001-Adding-verify_heapam-and-pg_amcheck.patch] > [v7-0002-Adding-checks-o...ations-of-hint-bit.patch] I came across these typos in the sgml: --exclude-scheam should be --exclude-schema <option>table</option> should be <option>--table</option> I found this connection problem (or perhaps it is as designed): $ env | grep ^PG PGPORT=6965 PGPASSFILE=/home/aardvark/.pg_aardvark PGDATABASE=testdb PGDATA=/home/aardvark/pg_stuff/pg_installations/pgsql.amcheck/data -- just to show that psql is connecting (via $PGPASSFILE and $PGPORT and $PGDATABASE): -- and showing a table t that I made earlier $ psql SET Timing is on. psql (14devel_amcheck_0612_2f48) Type "help" for help. testdb=# \dt+ t List of relations Schema | Name | Type | Owner | Persistence | Size | Description --------+------+-------+----------+-------------+--------+------------- public | t | table | aardvark | permanent | 346 MB | (1 row) testdb=# \q I think this should work: $ pg_amcheck -i -t t pg_amcheck: error: no matching tables were found It seems a bug that I have to add '-d testdb': This works OK: pg_amcheck -i -t t -d testdb Is that error as expected? thanks, Erik Rijkers
> On Jun 13, 2020, at 2:13 PM, Erik Rijkers <er@xs4all.nl> wrote: Thanks for the review! > On 2020-06-12 23:06, Mark Dilger wrote: > >> [v7-0001-Adding-verify_heapam-and-pg_amcheck.patch] >> [v7-0002-Adding-checks-o...ations-of-hint-bit.patch] > > I came across these typos in the sgml: > > --exclude-scheam should be > --exclude-schema > > <option>table</option> should be > <option>--table</option> Yeah, I agree and have made these changes for v8. > I found this connection problem (or perhaps it is as designed): > > $ env | grep ^PG > PGPORT=6965 > PGPASSFILE=/home/aardvark/.pg_aardvark > PGDATABASE=testdb > PGDATA=/home/aardvark/pg_stuff/pg_installations/pgsql.amcheck/data > > -- just to show that psql is connecting (via $PGPASSFILE and $PGPORT and $PGDATABASE): > -- and showing a table t that I made earlier > > $ psql > SET > Timing is on. > psql (14devel_amcheck_0612_2f48) > Type "help" for help. > > testdb=# \dt+ t > List of relations > Schema | Name | Type | Owner | Persistence | Size | Description > --------+------+-------+----------+-------------+--------+------------- > public | t | table | aardvark | permanent | 346 MB | > (1 row) > > testdb=# \q > > I think this should work: > > $ pg_amcheck -i -t t > pg_amcheck: error: no matching tables were found > > It seems a bug that I have to add '-d testdb': > > This works OK: > pg_amcheck -i -t t -d testdb > > Is that error as expected? It was expected, but looking more broadly at other tools, your expectations seem to be more typical. I've changed it in v8. Thanks again for having a look at this patch! Note that I've merged the two patches (v7-0001 and v7-0002) back into a single patch, since the separation introduced in v7 was only for illustration of changes in v7. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Sat, Jun 13, 2020 at 2:36 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > > > > On Jun 11, 2020, at 11:35 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > On Fri, Jun 12, 2020 at 12:40 AM Mark Dilger > > <mark.dilger@enterprisedb.com> wrote: > >> > >> > >> > >>> On Jun 11, 2020, at 9:14 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > >>> > >>> I have just browsed through the patch and the idea is quite > >>> interesting. I think we can expand it to check that whether the flags > >>> set in the infomask are sane or not w.r.t other flags and xid status. > >>> Some examples are > >>> > >>> - If HEAP_XMAX_LOCK_ONLY is set in infomask then HEAP_KEYS_UPDATED > >>> should not be set in new_infomask2. > >>> - If HEAP_XMIN(XMAX)_COMMITTED is set in the infomask then can we > >>> actually cross verify the transaction status from the CLOG and check > >>> whether is matching the hint bit or not. > >>> > >>> While browsing through the code I could not find that we are doing > >>> this kind of check, ignore if we are already checking this. > >> > >> Thanks for taking a look! > >> > >> Having both of those bits set simultaneously appears to fall into a different category than what I wrote verify_heapam.cto detect. > > > > Ok > > > > > >> It doesn't violate any assertion in the backend, nor does it cause > >> the code to crash. (At least, I don't immediately see how it does > >> either of those things.) At first glance it appears invalid to have > >> those bits both set simultaneously, but I'm hesitant to enforce that > >> without good reason. If it is a good thing to enforce, should we also > >> change the backend code to Assert? > > > > Yeah, it may not hit assert or crash but it could lead to a wrong > > result. But I agree that it could be an assertion in the backend > > code. > > For v7, I've added an assertion for this. Per heap/README.tuplock, "We currently never set the HEAP_XMAX_COMMITTED whenthe HEAP_XMAX_IS_MULTI bit is set." I added an assertion for that, too. Both new assertions are in RelationPutHeapTuple(). I'm not sure if that is the best place to put the assertion, but I am confident that the assertionneeds to only check tuples destined for disk, as in memory tuples can and do violate the assertion. > > Also for v7, I've updated contrib/amcheck to report these two conditions as corruption. > > > What about the other check, like hint bit is saying the > > transaction is committed but actually as per the clog the status is > > something else. I think in general processing it is hard to check > > such things in backend no? because if the hint bit is set saying that > > the transaction is committed then we will directly check its > > visibility with the snapshot. I think a corruption checker may be a > > good tool for catching such anomalies. > > I already made some design changes to this patch to avoid taking the CLogTruncationLock too often. I'm happy to incorporatethis idea, but perhaps you could provide a design on how to do it without all the extra locking? If not, I cantry to get this into v8 as an optional check, so users can turn it on at their discretion. Having the check enabled bydefault is probably a non-starter. Okay, even I can't think a way to do it without an extra locking. I have looked into 0001 patch and I have a few comments. 1. + + /* Skip over unused/dead/redirected line pointers */ + if (!ItemIdIsUsed(ctx.itemid) || + ItemIdIsDead(ctx.itemid) || + ItemIdIsRedirected(ctx.itemid)) + continue; Isn't it a good idea to verify the Redirected Itemtid? 
Because we will still access the redirected item id to find the actual tuple from the index scan. Maybe not exactly at this level, but we can verify that the link itemid store in that is within the itemid range of the page or not. 2. + /* Check for tuple header corruption */ + if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader) + { + confess(ctx, + psprintf("t_hoff < SizeofHeapTupleHeader (%u < %u)", + ctx->tuphdr->t_hoff, + (unsigned) SizeofHeapTupleHeader)); + fatal = true; + } I think we can also check that if there is no NULL attributes (if (!(t_infomask & HEAP_HASNULL)) then ctx->tuphdr->t_hoff should be equal to SizeofHeapTupleHeader. 3. + ctx->offset = 0; + for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++) + { + if (!check_tuple_attribute(ctx)) + break; + } + ctx->offset = -1; + ctx->attnum = -1; So we are first setting ctx->offset to 0, then inside check_tuple_attribute, we will keep updating the offset as we process the attributes and after the loop is over we set ctx->offset to -1, I did not understand that why we need to reset it to -1, do we ever check for that. We don't even initialize the ctx->offset to -1 while initializing the context for the tuple so I do not understand what is the meaning of the random value -1. 4. + if (!VARATT_IS_EXTENDED(chunk)) + { + chunksize = VARSIZE(chunk) - VARHDRSZ; + chunkdata = VARDATA(chunk); + } + else if (VARATT_IS_SHORT(chunk)) + { + /* + * could happen due to heap_form_tuple doing its thing + */ + chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT; + chunkdata = VARDATA_SHORT(chunk); + } + else + { + /* should never happen */ + confess(ctx, + pstrdup("toast chunk is neither short nor extended")); + return; + } I think the error message "toast chunk is neither short nor extended". Because ideally, the toast chunk should not be further toasted. So I think the check is correct, but the error message is not correct. 5. + ctx.rel = relation_open(relid, ShareUpdateExclusiveLock); + check_relation_relkind_and_relam(ctx.rel); + + /* + * Open the toast relation, if any, also protected from concurrent + * vacuums. + */ + if (ctx.rel->rd_rel->reltoastrelid) + { + int offset; + + /* Main relation has associated toast relation */ + ctx.toastrel = table_open(ctx.rel->rd_rel->reltoastrelid, + ShareUpdateExclusiveLock); + offset = toast_open_indexes(ctx.toastrel, .... + if (TransactionIdIsNormal(ctx.relfrozenxid) && + TransactionIdPrecedes(ctx.relfrozenxid, ctx.oldestValidXid)) + { + confess(&ctx, psprintf("relfrozenxid %u precedes global " + "oldest valid xid %u ", + ctx.relfrozenxid, ctx.oldestValidXid)); + PG_RETURN_NULL(); + } Don't we need to close the relation/toastrel/toastindexrel in such return which is without an abort? IIRC, we will get relcache leak WARNING on commit if we left them open in commit path. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
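Regarding point 2, a sketch of what the header-length checks could look like once alignment padding is taken into account. The first three checks roughly correspond to messages the patch already emits; the final HEAP_HASNULL check is the new one being proposed here, and the helper name and context struct name are illustrative:

/* Illustrative sketch; lp_len is the length from the tuple's line pointer. */
static bool
check_tuple_header_hoff(HeapCheckContext *ctx, uint16 lp_len)
{
    HeapTupleHeader tuphdr = ctx->tuphdr;
    uint8       t_hoff = tuphdr->t_hoff;

    if (t_hoff < SizeofHeapTupleHeader)
    {
        confess(ctx, psprintf("t_hoff < SizeofHeapTupleHeader (%u < %u)",
                              t_hoff, (unsigned) SizeofHeapTupleHeader));
        return false;
    }
    if (t_hoff > lp_len)
    {
        confess(ctx, psprintf("t_hoff > lp_len (%u > %u)", t_hoff, lp_len));
        return false;
    }
    if (t_hoff != MAXALIGN(t_hoff))
    {
        confess(ctx, psprintf("t_hoff not max-aligned (%u)", t_hoff));
        return false;
    }

    /* Without a nulls bitmap, data starts right after the aligned fixed header. */
    if (!(tuphdr->t_infomask & HEAP_HASNULL) &&
        t_hoff != MAXALIGN(SizeofHeapTupleHeader))
    {
        confess(ctx, psprintf("tuple without nulls bitmap has t_hoff %u, "
                              "expected %u",
                              t_hoff, (unsigned) MAXALIGN(SizeofHeapTupleHeader)));
        return false;
    }

    return true;
}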
> On Jun 21, 2020, at 2:54 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > I have looked into 0001 patch and I have a few comments. > > 1. > + > + /* Skip over unused/dead/redirected line pointers */ > + if (!ItemIdIsUsed(ctx.itemid) || > + ItemIdIsDead(ctx.itemid) || > + ItemIdIsRedirected(ctx.itemid)) > + continue; > > Isn't it a good idea to verify the Redirected Itemtid? Because we > will still access the redirected item id to find the > actual tuple from the index scan. Maybe not exactly at this level, > but we can verify that the link itemid store in that > is within the itemid range of the page or not. Good idea. I've added checks that the redirection is valid, both in terms of being within bounds and in terms of alignment. > 2. > > + /* Check for tuple header corruption */ > + if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader) > + { > + confess(ctx, > + psprintf("t_hoff < SizeofHeapTupleHeader (%u < %u)", > + ctx->tuphdr->t_hoff, > + (unsigned) SizeofHeapTupleHeader)); > + fatal = true; > + } > > I think we can also check that if there is no NULL attributes (if > (!(t_infomask & HEAP_HASNULL)) then > ctx->tuphdr->t_hoff should be equal to SizeofHeapTupleHeader. You have to take alignment padding into account, but otherwise yes, and I've added a check for that. > 3. > + ctx->offset = 0; > + for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++) > + { > + if (!check_tuple_attribute(ctx)) > + break; > + } > + ctx->offset = -1; > + ctx->attnum = -1; > > So we are first setting ctx->offset to 0, then inside > check_tuple_attribute, we will keep updating the offset as we process > the attributes and after the loop is over we set ctx->offset to -1, I > did not understand that why we need to reset it to -1, do we ever > check for that. We don't even initialize the ctx->offset to -1 while > initializing the context for the tuple so I do not understand what is > the meaning of the random value -1. Ahh, right, those are left over from a previous design of the code. Thanks for pointing them out. They are now removed. > 4. > + if (!VARATT_IS_EXTENDED(chunk)) > + { > + chunksize = VARSIZE(chunk) - VARHDRSZ; > + chunkdata = VARDATA(chunk); > + } > + else if (VARATT_IS_SHORT(chunk)) > + { > + /* > + * could happen due to heap_form_tuple doing its thing > + */ > + chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT; > + chunkdata = VARDATA_SHORT(chunk); > + } > + else > + { > + /* should never happen */ > + confess(ctx, > + pstrdup("toast chunk is neither short nor extended")); > + return; > + } > > I think the error message "toast chunk is neither short nor extended". > Because ideally, the toast chunk should not be further toasted. > So I think the check is correct, but the error message is not correct. I agree the error message was wrongly stated, and I've changed it, but you might suggest a better wording than what I cameup with, "corrupt toast chunk va_header". > 5. > > + ctx.rel = relation_open(relid, ShareUpdateExclusiveLock); > + check_relation_relkind_and_relam(ctx.rel); > + > + /* > + * Open the toast relation, if any, also protected from concurrent > + * vacuums. > + */ > + if (ctx.rel->rd_rel->reltoastrelid) > + { > + int offset; > + > + /* Main relation has associated toast relation */ > + ctx.toastrel = table_open(ctx.rel->rd_rel->reltoastrelid, > + ShareUpdateExclusiveLock); > + offset = toast_open_indexes(ctx.toastrel, > .... 
> + if (TransactionIdIsNormal(ctx.relfrozenxid) && > + TransactionIdPrecedes(ctx.relfrozenxid, ctx.oldestValidXid)) > + { > + confess(&ctx, psprintf("relfrozenxid %u precedes global " > + "oldest valid xid %u ", > + ctx.relfrozenxid, ctx.oldestValidXid)); > + PG_RETURN_NULL(); > + } > > Don't we need to close the relation/toastrel/toastindexrel in such > return which is without an abort? IIRC, we > will get relcache leak WARNING on commit if we left them open in commit path. Ok, I've added logic to close them. All changes inspired by your review are included in the v9-0001 patch. The differences since v8 are pulled out into v9_diffsfor easier review. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
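As an illustration of the first of those changes, the redirect validation could look roughly like this; maxoff is the page's PageGetMaxOffsetNumber() value from the main per-page loop, and the helper name and struct name are invented for this sketch:

/* Illustrative only: a redirect must point at another line pointer on the same page. */
static bool
check_redirect_target(HeapCheckContext *ctx, ItemId itemid, OffsetNumber maxoff)
{
    OffsetNumber rdoffnum = ItemIdGetRedirect(itemid);

    if (rdoffnum < FirstOffsetNumber || rdoffnum > maxoff)
    {
        confess(ctx, psprintf("line pointer redirection to item at offset %u "
                              "exceeds maximum offset %u",
                              (unsigned) rdoffnum, (unsigned) maxoff));
        return false;
    }
    return true;
}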
On Mon, Jun 22, 2020 at 5:44 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > > > > On Jun 21, 2020, at 2:54 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > I have looked into 0001 patch and I have a few comments. > > > > 1. > > + > > + /* Skip over unused/dead/redirected line pointers */ > > + if (!ItemIdIsUsed(ctx.itemid) || > > + ItemIdIsDead(ctx.itemid) || > > + ItemIdIsRedirected(ctx.itemid)) > > + continue; > > > > Isn't it a good idea to verify the Redirected Itemtid? Because we > > will still access the redirected item id to find the > > actual tuple from the index scan. Maybe not exactly at this level, > > but we can verify that the link itemid store in that > > is within the itemid range of the page or not. > > Good idea. I've added checks that the redirection is valid, both in terms of being within bounds and in terms of alignment. > > > 2. > > > > + /* Check for tuple header corruption */ > > + if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader) > > + { > > + confess(ctx, > > + psprintf("t_hoff < SizeofHeapTupleHeader (%u < %u)", > > + ctx->tuphdr->t_hoff, > > + (unsigned) SizeofHeapTupleHeader)); > > + fatal = true; > > + } > > > > I think we can also check that if there is no NULL attributes (if > > (!(t_infomask & HEAP_HASNULL)) then > > ctx->tuphdr->t_hoff should be equal to SizeofHeapTupleHeader. > > You have to take alignment padding into account, but otherwise yes, and I've added a check for that. > > > 3. > > + ctx->offset = 0; > > + for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++) > > + { > > + if (!check_tuple_attribute(ctx)) > > + break; > > + } > > + ctx->offset = -1; > > + ctx->attnum = -1; > > > > So we are first setting ctx->offset to 0, then inside > > check_tuple_attribute, we will keep updating the offset as we process > > the attributes and after the loop is over we set ctx->offset to -1, I > > did not understand that why we need to reset it to -1, do we ever > > check for that. We don't even initialize the ctx->offset to -1 while > > initializing the context for the tuple so I do not understand what is > > the meaning of the random value -1. > > Ahh, right, those are left over from a previous design of the code. Thanks for pointing them out. They are now removed. > > > 4. > > + if (!VARATT_IS_EXTENDED(chunk)) > > + { > > + chunksize = VARSIZE(chunk) - VARHDRSZ; > > + chunkdata = VARDATA(chunk); > > + } > > + else if (VARATT_IS_SHORT(chunk)) > > + { > > + /* > > + * could happen due to heap_form_tuple doing its thing > > + */ > > + chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT; > > + chunkdata = VARDATA_SHORT(chunk); > > + } > > + else > > + { > > + /* should never happen */ > > + confess(ctx, > > + pstrdup("toast chunk is neither short nor extended")); > > + return; > > + } > > > > I think the error message "toast chunk is neither short nor extended". > > Because ideally, the toast chunk should not be further toasted. > > So I think the check is correct, but the error message is not correct. > > I agree the error message was wrongly stated, and I've changed it, but you might suggest a better wording than what I cameup with, "corrupt toast chunk va_header". > > > 5. > > > > + ctx.rel = relation_open(relid, ShareUpdateExclusiveLock); > > + check_relation_relkind_and_relam(ctx.rel); > > + > > + /* > > + * Open the toast relation, if any, also protected from concurrent > > + * vacuums. 
> > + */ > > + if (ctx.rel->rd_rel->reltoastrelid) > > + { > > + int offset; > > + > > + /* Main relation has associated toast relation */ > > + ctx.toastrel = table_open(ctx.rel->rd_rel->reltoastrelid, > > + ShareUpdateExclusiveLock); > > + offset = toast_open_indexes(ctx.toastrel, > > .... > > + if (TransactionIdIsNormal(ctx.relfrozenxid) && > > + TransactionIdPrecedes(ctx.relfrozenxid, ctx.oldestValidXid)) > > + { > > + confess(&ctx, psprintf("relfrozenxid %u precedes global " > > + "oldest valid xid %u ", > > + ctx.relfrozenxid, ctx.oldestValidXid)); > > + PG_RETURN_NULL(); > > + } > > > > Don't we need to close the relation/toastrel/toastindexrel in such > > return which is without an abort? IIRC, we > > will get relcache leak WARNING on commit if we left them open in commit path. > > Ok, I've added logic to close them. > > All changes inspired by your review are included in the v9-0001 patch. The differences since v8 are pulled out into v9_diffsfor easier review. I have reviewed the changes in v9_diffs and looks fine to me. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
On Sun, Jun 28, 2020 at 8:59 PM Dilip Kumar <dilipbalaut@gmail.com> wrote: > > On Mon, Jun 22, 2020 at 5:44 AM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: > > > > > > > > > On Jun 21, 2020, at 2:54 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > > > I have looked into 0001 patch and I have a few comments. > > > > > > 1. > > > + > > > + /* Skip over unused/dead/redirected line pointers */ > > > + if (!ItemIdIsUsed(ctx.itemid) || > > > + ItemIdIsDead(ctx.itemid) || > > > + ItemIdIsRedirected(ctx.itemid)) > > > + continue; > > > > > > Isn't it a good idea to verify the Redirected Itemtid? Because we > > > will still access the redirected item id to find the > > > actual tuple from the index scan. Maybe not exactly at this level, > > > but we can verify that the link itemid store in that > > > is within the itemid range of the page or not. > > > > Good idea. I've added checks that the redirection is valid, both in terms of being within bounds and in terms of alignment. > > > > > 2. > > > > > > + /* Check for tuple header corruption */ > > > + if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader) > > > + { > > > + confess(ctx, > > > + psprintf("t_hoff < SizeofHeapTupleHeader (%u < %u)", > > > + ctx->tuphdr->t_hoff, > > > + (unsigned) SizeofHeapTupleHeader)); > > > + fatal = true; > > > + } > > > > > > I think we can also check that if there is no NULL attributes (if > > > (!(t_infomask & HEAP_HASNULL)) then > > > ctx->tuphdr->t_hoff should be equal to SizeofHeapTupleHeader. > > > > You have to take alignment padding into account, but otherwise yes, and I've added a check for that. > > > > > 3. > > > + ctx->offset = 0; > > > + for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++) > > > + { > > > + if (!check_tuple_attribute(ctx)) > > > + break; > > > + } > > > + ctx->offset = -1; > > > + ctx->attnum = -1; > > > > > > So we are first setting ctx->offset to 0, then inside > > > check_tuple_attribute, we will keep updating the offset as we process > > > the attributes and after the loop is over we set ctx->offset to -1, I > > > did not understand that why we need to reset it to -1, do we ever > > > check for that. We don't even initialize the ctx->offset to -1 while > > > initializing the context for the tuple so I do not understand what is > > > the meaning of the random value -1. > > > > Ahh, right, those are left over from a previous design of the code. Thanks for pointing them out. They are now removed. > > > > > 4. > > > + if (!VARATT_IS_EXTENDED(chunk)) > > > + { > > > + chunksize = VARSIZE(chunk) - VARHDRSZ; > > > + chunkdata = VARDATA(chunk); > > > + } > > > + else if (VARATT_IS_SHORT(chunk)) > > > + { > > > + /* > > > + * could happen due to heap_form_tuple doing its thing > > > + */ > > > + chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT; > > > + chunkdata = VARDATA_SHORT(chunk); > > > + } > > > + else > > > + { > > > + /* should never happen */ > > > + confess(ctx, > > > + pstrdup("toast chunk is neither short nor extended")); > > > + return; > > > + } > > > > > > I think the error message "toast chunk is neither short nor extended". > > > Because ideally, the toast chunk should not be further toasted. > > > So I think the check is correct, but the error message is not correct. > > > > I agree the error message was wrongly stated, and I've changed it, but you might suggest a better wording than what Icame up with, "corrupt toast chunk va_header". > > > > > 5. 
> > > > > > + ctx.rel = relation_open(relid, ShareUpdateExclusiveLock); > > > + check_relation_relkind_and_relam(ctx.rel); > > > + > > > + /* > > > + * Open the toast relation, if any, also protected from concurrent > > > + * vacuums. > > > + */ > > > + if (ctx.rel->rd_rel->reltoastrelid) > > > + { > > > + int offset; > > > + > > > + /* Main relation has associated toast relation */ > > > + ctx.toastrel = table_open(ctx.rel->rd_rel->reltoastrelid, > > > + ShareUpdateExclusiveLock); > > > + offset = toast_open_indexes(ctx.toastrel, > > > .... > > > + if (TransactionIdIsNormal(ctx.relfrozenxid) && > > > + TransactionIdPrecedes(ctx.relfrozenxid, ctx.oldestValidXid)) > > > + { > > > + confess(&ctx, psprintf("relfrozenxid %u precedes global " > > > + "oldest valid xid %u ", > > > + ctx.relfrozenxid, ctx.oldestValidXid)); > > > + PG_RETURN_NULL(); > > > + } > > > > > > Don't we need to close the relation/toastrel/toastindexrel in such > > > return which is without an abort? IIRC, we > > > will get relcache leak WARNING on commit if we left them open in commit path. > > > > Ok, I've added logic to close them. > > > > All changes inspired by your review are included in the v9-0001 patch. The differences since v8 are pulled out intov9_diffs for easier review. > > I have reviewed the changes in v9_diffs and looks fine to me. Some more comments on v9_0001. 1. + LWLockAcquire(XidGenLock, LW_SHARED); + nextFullXid = ShmemVariableCache->nextFullXid; + ctx.oldestValidXid = ShmemVariableCache->oldestXid; + LWLockRelease(XidGenLock); + ctx.nextKnownValidXid = XidFromFullTransactionId(nextFullXid); ... ... + + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++) + { + int32 mapbits; + OffsetNumber maxoff; + PageHeader ph; + + /* Optionally skip over all-frozen or all-visible blocks */ + if (skip_all_frozen || skip_all_visible) + { + mapbits = (int32) visibilitymap_get_status(ctx.rel, ctx.blkno, + &vmbuffer); + if (skip_all_visible && (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0) + continue; + if (skip_all_frozen && (mapbits & VISIBILITYMAP_ALL_FROZEN) != 0) + continue; + } + + /* Read and lock the next page. */ + ctx.buffer = ReadBufferExtended(ctx.rel, MAIN_FORKNUM, ctx.blkno, + RBM_NORMAL, ctx.bstrategy); + LockBuffer(ctx.buffer, BUFFER_LOCK_SHARE); I might be missing something, but it appears that first we are getting the nextFullXid and after that, we are scanning the block by block. So while we are scanning the block if the nextXid is advanced and it has updated some tuple in the heap pages, then it seems the current logic will complain about out of range xid. I did not test this behavior so please point me to the logic which is protecting this. 2. /* * Helper function to construct the TupleDesc needed by verify_heapam. */ static TupleDesc verify_heapam_tupdesc(void) From function name, it appeared that it is verifying tuple descriptor but this is just creating the tuple descriptor. 3. + /* Optionally skip over all-frozen or all-visible blocks */ + if (skip_all_frozen || skip_all_visible) + { + mapbits = (int32) visibilitymap_get_status(ctx.rel, ctx.blkno, + &vmbuffer); + if (skip_all_visible && (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0) + continue; + if (skip_all_frozen && (mapbits & VISIBILITYMAP_ALL_FROZEN) != 0) + continue; + } Here, do we want to test that in VM the all visible bit is set whereas on the page it is not set? That can lead to a wrong result in an index-only scan. 4. One cosmetic comment + /* Skip non-varlena values, but update offset first */ .. 
+ + /* Ok, we're looking at a varlena attribute. */ Throughout the patch, I have noticed that some of your single-line comments have "full stop" whereas other don't. Can we keep them consistent? -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
> On Jun 28, 2020, at 9:05 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > Some more comments on v9_0001. > 1. > + LWLockAcquire(XidGenLock, LW_SHARED); > + nextFullXid = ShmemVariableCache->nextFullXid; > + ctx.oldestValidXid = ShmemVariableCache->oldestXid; > + LWLockRelease(XidGenLock); > + ctx.nextKnownValidXid = XidFromFullTransactionId(nextFullXid); > ... > ... > + > + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++) > + { > + int32 mapbits; > + OffsetNumber maxoff; > + PageHeader ph; > + > + /* Optionally skip over all-frozen or all-visible blocks */ > + if (skip_all_frozen || skip_all_visible) > + { > + mapbits = (int32) visibilitymap_get_status(ctx.rel, ctx.blkno, > + &vmbuffer); > + if (skip_all_visible && (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0) > + continue; > + if (skip_all_frozen && (mapbits & VISIBILITYMAP_ALL_FROZEN) != 0) > + continue; > + } > + > + /* Read and lock the next page. */ > + ctx.buffer = ReadBufferExtended(ctx.rel, MAIN_FORKNUM, ctx.blkno, > + RBM_NORMAL, ctx.bstrategy); > + LockBuffer(ctx.buffer, BUFFER_LOCK_SHARE); > > I might be missing something, but it appears that first we are getting > the nextFullXid and after that, we are scanning the block by block. > So while we are scanning the block if the nextXid is advanced and it > has updated some tuple in the heap pages, then it seems the current > logic will complain about out of range xid. I did not test this > behavior so please point me to the logic which is protecting this. We know the oldest valid Xid cannot advance, because we hold a lock that would prevent it from doing so. We cannot knowthat the newest Xid will not advance, but when we see an Xid beyond the end of the known valid range, we check its validity,and either report it as a corruption or advance our idea of the newest valid Xid, depending on that check. Thatlogic is in TransactionIdValidInRel. > 2. > /* > * Helper function to construct the TupleDesc needed by verify_heapam. > */ > static TupleDesc > verify_heapam_tupdesc(void) > > From function name, it appeared that it is verifying tuple descriptor > but this is just creating the tuple descriptor. In amcheck--1.2--1.3.sql we define a function named verify_heapam which returns a set of records. This is the tuple descriptorfor that function. I understand that the name can be parsed as verify_(heapam_tupdesc), but it is meant as (verify_heapam)_tupdesc. Do you have a name you would prefer? > 3. > + /* Optionally skip over all-frozen or all-visible blocks */ > + if (skip_all_frozen || skip_all_visible) > + { > + mapbits = (int32) visibilitymap_get_status(ctx.rel, ctx.blkno, > + &vmbuffer); > + if (skip_all_visible && (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0) > + continue; > + if (skip_all_frozen && (mapbits & VISIBILITYMAP_ALL_FROZEN) != 0) > + continue; > + } > > Here, do we want to test that in VM the all visible bit is set whereas > on the page it is not set? That can lead to a wrong result in an > index-only scan. If the caller has specified that the corruption check should skip over all-frozen or all-visible data, then we cannot loadthe page that the VM claims is all-frozen or all-visible without defeating the purpose of the caller having specifiedthese options. Without loading the page, we cannot check the page's header bits. When not skipping all-visible or all-frozen blocks, we might like to pin both the heap page and the visibility map page inorder to compare the two, being careful not to hold a pin on the one while performing I/O on the other. 
See for example the logic in heap_delete(). But I'm not sure what guarantees the system makes about agreement between these two bits. Certainly, the VM should not claim a page is all visible when it isn't, but are we guaranteed that a page that is all-visible will always have its all-visible bit set? I don't know if (possibly transient) disagreement between these two bits constitutes corruption. Perhaps others following this thread can advise? > 4. One cosmetic comment > > + /* Skip non-varlena values, but update offset first */ > .. > + > + /* Ok, we're looking at a varlena attribute. */ > > Throughout the patch, I have noticed that some of your single-line > comments have "full stop" whereas other don't. Can we keep them > consistent? I try to use a "full stop" at the end of sentences, but not at the end of sentence fragments. To me, a "full stop" means that a sentence has reached its conclusion. I don't intentionally use one at the end of a fragment, unless the fragment precedes a full sentence, in which case the "full stop" is needed to separate the two. Of course, I may have violated my own rule in a few places, but before I submit a v10 patch with comment punctuation changes, perhaps we can agree on what the rule is? (This has probably been discussed before and agreed before. A link to the appropriate email thread would be sufficient.) For example: /* red, green, or blue */ /* set to pink */ /* set to blue. We have not closed the file. */ /* At this point, we have chosen the color. */ The first comment is not a sentence, but the fourth is. The third comment is a fragment followed by a full sentence, and a "full stop" separates the two. As for the second comment, as I recall, verb phrases can be interpreted as a full sentence, as in "Close the door!", when they are meant as commands to the listener, but not otherwise. "set to pink" is not a command to the reader, but rather a description of what the code is doing at that point, so I think of it as a mere verb phrase and not a full sentence. Making matters even more complicated, portions of the logic in verify_heapam were taken from sections of code that would ereport(), elog(), or Assert() on corruption, and when I took such code, I sometimes also took the comments in unmodified form. That means that my normal commenting rules don't apply, as I'm not the comment author in such cases. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
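For anyone reading along without the patch open, the shape of that xid-range logic is roughly as follows. This is a simplified sketch, not the actual TransactionIdValidInRel(); the field names follow the context setup quoted earlier in the thread, and the function name here is invented:

/*
 * Illustrative only: decide whether an xid found in the relation lies in
 * the range we currently believe to be valid, refreshing the upper bound
 * from shared memory when we see something beyond it.
 * Requires access/transam.h and storage/lwlock.h.
 */
static bool
xid_valid_in_rel(HeapCheckContext *ctx, TransactionId xid)
{
    if (!TransactionIdIsNormal(xid))
        return true;            /* special xids are handled elsewhere */

    /* The lower bound cannot move while we hold our lock on the relation. */
    if (TransactionIdPrecedes(xid, ctx->oldestValidXid))
        return false;

    /* Beyond the cached upper bound: refresh it before deciding. */
    if (TransactionIdFollowsOrEquals(xid, ctx->nextKnownValidXid))
    {
        FullTransactionId nextFullXid;

        LWLockAcquire(XidGenLock, LW_SHARED);
        nextFullXid = ShmemVariableCache->nextFullXid;
        LWLockRelease(XidGenLock);
        ctx->nextKnownValidXid = XidFromFullTransactionId(nextFullXid);
    }

    return TransactionIdPrecedes(xid, ctx->nextKnownValidXid);
}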
I think there are two very large patches here. One adds checking of heapam tables to amcheck, and the other adds a binary that eases calling amcheck from the command line. I think these should be two separate patches. I don't know what to think of a module contrib/pg_amcheck. I kinda lean towards fitting it in src/bin/scripts rather than as a contrib module. However, it seems a bit weird that it depends on a contrib module. Maybe amcheck should not be a contrib module at all but rather a new extension in src/extensions/ that is compiled and installed (in the filesystem, not in databases) by default. I strongly agree with hardening backend code so that all the crashes that Mark has found can be repaired. (We discussed this topic before[1]: we'd repair all crashes when run with production code, not all assertion crashes.) [1] https://postgr.es/m/20200513221051.GA26592@alvherre.pgsql -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On Jun 30, 2020, at 11:44 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: > > I think there are two very large patches here. One adds checking of > heapam tables to amcheck, and the other adds a binary that eases calling > amcheck from the command line. I think these should be two separate > patches. contrib/amcheck has pretty limited regression test coverage. I wrote pg_amcheck in large part because the infrastructure I was writing for testing contrib/amcheck was starting to look like a stand-alone tool, so I made it one. I can split contrib/pg_amcheck into a separate patch, but I would expect reviewers to use it to review contrib/amcheck. Say the word, and I'll resubmit as two separate patches. > I don't know what to think of a module contrib/pg_amcheck. I kinda lean > towards fitting it in src/bin/scripts rather than as a contrib module. > However, it seems a bit weird that it depends on a contrib module. Agreed. > Maybe amcheck should not be a contrib module at all but rather a new > extension in src/extensions/ that is compiled and installed (in the > filesystem, not in databases) by default. Fine with me, but I'll have to see what others think about that. > I strongly agree with hardening backend code so that all the crashes > that Mark has found can be repaired. (We discussed this topic > before[1]: we'd repair all crashes when run with production code, not > all assertion crashes.) I'm guessing that hardening the backend would be a separate patch? Or did you want that as part of this one? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2020-Jun-30, Mark Dilger wrote: > I'm guessing that hardening the backend would be a separate patch? Or > did you want that as part of this one? Lately, to me the foremost criterion to determine what is a separate patch and what isn't is the way the commit message is structured. If it looks too much like a bullet list of unrelated things, that suggests that the commit should be split into one commit per bullet point; of course, there are counterexamples. But when I have a commit message that says "I do A, and I also do B because I need it for A", then it makes more sense to do B first standalone and then A on top. OTOH if two things are done because they're heavily intermixed (e.g. commit 850196b610d2, bullet points galore), that suggests that one commit is a decent approach. Just my opinion, of course. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Jun 28, 2020 at 11:18 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > > > > On Jun 28, 2020, at 9:05 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > > > Some more comments on v9_0001. > > 1. > > + LWLockAcquire(XidGenLock, LW_SHARED); > > + nextFullXid = ShmemVariableCache->nextFullXid; > > + ctx.oldestValidXid = ShmemVariableCache->oldestXid; > > + LWLockRelease(XidGenLock); > > + ctx.nextKnownValidXid = XidFromFullTransactionId(nextFullXid); > > ... > > ... > > + > > + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++) > > + { > > + int32 mapbits; > > + OffsetNumber maxoff; > > + PageHeader ph; > > + > > + /* Optionally skip over all-frozen or all-visible blocks */ > > + if (skip_all_frozen || skip_all_visible) > > + { > > + mapbits = (int32) visibilitymap_get_status(ctx.rel, ctx.blkno, > > + &vmbuffer); > > + if (skip_all_visible && (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0) > > + continue; > > + if (skip_all_frozen && (mapbits & VISIBILITYMAP_ALL_FROZEN) != 0) > > + continue; > > + } > > + > > + /* Read and lock the next page. */ > > + ctx.buffer = ReadBufferExtended(ctx.rel, MAIN_FORKNUM, ctx.blkno, > > + RBM_NORMAL, ctx.bstrategy); > > + LockBuffer(ctx.buffer, BUFFER_LOCK_SHARE); > > > > I might be missing something, but it appears that first we are getting > > the nextFullXid and after that, we are scanning the block by block. > > So while we are scanning the block if the nextXid is advanced and it > > has updated some tuple in the heap pages, then it seems the current > > logic will complain about out of range xid. I did not test this > > behavior so please point me to the logic which is protecting this. > > We know the oldest valid Xid cannot advance, because we hold a lock that would prevent it from doing so. We cannot knowthat the newest Xid will not advance, but when we see an Xid beyond the end of the known valid range, we check its validity,and either report it as a corruption or advance our idea of the newest valid Xid, depending on that check. Thatlogic is in TransactionIdValidInRel. That makes sense to me. > > > 2. > > /* > > * Helper function to construct the TupleDesc needed by verify_heapam. > > */ > > static TupleDesc > > verify_heapam_tupdesc(void) > > > > From function name, it appeared that it is verifying tuple descriptor > > but this is just creating the tuple descriptor. > > In amcheck--1.2--1.3.sql we define a function named verify_heapam which returns a set of records. This is the tuple descriptorfor that function. I understand that the name can be parsed as verify_(heapam_tupdesc), but it is meant as (verify_heapam)_tupdesc. Do you have a name you would prefer? Not very particular, but if we have a name like verify_heapam_get_tupdesc, But, just a suggestion so it's your choice if you prefer the current name I have no objection. > > > 3. > > + /* Optionally skip over all-frozen or all-visible blocks */ > > + if (skip_all_frozen || skip_all_visible) > > + { > > + mapbits = (int32) visibilitymap_get_status(ctx.rel, ctx.blkno, > > + &vmbuffer); > > + if (skip_all_visible && (mapbits & VISIBILITYMAP_ALL_VISIBLE) != 0) > > + continue; > > + if (skip_all_frozen && (mapbits & VISIBILITYMAP_ALL_FROZEN) != 0) > > + continue; > > + } > > > > Here, do we want to test that in VM the all visible bit is set whereas > > on the page it is not set? That can lead to a wrong result in an > > index-only scan. 
> > If the caller has specified that the corruption check should skip over all-frozen or all-visible data, then we cannot loadthe page that the VM claims is all-frozen or all-visible without defeating the purpose of the caller having specifiedthese options. Without loading the page, we cannot check the page's header bits. > > When not skipping all-visible or all-frozen blocks, we might like to pin both the heap page and the visibility map pagein order to compare the two, being careful not to hold a pin on the one while performing I/O on the other. See for examplethe logic in heap_delete(). But I'm not sure what guarantees the system makes about agreement between these two bits. Certainly, the VM should not claim a page is all visible when it isn't, but are we guaranteed that a page that is all-visiblewill always have its all-visible bit set? I don't know if (possibly transient) disagreement between these twobits constitutes corruption. Perhaps others following this thread can advise? Right, the VM should not claim its all visible when it actually not. But, IIRC, it is not guaranteed that if the page is all visible then the VM must set the all visible flag. > > 4. One cosmetic comment > > > > + /* Skip non-varlena values, but update offset first */ > > .. > > + > > + /* Ok, we're looking at a varlena attribute. */ > > > > Throughout the patch, I have noticed that some of your single-line > > comments have "full stop" whereas other don't. Can we keep them > > consistent? > > I try to use a "full stop" at the end of sentences, but not at the end of sentence fragments. To me, a "full stop" meansthat a sentence has reached its conclusion. I don't intentionally use one at the end of a fragment, unless the fragmentprecedes a full sentence, in which case the "full stop" is needed to separate the two. Of course, I may have violatedmy own rule in a few places, but before I submit a v10 patch with comment punctuation changes, perhaps we can agreeon what the rule is? (This has probably been discussed before and agreed before. A link to the appropriate email threadwould be sufficient.) I can see in different files we have followed different rules. I am fine as far as those are consistent across the file. > For example: > > /* red, green, or blue */ > /* set to pink */ > /* set to blue. We have not closed the file. */ > /* At this point, we have chosen the color. */ > > The first comment is not a sentence, but the fourth is. The third comment is a fragment followed by a full sentence, anda "full stop" separates the two. As for the second comment, as I recall, verb phrases can be interpreted as a full sentence,as in "Close the door!", when they are meant as commands to the listener, but not otherwise. "set to pink" is nota command to the reader, but rather a description of what the code is doing at that point, so I think of it as a mereverb phrase and not a full sentence. > Making matters even more complicated, portions of the logic in verify_heapam were taken from sections of code that wouldereport(), elog(), or Assert() on corruption, and when I took such code, I sometimes also took the comments in unmodifiedform. That means that my normal commenting rules don't apply, as I'm not the comment author in such cases. I agree. A few more comments. 1. + if (!VARATT_IS_EXTERNAL_ONDISK(attr)) + { + confess(ctx, + pstrdup("attribute is external but not marked as on disk")); + return true; + } + .... 
+ + /* + * Must dereference indirect toast pointers before we can check them + */ + if (VARATT_IS_EXTERNAL_INDIRECT(attr)) + { So first we are checking that if the varatt is not VARATT_IS_EXTERNAL_ONDISK then we are returning, but just a few statements down we are checking if the varatt is VARATT_IS_EXTERNAL_INDIRECT, so seems like unreachable code. 2. Another point related to the same code is that toast_save_datum always set the VARTAG_ONDISK tag. IIUC, we use VARTAG_INDIRECT in reorderbuffer for generating temp tuple so ideally while scanning the heap we should never get VARATT_IS_EXTERNAL_INDIRECT tuple. Am I missing something here? 3. + if (VARATT_IS_1B_E(tp + ctx->offset)) + { + uint8 va_tag = va_tag = VARTAG_EXTERNAL(tp + ctx->offset); + + if (va_tag != VARTAG_ONDISK) + { + confess(ctx, psprintf("unexpected TOAST vartag %u for " + "attribute #%u at t_hoff = %u, " + "offset = %u", + va_tag, ctx->attnum, + ctx->tuphdr->t_hoff, ctx->offset)); + return false; /* We can't know where the next attribute + * begins */ + } + } + /* Skip values that are not external */ + if (!VARATT_IS_EXTERNAL(attr)) + return true; + + /* It is external, and we're looking at a page on disk */ + if (!VARATT_IS_EXTERNAL_ONDISK(attr)) + { + confess(ctx, + pstrdup("attribute is external but not marked as on disk")); + return true; + } First, we are checking that if VARATT_IS_1B_E and if so we will check whether its tag is VARTAG_ONDISK or not. But just after that, we will get the actual attribute pointer and Again check the same thing with 2 different checks. Can you explain why this is necessary? 4. + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY) && + (ctx->tuphdr->t_infomask2 & HEAP_KEYS_UPDATED)) + { + confess(ctx, + psprintf("HEAP_XMAX_LOCK_ONLY and HEAP_KEYS_UPDATED both set")); + } + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) && + (ctx->tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)) + { + confess(ctx, + psprintf("HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI both set")); + } Maybe we can further expand these checks, like if the tuple is HEAP_XMAX_LOCK_ONLY then HEAP_UPDATED or HEAP_HOT_UPDATED should not be set. -- Regards, Dilip Kumar EnterpriseDB: http://www.enterprisedb.com
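(To make the infomask cross-checks being discussed here concrete, the following is a minimal sketch in the spirit of the quoted verify_heapam() code. The helper name and the return-a-string reporting are invented for illustration and are not part of the patch, which reports through its confess() machinery; whether the additional HEAP_XMAX_LOCK_ONLY combinations suggested above can be treated as corruption is taken up downthread.)

#include "postgres.h"
#include "access/htup_details.h"

/*
 * Hypothetical helper: return a description of the first inconsistent
 * combination of t_infomask/t_infomask2 bits named in this thread, or NULL
 * if neither combination is present.
 */
static const char *
infomask_inconsistency(uint16 infomask, uint16 infomask2)
{
	/* a locker-only xmax should not also claim to have updated key columns */
	if ((infomask & HEAP_XMAX_LOCK_ONLY) &&
		(infomask2 & HEAP_KEYS_UPDATED))
		return "HEAP_XMAX_LOCK_ONLY and HEAP_KEYS_UPDATED both set";

	/* a multixact xmax cannot carry the committed hint bit */
	if ((infomask & HEAP_XMAX_COMMITTED) &&
		(infomask & HEAP_XMAX_IS_MULTI))
		return "HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI both set";

	return NULL;
}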
> On Jul 4, 2020, at 6:04 AM, Dilip Kumar <dilipbalaut@gmail.com> wrote: > > A few more comments. Your comments all pertain to function check_tuple_attribute(), which follows the logic of heap_deform_tuple() and detoast_external_attr(). The idea is that any error that could result in an assertion or crash in those functions should be checked carefully by check_tuple_attribute(), and checked *before* any such asserts or crashes might be triggered. I obviously did not explain this thinking in the function comment. That is rectified in the v10 patch, attached. > 1. > > + if (!VARATT_IS_EXTERNAL_ONDISK(attr)) > + { > + confess(ctx, > + pstrdup("attribute is external but not marked as on disk")); > + return true; > + } > + > .... > + > + /* > + * Must dereference indirect toast pointers before we can check them > + */ > + if (VARATT_IS_EXTERNAL_INDIRECT(attr)) > + { > > > So first we are checking that if the varatt is not > VARATT_IS_EXTERNAL_ONDISK then we are returning, but just a > few statements down we are checking if the varatt is > VARATT_IS_EXTERNAL_INDIRECT, so seems like unreachable code. True. I've removed the VARATT_IS_EXTERNAL_INDIRECT check. > 2. Another point related to the same code is that toast_save_datum > always set the VARTAG_ONDISK tag. IIUC, we use > VARTAG_INDIRECT in reorderbuffer for generating temp tuple so ideally > while scanning the heap we should never get > VARATT_IS_EXTERNAL_INDIRECT tuple. Am I missing something here? I think you are right that we cannot get a VARATT_IS_EXTERNAL_INDIRECT tuple. That check is removed in v10. > 3. > + if (VARATT_IS_1B_E(tp + ctx->offset)) > + { > + uint8 va_tag = va_tag = VARTAG_EXTERNAL(tp + ctx->offset); > + > + if (va_tag != VARTAG_ONDISK) > + { > + confess(ctx, psprintf("unexpected TOAST vartag %u for " > + "attribute #%u at t_hoff = %u, " > + "offset = %u", > + va_tag, ctx->attnum, > + ctx->tuphdr->t_hoff, ctx->offset)); > + return false; /* We can't know where the next attribute > + * begins */ > + } > + } > > + /* Skip values that are not external */ > + if (!VARATT_IS_EXTERNAL(attr)) > + return true; > + > + /* It is external, and we're looking at a page on disk */ > + if (!VARATT_IS_EXTERNAL_ONDISK(attr)) > + { > + confess(ctx, > + pstrdup("attribute is external but not marked as on disk")); > + return true; > + } > > First, we are checking that if VARATT_IS_1B_E and if so we will check > whether its tag is VARTAG_ONDISK or not. But just after that, we will > get the actual attribute pointer and > Again check the same thing with 2 different checks. Can you explain > why this is necessary? The code that calls check_tuple_attribute() expects it to check the current attribute, but also to safely advance the ctx->offset value to the next attribute, as the caller is iterating over all attributes. The first check verifies that it is safe to call att_addlength_pointer, as we must not call att_addlength_pointer on a corrupt datum. The second check simply returns on non-external attributes; having advanced ctx->offset, there is nothing left to do. The third check is validating the external attribute, now that we know that it is external. You are right that the third check cannot fail, as the first check would already have confess()ed and returned false. The third check is removed in v10, attached. > 4.
> + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY) && > + (ctx->tuphdr->t_infomask2 & HEAP_KEYS_UPDATED)) > + { > + confess(ctx, > + psprintf("HEAP_XMAX_LOCK_ONLY and HEAP_KEYS_UPDATED both set")); > + } > + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) && > + (ctx->tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)) > + { > + confess(ctx, > + psprintf("HEAP_XMAX_COMMITTED and HEAP_XMAX_IS_MULTI both set")); > + } > > Maybe we can further expand these checks, like if the tuple is > HEAP_XMAX_LOCK_ONLY then HEAP_UPDATED or HEAP_HOT_UPDATED should not > be set. Adding Asserts in src/backend/access/heap/hio.c against those two conditions, the regression tests fail in quite a lot ofplaces where HEAP_XMAX_LOCK_ONLY and HEAP_UPDATED are both true. I'm leaving this idea out for v10, since it doesn't work,but in case you want to tell me what I did wrong, here are the changed I made on top of v10: diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c index 00de10b7c9..76d23e141a 100644 --- a/src/backend/access/heap/hio.c +++ b/src/backend/access/heap/hio.c @@ -57,6 +57,10 @@ RelationPutHeapTuple(Relation relation, (tuple->t_data->t_infomask2 & HEAP_KEYS_UPDATED))); Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_COMMITTED) && (tuple->t_data->t_infomask & HEAP_XMAX_IS_MULTI))); + Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) && + (tuple->t_data->t_infomask & HEAP_UPDATED))); + Assert(!((tuple->t_data->t_infomask & HEAP_XMAX_LOCK_ONLY) && + (tuple->t_data->t_infomask2 & HEAP_HOT_UPDATED))); diff --git a/contrib/amcheck/verify_heapam.c b/contrib/amcheck/verify_heapam.c index 49d3d5618a..60e4ad5be0 100644 --- a/contrib/amcheck/verify_heapam.c +++ b/contrib/amcheck/verify_heapam.c @@ -969,12 +969,19 @@ check_tuple(HeapCheckContext * ctx) ctx->tuphdr->t_hoff)); fatal = true; } - if ((ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY) && - (ctx->tuphdr->t_infomask2 & HEAP_KEYS_UPDATED)) + if (ctx->tuphdr->t_infomask & HEAP_XMAX_LOCK_ONLY) { - confess(ctx, - psprintf("HEAP_XMAX_LOCK_ONLY and HEAP_KEYS_UPDATED both set")); + if (ctx->tuphdr->t_infomask2 & HEAP_KEYS_UPDATED) + confess(ctx, + psprintf("HEAP_XMAX_LOCK_ONLY and HEAP_KEYS_UPDATED both set")); + if (ctx->tuphdr->t_infomask & HEAP_UPDATED) + confess(ctx, + psprintf("HEAP_XMAX_LOCK_ONLY and HEAP_UPDATED both set")); + if (ctx->tuphdr->t_infomask2 & HEAP_HOT_UPDATED) + confess(ctx, + psprintf("HEAP_XMAX_LOCK_ONLY and HEAP_HOT_UPDATED both set")); } + if ((ctx->tuphdr->t_infomask & HEAP_XMAX_COMMITTED) && (ctx->tuphdr->t_infomask & HEAP_XMAX_IS_MULTI)) { The v10 patch without these ideas is here: — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
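(As an aside on the ordering Mark describes above: the point of the first check is to validate the external tag before att_addlength_pointer() is trusted to read the varlena header. A simplified, hypothetical fragment, ignoring alignment and the null bitmap, might look like the following; the function name and signature are invented for illustration and are not the patch's code.)

#include "postgres.h"
#include "access/tupmacs.h"

/*
 * Sketch only: for a varlena attribute, verify the external tag before
 * att_addlength_pointer() is allowed to read the varlena header to find
 * where the next attribute begins.
 */
static bool
advance_past_attribute(char *tupdata, uint32 *offset, int16 attlen)
{
	char	   *attptr = tupdata + *offset;

	if (attlen == -1 && VARATT_IS_1B_E(attptr) &&
		VARTAG_EXTERNAL(attptr) != VARTAG_ONDISK)
		return false;			/* cannot know where the next attribute begins */

	*offset = att_addlength_pointer(*offset, attlen, attptr);
	return true;
}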
On Mon, Jul 6, 2020 at 2:06 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > The v10 patch without these ideas is here: Along the lines of what Alvaro was saying before, I think this definitely needs to be split up into a series of patches. The commit message for v10 describes it doing three pretty separate things, and I think that argues for splitting it into a series of three patches. I'd argue for this ordering: 0001 Refactoring existing amcheck btree checking functions to optionally return corruption information rather than ereport'ing it. This is used by the new pg_amcheck command line tool for reporting back to the caller. 0002 Adding new function verify_heapam for checking a heap relation and associated toast relation, if any, to contrib/amcheck. 0003 Adding new contrib module pg_amcheck, which is a command line interface for running amcheck's verifications against tables and indexes. It's too hard to review things like this when it's all mixed together. +++ b/contrib/amcheck/t/skipping.pl The name of this file is inconsistent with the tree's usual convention, which is all stuff like 001_whatever.pl, except for src/test/modules/brin, which randomly decided to use two digits instead of three. There's no precedent for a test file with no leading numeric digits. Also, what does "skipping" even have to do with what the test is checking? Maybe it's intended to refer to the new error handling "skipping" the actual error in favor of just reporting it without stopping, but that's not really what the word "skipping" normally means. Finally, it seems a bit over-engineered: do we really need 183 test cases to check that detecting a problem doesn't lead to an abort? Like, if that's the purpose of the test, I'd expect it to check one corrupt relation and one non-corrupt relation, each with and without the no-error behavior. And that's about it. Or maybe it's talking about skipping pages during the checks, because those pages are all-visible or all-frozen? It's not very clear to me what's going on here. + TransactionId nextKnownValidXid; + TransactionId oldestValidXid; Please add explanatory comments indicating what these are intended to mean. For most of the the structure members, the brief comments already present seem sufficient; but here, more explanation looks necessary and less is provided. The "Values for returning tuples" could possibly also use some more detail. +#define HEAPCHECK_RELATION_COLS 8 I think this should really be at the top of the file someplace. Sometimes people have adopted this style when the #define is only used within the function that contains it, but that's not the case here. + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("unrecognized parameter for 'skip': %s", skip), + errhint("please choose from 'all visible', 'all frozen', " + "or NULL"))); I think it would be better if we had three string values selecting the different behaviors, and made the parameter NOT NULL but with a default. It seems like that would be easier to understand. Right now, I can tell that my options for what to skip are "all visible", "all frozen", and, uh, some other thing that I don't know what it is. I'm gonna guess the third option is to skip nothing, but it seems best to make that explicit. Also, should we maybe consider spelling this 'all-visible' and 'all-frozen' with dashes, instead of using spaces? Spaces in an option value seems a little icky to me somehow. + int64 startblock = -1; + int64 endblock = -1; ... 
+ if (!PG_ARGISNULL(3)) + startblock = PG_GETARG_INT64(3); + if (!PG_ARGISNULL(4)) + endblock = PG_GETARG_INT64(4); ... + if (startblock < 0) + startblock = 0; + if (endblock < 0 || endblock > ctx.nblocks) + endblock = ctx.nblocks; + + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++) So, the user can specify a negative value explicitly and it will be treated as the default, and an endblock value that's larger than the relation size will be treated as the relation size. The way pg_prewarm does the corresponding checks seems superior: null indicates the default value, and any non-null value must be within range or you get an error. Also, you seem to be treating endblock as the first block that should not be checked, whereas pg_prewarm takes what seems to me to be the more natural interpretation: the end block is the last block that IS checked. If you do it this way, then someone who specifies the same start and end block will check no blocks -- silently, I think. + if (skip_all_frozen || skip_all_visible) Since you can't skip all frozen without skipping all visible, this test could be simplified. Or you could introduce a three-valued enum and test that skip_pages != SKIP_PAGES_NONE, which might be even better. + /* We must unlock the page from the prior iteration, if any */ + Assert(ctx.blkno == InvalidBlockNumber || ctx.buffer != InvalidBuffer); I don't understand this assertion, and I don't understand the comment, either. I think ctx.blkno can never be equal to InvalidBlockNumber because we never set it to anything outside the range of 0..(endblock - 1), and I think ctx.buffer must always be unequal to InvalidBuffer because we just initialized it by calling ReadBufferExtended(). So I think this assertion would still pass if we wrote && rather than ||. But even then, I don't know what that has to do with the comment or why it even makes sense to have an assertion for that in the first place. + /* + * Open the relation. We use ShareUpdateExclusive to prevent concurrent + * vacuums from changing the relfrozenxid, relminmxid, or advancing the + * global oldestXid to be newer than those. This protection saves us from + * having to reacquire the locks and recheck those minimums for every + * tuple, which would be expensive. + */ + ctx.rel = relation_open(relid, ShareUpdateExclusiveLock); I don't think we'd need to recheck for every tuple, would we? Just for cases where there's an apparent violation of the rules. I guess that could still be expensive if there's a lot of them, but needing ShareUpdateExclusiveLock rather than only AccessShareLock is a little unfortunate. It's also unclear to me why this concerns itself with relfrozenxid and the cluster-wide oldestXid value but not with datfrozenxid. It seems like if we're going to sanity-check the relfrozenxid against the cluster-wide value, we ought to also check it against the database-wide value. Checking neither would also seem like a plausible choice. But it seems very strange to only check against the cluster-wide value. + StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber, + "InvalidOffsetNumber increments to FirstOffsetNumber"); If you are going to rely on this property, I agree that it is good to check it. But it would be better to NOT rely on this property, and I suspect the code can be written quite cleanly without relying on it. And actually, that's what you did, because you first set ctx.offnum = InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in the loop initializer. 
So AFAICS the first initializer, and the static assert, are pointless. + if (ItemIdIsRedirected(ctx.itemid)) + { + uint16 redirect = ItemIdGetRedirect(ctx.itemid); + if (redirect <= SizeOfPageHeaderData || redirect >= ph->pd_lower) ... + if ((redirect - SizeOfPageHeaderData) % sizeof(uint16)) I think that ItemIdGetRedirect() returns an offset, not a byte position. So the expectation that I would have is that it would be any integer >= 0 and <= maxoff. Am I confused? BTW, it seems like it might be good to complain if the item to which it points is LP_UNUSED... AFAIK that shouldn't happen. + errmsg("\"%s\" is not a heap AM", I think the correct wording would be just "is not a heap." The "heap AM" is the thing in pg_am, not a specific table. +confess(HeapCheckContext * ctx, char *msg) +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx) +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx) This is what happens when you pgindent without adding all the right things to typedefs.list first ... or when you don't pgindent and have odd ideas about how to indent things. + /* + * In principle, there is nothing to prevent a scan over a large, highly + * corrupted table from using workmem worth of memory building up the + * tuplestore. Don't leak the msg argument memory. + */ + pfree(msg); Maybe change the second sentence to something like: "That should be OK, else the user can lower work_mem, but we'd better not leak any additional memory." +/* + * check_tuphdr_xids + * + * Determine whether tuples are visible for verification. Similar to + * HeapTupleSatisfiesVacuum, but with critical differences. + * + * 1) Does not touch hint bits. It seems imprudent to write hint bits + * to a table during a corruption check. + * 2) Only makes a boolean determination of whether verification should + * see the tuple, rather than doing extra work for vacuum-related + * categorization. + * + * The caller should already have checked that xmin and xmax are not out of + * bounds for the relation. + */ First, check_tuphdr_xids() doesn't seem like a very good name. If you have a function with that name and, like this one, it returns Boolean, what does true mean? What does false mean? Kinda hard to tell. And also, check the tuple header XIDs *for what*? If you called it, say, tuple_is_visible(), that would be self-evident. Second, consider that we hold at least AccessShareLock on the relation - actually, ATM we hold ShareUpdateExclusiveLock. Either way, there cannot be a concurrent modification to the tuple descriptor in progress. Therefore, I think that only a HEAPTUPLE_DEAD tuple is potentially using a non-current schema. If the tuple is HEAPTUPLE_INSERT_IN_PROGRESS, there's either no ADD COLUMN in the inserting transaction, or that transaction committed before we got our lock. Similarly if it's HEAPTUPLE_DELETE_IN_PROGRESS or HEAPTUPLE_RECENTLY_DEAD, the original inserter must've committed before we got our lock. Or if it's both inserted and deleted in the same transaction, say, then that transaction committed before we got our lock or else contains no relevant DDL. IOW, I think you can check everything but dead tuples here. Capitalization and punctuation for messages complaining about problems need to be consistent. verify_heapam() has "Invalid redirect line pointer offset %u out of bounds" which starts with a capital letter, but check_tuphdr_xids() has "heap tuple with XMAX_IS_MULTI is neither LOCKED_ONLY nor has a valid xmax" which does not. 
I vote for lower case, but in any event it should be the same. Also, check_tuphdr_xids() has "tuple xvac = %u invalid" which is either a debugging leftover or a very unclear complaint. I think some real work needs to be put into the phrasing of these messages so that it's more clear exactly what is going on and why it's bad. For example the first example in this paragraph is clearly a problem of some kind, but it's not very clear exactly what is happening: is %u the offset of the invalid line redirect or the value to which it points? I don't think the phrasing is very grammatical, which makes it hard to tell which is meant, and I actually think it would be a good idea to include both things. Project policy is generally against splitting a string across multiple lines to fit within 80 characters. We like to fit within 80 characters, but we like to be able to grep for strings more, and breaking them up like this makes that harder. + confess(ctx, + pstrdup("corrupt toast chunk va_header")); This is another message that I don't think is very clear. There's two elements to that. One is that the phrasing is not very good, and the other is that there are no % escapes. What's somebody going to do when they see this message? First, they're probably going to have to look at the code to figure out in which circumstances it gets generated; that's a sign that the message isn't phrased clearly enough. That will tell them that an unexpected bit pattern has been found, but not what that unexpected bit pattern actually was. So then, they're going to have to try to find the relevant va_header by some other means and fish out the relevant bit so that they can see what actually went wrong. + * Checks the current attribute as tracked in ctx for corruption. Records + * any corruption found in ctx->corruption. + * + * Extra blank line. + Form_pg_attribute thisatt = TupleDescAttr(RelationGetDescr(ctx->rel), + ctx->attnum); Maybe you could avoid the line wrap by declaring this without initializing it, and then initializing it as a separate statement. + confess(ctx, psprintf("t_hoff + offset > lp_len (%u + %u > %u)", + ctx->tuphdr->t_hoff, ctx->offset, + ctx->lp_len)); Uggh! This isn't even remotely an English sentence. I don't think formulas are the way to go here, but I like the idea of formulas in some places and written-out messages in others even less. I guess the complaint here in English is something like "tuple attribute %d should start at offset %u, but tuple length is only %u" or something of that sort. Also, it seems like this complaint really ought to have been reported on the *preceding* loop iteration, either complaining that (1) the fixed length attribute is more than the number of remaining bytes in the tuple or (2) the varlena header for the tuple specifies an excessively high length. It seems like you're blaming the wrong attribute for the problem. BTW, the header comments for this function (check_tuple_attribute) neglect to document the meaning of the return value. + confess(ctx, psprintf("tuple xmax = %u precedes relation " + "relfrozenxid = %u", This is another example of these messages needing work. The corresponding message from heap_prepare_freeze_tuple() is "found update xid %u from before relfrozenxid %u". That's better, because we don't normally include equals signs in our messages like this, and also because "relation relfrozenxid" is redundant. I think this should say something like "tuple xmax %u precedes relfrozenxid %u". 
+ confess(ctx, psprintf("tuple xmax = %u is in the future", + xmax)); And then this could be something like "tuple xmax %u follows last-assigned xid %u". That would be more symmetric and more informative. + if (SizeofHeapTupleHeader + BITMAPLEN(ctx->natts) > ctx->tuphdr->t_hoff) I think we should be able to predict the exact value of t_hoff and complain if it isn't precisely equal to the expected value. Or is that not possible for some reason? Is there some place that's checking that lp_len >= SizeOfHeapTupleHeader before check_tuple() goes and starts poking into the header? If not, there should be. +$node->command_ok( + [ + 'pg_amcheck', '-p', $port, 'postgres' + ], + 'pg_amcheck all schemas and tables implicitly'); + +$node->command_ok( + [ + 'pg_amcheck', '-i', '-p', $port, 'postgres' + ], + 'pg_amcheck all schemas, tables and indexes'); I haven't really looked through the btree-checking and pg_amcheck parts of this much yet, but this caught my eye. Why would the default be to check tables but not indexes? I think the default ought to be to check everything we know how to check. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, May 14, 2020 at 03:50:52PM -0400, Tom Lane wrote: > I think there's definitely value in corrupting data in some predictable > (reproducible) way and verifying that the check code catches it and > responds as expected. Sure, this will not be 100% coverage, but it'll be > a lot better than 0% coverage. Skimming quickly through the patch, that's what is done in a way similar to pg_checksums's 002_actions.pl. So it seems fine to me to use something like that for some basic coverage. We may want to refactor the test APIs to unify all that though. -- Michael
Attachment
> On Jul 16, 2020, at 12:38 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jul 6, 2020 at 2:06 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: >> The v10 patch without these ideas is here: > > Along the lines of what Alvaro was saying before, I think this > definitely needs to be split up into a series of patches. The commit > message for v10 describes it doing three pretty separate things, and I > think that argues for splitting it into a series of three patches. I'd > argue for this ordering: > > 0001 Refactoring existing amcheck btree checking functions to optionally > return corruption information rather than ereport'ing it. This is > used by the new pg_amcheck command line tool for reporting back to > the caller. > > 0002 Adding new function verify_heapam for checking a heap relation and > associated toast relation, if any, to contrib/amcheck. > > 0003 Adding new contrib module pg_amcheck, which is a command line > interface for running amcheck's verifications against tables and > indexes. > > It's too hard to review things like this when it's all mixed together. The v11 patch series is broken up as you suggest. > +++ b/contrib/amcheck/t/skipping.pl > > The name of this file is inconsistent with the tree's usual > convention, which is all stuff like 001_whatever.pl, except for > src/test/modules/brin, which randomly decided to use two digits > instead of three. There's no precedent for a test file with no leading > numeric digits. Also, what does "skipping" even have to do with what > the test is checking? Maybe it's intended to refer to the new error > handling "skipping" the actual error in favor of just reporting it > without stopping, but that's not really what the word "skipping" > normally means. Finally, it seems a bit over-engineered: do we really > need 183 test cases to check that detecting a problem doesn't lead to > an abort? Like, if that's the purpose of the test, I'd expect it to > check one corrupt relation and one non-corrupt relation, each with and > without the no-error behavior. And that's about it. Or maybe it's > talking about skipping pages during the checks, because those pages > are all-visible or all-frozen? It's not very clear to me what's going > on here. The "skipping" did originally refer to testing verify_heapam()'s option to skip all-visible or all-frozen blocks. I haverenamed it 001_verify_heapam.pl, since it tests that function. > > + TransactionId nextKnownValidXid; > + TransactionId oldestValidXid; > > Please add explanatory comments indicating what these are intended to > mean. Done. > For most of the the structure members, the brief comments > already present seem sufficient; but here, more explanation looks > necessary and less is provided. The "Values for returning tuples" > could possibly also use some more detail. Ok, I've expanded the comments for these. > +#define HEAPCHECK_RELATION_COLS 8 > > I think this should really be at the top of the file someplace. > Sometimes people have adopted this style when the #define is only used > within the function that contains it, but that's not the case here. Done. > > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("unrecognized parameter for 'skip': %s", skip), > + errhint("please choose from 'all visible', 'all frozen', " > + "or NULL"))); > > I think it would be better if we had three string values selecting the > different behaviors, and made the parameter NOT NULL but with a > default. It seems like that would be easier to understand. 
Right now, > I can tell that my options for what to skip are "all visible", "all > frozen", and, uh, some other thing that I don't know what it is. I'm > gonna guess the third option is to skip nothing, but it seems best to > make that explicit. Also, should we maybe consider spelling this > 'all-visible' and 'all-frozen' with dashes, instead of using spaces? > Spaces in an option value seems a little icky to me somehow. I've made the options 'all-visible', 'all-frozen', and 'none'. It defaults to 'none'. I did not mark the function as strict,as I think NULL is a reasonable value (and the default) for startblock and endblock. > + int64 startblock = -1; > + int64 endblock = -1; > ... > + if (!PG_ARGISNULL(3)) > + startblock = PG_GETARG_INT64(3); > + if (!PG_ARGISNULL(4)) > + endblock = PG_GETARG_INT64(4); > ... > + if (startblock < 0) > + startblock = 0; > + if (endblock < 0 || endblock > ctx.nblocks) > + endblock = ctx.nblocks; > + > + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++) > > So, the user can specify a negative value explicitly and it will be > treated as the default, and an endblock value that's larger than the > relation size will be treated as the relation size. The way pg_prewarm > does the corresponding checks seems superior: null indicates the > default value, and any non-null value must be within range or you get > an error. Also, you seem to be treating endblock as the first block > that should not be checked, whereas pg_prewarm takes what seems to me > to be the more natural interpretation: the end block is the last block > that IS checked. If you do it this way, then someone who specifies the > same start and end block will check no blocks -- silently, I think. Under that regime, for relations with one block of data, (startblock=0, endblock=0) means "check the zero'th block", andfor relations with no blocks of data, specifying any non-null (startblock,endblock) pair raises an exception. I don'tlike that too much, but I'm happy to defer to precedent. Since you say pg_prewarm works this way (I did not check),I have changed verify_heapam to do likewise. > + if (skip_all_frozen || skip_all_visible) > > Since you can't skip all frozen without skipping all visible, this > test could be simplified. Or you could introduce a three-valued enum > and test that skip_pages != SKIP_PAGES_NONE, which might be even > better. It works now with a three-valued enum. > + /* We must unlock the page from the prior iteration, if any */ > + Assert(ctx.blkno == InvalidBlockNumber || ctx.buffer != InvalidBuffer); > > I don't understand this assertion, and I don't understand the comment, > either. I think ctx.blkno can never be equal to InvalidBlockNumber > because we never set it to anything outside the range of 0..(endblock > - 1), and I think ctx.buffer must always be unequal to InvalidBuffer > because we just initialized it by calling ReadBufferExtended(). So I > think this assertion would still pass if we wrote && rather than ||. > But even then, I don't know what that has to do with the comment or > why it even makes sense to have an assertion for that in the first > place. Yes, it is vestigial. Removed. > + /* > + * Open the relation. We use ShareUpdateExclusive to prevent concurrent > + * vacuums from changing the relfrozenxid, relminmxid, or advancing the > + * global oldestXid to be newer than those. This protection > saves us from > + * having to reacquire the locks and recheck those minimums for every > + * tuple, which would be expensive. 
> + */ > + ctx.rel = relation_open(relid, ShareUpdateExclusiveLock); > > I don't think we'd need to recheck for every tuple, would we? Just for > cases where there's an apparent violation of the rules. It's a bit fuzzy what an "apparent violation" might be if both ends of the range of valid xids may be moving, and arbitrarily much. It's also not clear how often to recheck, since you'd be dealing with a race condition no matter how often you check. Perhaps the comments shouldn't mention how often you'd have to recheck, since there is no really defensible choice for that. I removed the offending sentence. > I guess that > could still be expensive if there's a lot of them, but needing > ShareUpdateExclusiveLock rather than only AccessShareLock is a little > unfortunate. I welcome strategies that would allow for taking a lesser lock. > It's also unclear to me why this concerns itself with relfrozenxid and > the cluster-wide oldestXid value but not with datfrozenxid. It seems > like if we're going to sanity-check the relfrozenxid against the > cluster-wide value, we ought to also check it against the > database-wide value. Checking neither would also seem like a plausible > choice. But it seems very strange to only check against the > cluster-wide value. If the relation has a normal relfrozenxid, then the oldest valid xid we can encounter in the table is relfrozenxid. Otherwise, each row needs to be compared against some other minimum xid value. Logically, that other minimum xid value should be the oldest valid xid for the database, which must logically be at least as old as any valid row in the table and no older than the oldest valid xid for the cluster. Unfortunately, if the comments in commands/vacuum.c circa line 1572 can be believed, and if I am reading them correctly, the stored value for the oldest valid xid in the database has been known to be corrupted by bugs in pg_upgrade. This is awful. If I compare the xid of a row in a table against the oldest xid value for the database, and the xid of the row is older, what can I do? I don't have a principled basis for determining which one of them is wrong. The logic in verify_heapam is conservative; it makes no guarantees about finding and reporting all corruption, but if it does report a row as corrupt, you can bank on that, bugs in verify_heapam itself notwithstanding. I think this is a good choice; a tool with only false negatives is much more useful than one with both false positives and false negatives. I have added a comment about my reasoning to verify_heapam.c. I'm happy to be convinced of a better strategy for handling this situation. > > + StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber, > + "InvalidOffsetNumber > increments to FirstOffsetNumber"); > > If you are going to rely on this property, I agree that it is good to > check it. But it would be better to NOT rely on this property, and I > suspect the code can be written quite cleanly without relying on it. > And actually, that's what you did, because you first set ctx.offnum = > InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in > the loop initializer. So AFAICS the first initializer, and the static > assert, are pointless. Ah, right you are. Removed. > > + if (ItemIdIsRedirected(ctx.itemid)) > + { > + uint16 redirect = ItemIdGetRedirect(ctx.itemid); > + if (redirect <= SizeOfPageHeaderData > || redirect >= ph->pd_lower) > ...
> + if ((redirect - SizeOfPageHeaderData) > % sizeof(uint16)) > > I think that ItemIdGetRedirect() returns an offset, not a byte > position. So the expectation that I would have is that it would be any > integer >= 0 and <= maxoff. Am I confused? I think you are right about it returning an offset, which should be between FirstOffsetNumber and maxoff, inclusive. I haveupdated the checks. > BTW, it seems like it might > be good to complain if the item to which it points is LP_UNUSED... > AFAIK that shouldn't happen. Thanks for mentioning that. It now checks for that. > + errmsg("\"%s\" is not a heap AM", > > I think the correct wording would be just "is not a heap." The "heap > AM" is the thing in pg_am, not a specific table. Fixed. > +confess(HeapCheckContext * ctx, char *msg) > +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx) > +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx) > > This is what happens when you pgindent without adding all the right > things to typedefs.list first ... or when you don't pgindent and have > odd ideas about how to indent things. Hmm. I don't see the three lines of code you are quoting. Which patch is that from? > > + /* > + * In principle, there is nothing to prevent a scan over a large, highly > + * corrupted table from using workmem worth of memory building up the > + * tuplestore. Don't leak the msg argument memory. > + */ > + pfree(msg); > > Maybe change the second sentence to something like: "That should be > OK, else the user can lower work_mem, but we'd better not leak any > additional memory." It may be a little wordy, but I went with /* * In principle, there is nothing to prevent a scan over a large, highly * corrupted table from using workmem worth of memory building up the * tuplestore. That's ok, but if we also leak the msg argument memory * until the end of the query, we could exceed workmem by more than a * trivial amount. Therefore, free the msg argument each time we are * called rather than waiting for our current memory context to be freed. */ > +/* > + * check_tuphdr_xids > + * > + * Determine whether tuples are visible for verification. Similar to > + * HeapTupleSatisfiesVacuum, but with critical differences. > + * > + * 1) Does not touch hint bits. It seems imprudent to write hint bits > + * to a table during a corruption check. > + * 2) Only makes a boolean determination of whether verification should > + * see the tuple, rather than doing extra work for vacuum-related > + * categorization. > + * > + * The caller should already have checked that xmin and xmax are not out of > + * bounds for the relation. > + */ > > First, check_tuphdr_xids() doesn't seem like a very good name. If you > have a function with that name and, like this one, it returns Boolean, > what does true mean? What does false mean? Kinda hard to tell. And > also, check the tuple header XIDs *for what*? If you called it, say, > tuple_is_visible(), that would be self-evident. Changed. > Second, consider that we hold at least AccessShareLock on the relation > - actually, ATM we hold ShareUpdateExclusiveLock. Either way, there > cannot be a concurrent modification to the tuple descriptor in > progress. Therefore, I think that only a HEAPTUPLE_DEAD tuple is > potentially using a non-current schema. If the tuple is > HEAPTUPLE_INSERT_IN_PROGRESS, there's either no ADD COLUMN in the > inserting transaction, or that transaction committed before we got our > lock. 
Similarly if it's HEAPTUPLE_DELETE_IN_PROGRESS or > HEAPTUPLE_RECENTLY_DEAD, the original inserter must've committed > before we got our lock. Or if it's both inserted and deleted in the > same transaction, say, then that transaction committed before we got > our lock or else contains no relevant DDL. IOW, I think you can check > everything but dead tuples here. Ok, I have changed tuple_is_visible to return true rather than false for those other cases. > Capitalization and punctuation for messages complaining about problems > need to be consistent. verify_heapam() has "Invalid redirect line > pointer offset %u out of bounds" which starts with a capital letter, > but check_tuphdr_xids() has "heap tuple with XMAX_IS_MULTI is neither > LOCKED_ONLY nor has a valid xmax" which does not. I vote for lower > case, but in any event it should be the same. I standardized on all lowercase text, though I left embedded symbols and constants such as LOCKED_ONLY alone. > Also, > check_tuphdr_xids() has "tuple xvac = %u invalid" which is either a > debugging leftover or a very unclear complaint. Right. That has been changed to "old-style VACUUM FULL transaction ID %u is invalid in this relation". > I think some real work > needs to be put into the phrasing of these messages so that it's more > clear exactly what is going on and why it's bad. For example the first > example in this paragraph is clearly a problem of some kind, but it's > not very clear exactly what is happening: is %u the offset of the > invalid line redirect or the value to which it points? I don't think > the phrasing is very grammatical, which makes it hard to tell which is > meant, and I actually think it would be a good idea to include both > things. Beware that every row returned from amcheck has more fields than just the error message. blkno OUT bigint, offnum OUT integer, lp_off OUT smallint, lp_flags OUT smallint, lp_len OUT smallint, attnum OUT integer, chunk OUT integer, msg OUT text Rather than including blkno, offnum, lp_off, lp_flags, lp_len, attnum, or chunk in the message, it would be better to remove these things from messages that include them. For the specific message under consideration, I've converted the text to "line pointer redirection to item at offset number %u is outside valid bounds %u .. %u". That avoids duplicating the offset information of the referring item, while reporting the offset of the referred item. > Project policy is generally against splitting a string across multiple > lines to fit within 80 characters. We like to fit within 80 > characters, but we like to be able to grep for strings more, and > breaking them up like this makes that harder. Thanks for clarifying the project policy. I joined these message strings back together. > + confess(ctx, > + pstrdup("corrupt toast chunk va_header")); > > This is another message that I don't think is very clear. There's two > elements to that. One is that the phrasing is not very good, and the > other is that there are no % escapes Changed to "corrupt extended toast chunk with sequence number %d has invalid varlena header %0x". I think all the other information about where the corruption was found is already present in the other returned columns. > > What's somebody going to do when > they see this message? First, they're probably going to have to look > at the code to figure out in which circumstances it gets generated; > that's a sign that the message isn't phrased clearly enough.
That will > tell them that an unexpected bit pattern has been found, but not what > that unexpected bit pattern actually was. So then, they're going to > have to try to find the relevant va_header by some other means and > fish out the relevant bit so that they can see what actually went > wrong. Right. > > + * Checks the current attribute as tracked in ctx for corruption. Records > + * any corruption found in ctx->corruption. > + * > + * > > Extra blank line. Fixed. > + Form_pg_attribute thisatt = TupleDescAttr(RelationGetDescr(ctx->rel), > + > ctx->attnum); > > Maybe you could avoid the line wrap by declaring this without > initializing it, and then initializing it as a separate statement. Yes, I like that better. I did not need to do the same with infomask, but it looks better to me to break the declaration and initialization for both, so I did that. > > + confess(ctx, psprintf("t_hoff + offset > lp_len (%u + %u > %u)", > + > ctx->tuphdr->t_hoff, ctx->offset, > + ctx->lp_len)); > > Uggh! This isn't even remotely an English sentence. I don't think > formulas are the way to go here, but I like the idea of formulas in > some places and written-out messages in others even less. I guess the > complaint here in English is something like "tuple attribute %d should > start at offset %u, but tuple length is only %u" or something of that > sort. Also, it seems like this complaint really ought to have been > reported on the *preceding* loop iteration, either complaining that > (1) the fixed length attribute is more than the number of remaining > bytes in the tuple or (2) the varlena header for the tuple specifies > an excessively high length. It seems like you're blaming the wrong > attribute for the problem. Yeah, and it wouldn't complain if the final attribute of a tuple was overlong, as there wouldn't be a next attribute to blame it on. I've changed it to report as you suggest, although it also still complains if the first attribute starts outside the bounds of the tuple. The two error messages now read as "tuple attribute should start at offset %u, but tuple length is only %u" and "tuple attribute of length %u ends at offset %u, but tuple length is only %u". > BTW, the header comments for this function (check_tuple_attribute) > neglect to document the meaning of the return value. Fixed. > + confess(ctx, psprintf("tuple xmax = %u > precedes relation " > + > "relfrozenxid = %u", > > This is another example of these messages needing work. The > corresponding message from heap_prepare_freeze_tuple() is "found > update xid %u from before relfrozenxid %u". That's better, because we > don't normally include equals signs in our messages like this, and > also because "relation relfrozenxid" is redundant. I think this should > say something like "tuple xmax %u precedes relfrozenxid %u". > > + confess(ctx, psprintf("tuple xmax = %u is in > the future", > + xmax)); > > And then this could be something like "tuple xmax %u follows > last-assigned xid %u". That would be more symmetric and more > informative. Both of these have been changed. > + if (SizeofHeapTupleHeader + BITMAPLEN(ctx->natts) > > ctx->tuphdr->t_hoff) > > I think we should be able to predict the exact value of t_hoff and > complain if it isn't precisely equal to the expected value. Or is that > not possible for some reason? That is possible, and I've updated the error message to match.
There are cases where you can't know if the HEAP_HASNULL bit is wrong or if the t_hoff value is wrong, but I've changed the code to just compute the length based on the HEAP_HASNULL setting and use that as the expected value, and complain when the actual value does not match the expected value. That sidesteps the problem of not knowing exactly which value to blame. > Is there some place that's checking that lp_len >= > SizeOfHeapTupleHeader before check_tuple() goes and starts poking into > the header? If not, there should be. Good catch. check_tuple() now does that before reading the header. > +$node->command_ok( > > + [ > + 'pg_amcheck', '-p', $port, 'postgres' > + ], > + 'pg_amcheck all schemas and tables implicitly'); > + > +$node->command_ok( > + [ > + 'pg_amcheck', '-i', '-p', $port, 'postgres' > + ], > + 'pg_amcheck all schemas, tables and indexes'); > > I haven't really looked through the btree-checking and pg_amcheck > parts of this much yet, but this caught my eye. Why would the default > be to check tables but not indexes? I think the default ought to be to > check everything we know how to check. I have changed the default to match your expectations. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
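(For readers following the relfrozenxid discussion above, here is a hypothetical sketch of the xid bounds check being described: a normal xid is suspect if it precedes the relation's relfrozenxid or if it is not older than the newest xid known to have been assigned. The names are invented for illustration; the patch itself also re-checks the upper bound before treating an apparently-future xid as corrupt, as Mark explains earlier in the thread.)

#include "postgres.h"
#include "access/transam.h"

/*
 * Sketch only: report whether a normal xid falls within the expected
 * range [relfrozenxid, next_known_valid_xid).
 */
static bool
xid_within_expected_bounds(TransactionId xid,
						   TransactionId relfrozenxid,
						   TransactionId next_known_valid_xid)
{
	if (!TransactionIdIsNormal(xid))
		return true;			/* frozen, bootstrap, and invalid xids handled elsewhere */

	if (TransactionIdIsNormal(relfrozenxid) &&
		TransactionIdPrecedes(xid, relfrozenxid))
		return false;			/* precedes relfrozenxid */

	if (!TransactionIdPrecedes(xid, next_known_valid_xid))
		return false;			/* newer than any xid known to be assigned */

	return true;
}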
Hi Mark, I think new structures should be listed in src/tools/pgindent/typedefs.list; otherwise, pgindent might disturb their indentation. Regards, Amul
> > > > > + ereport(ERROR, > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > + errmsg("unrecognized parameter for 'skip': %s", skip), > > + errhint("please choose from 'all visible', 'all frozen', " > > + "or NULL"))); > > > > I think it would be better if we had three string values selecting the > > different behaviors, and made the parameter NOT NULL but with a > > default. It seems like that would be easier to understand. Right now, > > I can tell that my options for what to skip are "all visible", "all > > frozen", and, uh, some other thing that I don't know what it is. I'm > > gonna guess the third option is to skip nothing, but it seems best to > > make that explicit. Also, should we maybe consider spelling this > > 'all-visible' and 'all-frozen' with dashes, instead of using spaces? > > Spaces in an option value seems a little icky to me somehow. > > I've made the options 'all-visible', 'all-frozen', and 'none'. It defaults to 'none'. I did not mark the function asstrict, as I think NULL is a reasonable value (and the default) for startblock and endblock. > > > + int64 startblock = -1; > > + int64 endblock = -1; > > ... > > + if (!PG_ARGISNULL(3)) > > + startblock = PG_GETARG_INT64(3); > > + if (!PG_ARGISNULL(4)) > > + endblock = PG_GETARG_INT64(4); > > ... > > + if (startblock < 0) > > + startblock = 0; > > + if (endblock < 0 || endblock > ctx.nblocks) > > + endblock = ctx.nblocks; > > + > > + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++) > > > > So, the user can specify a negative value explicitly and it will be > > treated as the default, and an endblock value that's larger than the > > relation size will be treated as the relation size. The way pg_prewarm > > does the corresponding checks seems superior: null indicates the > > default value, and any non-null value must be within range or you get > > an error. Also, you seem to be treating endblock as the first block > > that should not be checked, whereas pg_prewarm takes what seems to me > > to be the more natural interpretation: the end block is the last block > > that IS checked. If you do it this way, then someone who specifies the > > same start and end block will check no blocks -- silently, I think. > > Under that regime, for relations with one block of data, (startblock=0, endblock=0) means "check the zero'th block", andfor relations with no blocks of data, specifying any non-null (startblock,endblock) pair raises an exception. I don'tlike that too much, but I'm happy to defer to precedent. Since you say pg_prewarm works this way (I did not check),I have changed verify_heapam to do likewise. > > > + if (skip_all_frozen || skip_all_visible) > > > > Since you can't skip all frozen without skipping all visible, this > > test could be simplified. Or you could introduce a three-valued enum > > and test that skip_pages != SKIP_PAGES_NONE, which might be even > > better. > > It works now with a three-valued enum. > > > + /* We must unlock the page from the prior iteration, if any */ > > + Assert(ctx.blkno == InvalidBlockNumber || ctx.buffer != InvalidBuffer); > > > > I don't understand this assertion, and I don't understand the comment, > > either. I think ctx.blkno can never be equal to InvalidBlockNumber > > because we never set it to anything outside the range of 0..(endblock > > - 1), and I think ctx.buffer must always be unequal to InvalidBuffer > > because we just initialized it by calling ReadBufferExtended(). So I > > think this assertion would still pass if we wrote && rather than ||. 
> > But even then, I don't know what that has to do with the comment or > > why it even makes sense to have an assertion for that in the first > > place. > > Yes, it is vestigial. Removed. > > > + /* > > + * Open the relation. We use ShareUpdateExclusive to prevent concurrent > > + * vacuums from changing the relfrozenxid, relminmxid, or advancing the > > + * global oldestXid to be newer than those. This protection > > saves us from > > + * having to reacquire the locks and recheck those minimums for every > > + * tuple, which would be expensive. > > + */ > > + ctx.rel = relation_open(relid, ShareUpdateExclusiveLock); > > > > I don't think we'd need to recheck for every tuple, would we? Just for > > cases where there's an apparent violation of the rules. > > It's a bit fuzzy what an "apparent violation" might be if both ends of the range of valid xids may be moving, and arbitrarilymuch. It's also not clear how often to recheck, since you'd be dealing with a race condition no matter how oftenyou check. Perhaps the comments shouldn't mention how often you'd have to recheck, since there is no really defensiblechoice for that. I removed the offending sentence. > > > I guess that > > could still be expensive if there's a lot of them, but needing > > ShareUpdateExclusiveLock rather than only AccessShareLock is a little > > unfortunate. > > I welcome strategies that would allow for taking a lesser lock. > > > It's also unclear to me why this concerns itself with relfrozenxid and > > the cluster-wide oldestXid value but not with datfrozenxid. It seems > > like if we're going to sanity-check the relfrozenxid against the > > cluster-wide value, we ought to also check it against the > > database-wide value. Checking neither would also seem like a plausible > > choice. But it seems very strange to only check against the > > cluster-wide value. > > If the relation has a normal relfrozenxid, then the oldest valid xid we can encounter in the table is relfrozenxid. Otherwise,each row needs to be compared against some other minimum xid value. > > Logically, that other minimum xid value should be the oldest valid xid for the database, which must logically be at leastas old as any valid row in the table and no older than the oldest valid xid for the cluster. > > Unfortunately, if the comments in commands/vacuum.c circa line 1572 can be believed, and if I am reading them correctly,the stored value for the oldest valid xid in the database has been known to be corrupted by bugs in pg_upgrade. This is awful. If I compare the xid of a row in a table against the oldest xid value for the database, and thexid of the row is older, what can I do? I don't have a principled basis for determining which one of them is wrong. > > The logic in verify_heapam is conservative; it makes no guarantees about finding and reporting all corruption, but if itdoes report a row as corrupt, you can bank on that, bugs in verify_heapam itself not withstanding. I think this is a goodchoice; a tool with only false negatives is much more useful than one with both false positives and false negatives. > > I have added a comment about my reasoning to verify_heapam.c. I'm happy to be convinced of a better strategy for handlingthis situation. > > > > > + StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber, > > + "InvalidOffsetNumber > > increments to FirstOffsetNumber"); > > > > If you are going to rely on this property, I agree that it is good to > > check it. 
But it would be better to NOT rely on this property, and I > > suspect the code can be written quite cleanly without relying on it. > > And actually, that's what you did, because you first set ctx.offnum = > > InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in > > the loop initializer. So AFAICS the first initializer, and the static > > assert, are pointless. > > Ah, right you are. Removed. > > > > > + if (ItemIdIsRedirected(ctx.itemid)) > > + { > > + uint16 redirect = ItemIdGetRedirect(ctx.itemid); > > + if (redirect <= SizeOfPageHeaderData > > || redirect >= ph->pd_lower) > > ... > > + if ((redirect - SizeOfPageHeaderData) > > % sizeof(uint16)) > > > > I think that ItemIdGetRedirect() returns an offset, not a byte > > position. So the expectation that I would have is that it would be any > > integer >= 0 and <= maxoff. Am I confused? > > I think you are right about it returning an offset, which should be between FirstOffsetNumber and maxoff, inclusive. Ihave updated the checks. > > > BTW, it seems like it might > > be good to complain if the item to which it points is LP_UNUSED... > > AFAIK that shouldn't happen. > > Thanks for mentioning that. It now checks for that. > > > + errmsg("\"%s\" is not a heap AM", > > > > I think the correct wording would be just "is not a heap." The "heap > > AM" is the thing in pg_am, not a specific table. > > Fixed. > > > +confess(HeapCheckContext * ctx, char *msg) > > +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx) > > +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx) > > > > This is what happens when you pgindent without adding all the right > > things to typedefs.list first ... or when you don't pgindent and have > > odd ideas about how to indent things. > > Hmm. I don't see the three lines of code you are quoting. Which patch is that from? > > > > > + /* > > + * In principle, there is nothing to prevent a scan over a large, highly > > + * corrupted table from using workmem worth of memory building up the > > + * tuplestore. Don't leak the msg argument memory. > > + */ > > + pfree(msg); > > > > Maybe change the second sentence to something like: "That should be > > OK, else the user can lower work_mem, but we'd better not leak any > > additional memory." > > It may be a little wordy, but I went with > > /* > * In principle, there is nothing to prevent a scan over a large, highly > * corrupted table from using workmem worth of memory building up the > * tuplestore. That's ok, but if we also leak the msg argument memory > * until the end of the query, we could exceed workmem by more than a > * trivial amount. Therefore, free the msg argument each time we are > * called rather than waiting for our current memory context to be freed. > */ > > > +/* > > + * check_tuphdr_xids > > + * > > + * Determine whether tuples are visible for verification. Similar to > > + * HeapTupleSatisfiesVacuum, but with critical differences. > > + * > > + * 1) Does not touch hint bits. It seems imprudent to write hint bits > > + * to a table during a corruption check. > > + * 2) Only makes a boolean determination of whether verification should > > + * see the tuple, rather than doing extra work for vacuum-related > > + * categorization. > > + * > > + * The caller should already have checked that xmin and xmax are not out of > > + * bounds for the relation. > > + */ > > > > First, check_tuphdr_xids() doesn't seem like a very good name. 
If you > > have a function with that name and, like this one, it returns Boolean, > > what does true mean? What does false mean? Kinda hard to tell. And > > also, check the tuple header XIDs *for what*? If you called it, say, > > tuple_is_visible(), that would be self-evident. > > Changed. > > > Second, consider that we hold at least AccessShareLock on the relation > > - actually, ATM we hold ShareUpdateExclusiveLock. Either way, there > > cannot be a concurrent modification to the tuple descriptor in > > progress. Therefore, I think that only a HEAPTUPLE_DEAD tuple is > > potentially using a non-current schema. If the tuple is > > HEAPTUPLE_INSERT_IN_PROGRESS, there's either no ADD COLUMN in the > > inserting transaction, or that transaction committed before we got our > > lock. Similarly if it's HEAPTUPLE_DELETE_IN_PROGRESS or > > HEAPTUPLE_RECENTLY_DEAD, the original inserter must've committed > > before we got our lock. Or if it's both inserted and deleted in the > > same transaction, say, then that transaction committed before we got > > our lock or else contains no relevant DDL. IOW, I think you can check > > everything but dead tuples here. > > Ok, I have changed tuple_is_visible to return true rather than false for those other cases. > > > Capitalization and punctuation for messages complaining about problems > > need to be consistent. verify_heapam() has "Invalid redirect line > > pointer offset %u out of bounds" which starts with a capital letter, > > but check_tuphdr_xids() has "heap tuple with XMAX_IS_MULTI is neither > > LOCKED_ONLY nor has a valid xmax" which does not. I vote for lower > > case, but in any event it should be the same. > > I standardized on all lowercase text, though I left embedded symbols and constants such as LOCKED_ONLY alone. > > > Also, > > check_tuphdr_xids() has "tuple xvac = %u invalid" which is either a > > debugging leftover or a very unclear complaint. > > Right. That has been changed to "old-style VACUUM FULL transaction ID %u is invalid in this relation". > > > I think some real work > > needs to be put into the phrasing of these messages so that it's more > > clear exactly what is going on and why it's bad. For example the first > > example in this paragraph is clearly a problem of some kind, but it's > > not very clear exactly what is happening: is %u the offset of the > > invalid line redirect or the value to which it points? I don't think > > the phrasing is very grammatical, which makes it hard to tell which is > > meant, and I actually think it would be a good idea to include both > > things. > > Beware that every row returned from amcheck has more fields than just the error message. > > blkno OUT bigint, > offnum OUT integer, > lp_off OUT smallint, > lp_flags OUT smallint, > lp_len OUT smallint, > attnum OUT integer, > chunk OUT integer, > msg OUT text > > Rather than including blkno, offnum, lp_off, lp_flags, lp_len, attnum, or chunk in the message, it would be better to remove these things from messages that include them. For the specific message under consideration, I've converted the text to "line pointer redirection to item at offset number %u is outside valid bounds %u .. %u". That avoids duplicating the offset information of the referring item, while reporting the offset of the referred item. > > > Project policy is generally against splitting a string across multiple > > lines to fit within 80 characters. 
We like to fit within 80 > > characters, but we like to be able to grep for strings more, and > > breaking them up like this makes that harder. > > Thanks for clarifying the project policy. I joined these message strings back together. > > > + confess(ctx, > > + pstrdup("corrupt toast chunk va_header")); > > > > This is another message that I don't think is very clear. There's two > > elements to that. One is that the phrasing is not very good, and the > > other is that there are no % escapes > > Changed to "corrupt extended toast chunk with sequence number %d has invalid varlena header %0x". I think all the otherinformation about where the corruption was found is already present in the other returned columns. > > > What's somebody going to do when > > they see this message? First, they're probably going to have to look > > at the code to figure out in which circumstances it gets generated; > > that's a sign that the message isn't phrased clearly enough. That will > > tell them that an unexpected bit pattern has been found, but not what > > that unexpected bit pattern actually was. So then, they're going to > > have to try to find the relevant va_header by some other means and > > fish out the relevant bit so that they can see what actually went > > wrong. > > Right. > > > > > + * Checks the current attribute as tracked in ctx for corruption. Records > > + * any corruption found in ctx->corruption. > > + * > > + * > > > > Extra blank line. > > Fixed. > > > + Form_pg_attribute thisatt = TupleDescAttr(RelationGetDescr(ctx->rel), > > + > > ctx->attnum); > > > > Maybe you could avoid the line wrap by declaring this without > > initializing it, and then initializing it as a separate statement. > > Yes, I like that better. I did not need to do the same with infomask, but it looks better to me to break the declarationand initialization for both, so I did that. > > > > > + confess(ctx, psprintf("t_hoff + offset > lp_len (%u + %u > %u)", > > + > > ctx->tuphdr->t_hoff, ctx->offset, > > + ctx->lp_len)); > > > > Uggh! This isn't even remotely an English sentence. I don't think > > formulas are the way to go here, but I like the idea of formulas in > > some places and written-out messages in others even less. I guess the > > complaint here in English is something like "tuple attribute %d should > > start at offset %u, but tuple length is only %u" or something of that > > sort. Also, it seems like this complaint really ought to have been > > reported on the *preceding* loop iteration, either complaining that > > (1) the fixed length attribute is more than the number of remaining > > bytes in the tuple or (2) the varlena header for the tuple specifies > > an excessively high length. It seems like you're blaming the wrong > > attribute for the problem. > > Yeah, and it wouldn't complain if the final attribute of a tuple was overlong, as there wouldn't be a next attribute toblame it on. I've changed it to report as you suggest, although it also still complains if the first attribute startsoutside the bounds of the tuple. The two error messages now read as "tuple attribute should start at offset %u, buttuple length is only %u" and "tuple attribute of length %u ends at offset %u, but tuple length is only %u". > > > BTW, the header comments for this function (check_tuple_attribute) > > neglect to document the meaning of the return value. > > Fixed. 
> > > + confess(ctx, psprintf("tuple xmax = %u > > precedes relation " > > + > > "relfrozenxid = %u", > > > > This is another example of these messages needing work. The > > corresponding message from heap_prepare_freeze_tuple() is "found > > update xid %u from before relfrozenxid %u". That's better, because we > > don't normally include equals signs in our messages like this, and > > also because "relation relfrozenxid" is redundant. I think this should > > say something like "tuple xmax %u precedes relfrozenxid %u". > > > > + confess(ctx, psprintf("tuple xmax = %u is in > > the future", > > + xmax)); > > > > And then this could be something like "tuple xmax %u follows > > last-assigned xid %u". That would be more symmetric and more > > informative. > > Both of these have been changed. > > > + if (SizeofHeapTupleHeader + BITMAPLEN(ctx->natts) > > > ctx->tuphdr->t_hoff) > > > > I think we should be able to predict the exact value of t_hoff and > > complain if it isn't precisely equal to the expected value. Or is that > > not possible for some reason? > > That is possible, and I've updated the error message to match. There are cases where you can't know if the HEAP_HASNULL bit is wrong or if the t_hoff value is wrong, but I've changed the code to just compute the length based on the HEAP_HASNULL setting and use that as the expected value, and complain when the actual value does not match the expected. That sidesteps the problem of not knowing exactly which value to blame. > > > Is there some place that's checking that lp_len >= > > SizeOfHeapTupleHeader before check_tuple() goes and starts poking into > > the header? If not, there should be. > > Good catch. check_tuple() now does that before reading the header. > > > +$node->command_ok( > > > > + [ > > + 'pg_amcheck', '-p', $port, 'postgres' > > + ], > > + 'pg_amcheck all schemas and tables implicitly'); > > + > > +$node->command_ok( > > + [ > > + 'pg_amcheck', '-i', '-p', $port, 'postgres' > > + ], > > + 'pg_amcheck all schemas, tables and indexes'); > > > > I haven't really looked through the btree-checking and pg_amcheck > > parts of this much yet, but this caught my eye. Why would the default > > be to check tables but not indexes? I think the default ought to be to > > check everything we know how to check. > > I have changed the default to match your expectations. > > > > — > Mark Dilger > EnterpriseDB: http://www.enterprisedb.com > The Enterprise PostgreSQL Company > > >
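A minimal sketch of the expected-header-size comparison described in the reply quoted above, assuming the HeapCheckContext fields shown in the quoted hunks; this illustrates the approach rather than reproducing the patch's code:

    /*
     * Illustration only: compute the header size implied by the HEAP_HASNULL
     * bit and the attribute count, then complain if the stored t_hoff
     * differs.  This avoids guessing whether the bit or the offset is the
     * corrupt one.
     */
    uint16      expected_hoff;

    if (ctx->tuphdr->t_infomask & HEAP_HASNULL)
        expected_hoff = MAXALIGN(SizeofHeapTupleHeader + BITMAPLEN(ctx->natts));
    else
        expected_hoff = MAXALIGN(SizeofHeapTupleHeader);

    if (ctx->tuphdr->t_hoff != expected_hoff)
        confess(ctx, psprintf("tuple data offset %u differs from expected offset %u",
                              ctx->tuphdr->t_hoff, expected_hoff));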
On Tue, Jul 21, 2020 at 10:58 AM Amul Sul <sulamul@gmail.com> wrote: > > Hi Mark, > > I think new structures should be listed in src/tools/pgindent/typedefs.list, > otherwise, pgindent might disturb its indentation. > > Regards, > Amul > > > On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: > > > > > > > > > On Jul 16, 2020, at 12:38 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > > > > > On Mon, Jul 6, 2020 at 2:06 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > >> The v10 patch without these ideas is here: > > > > > > Along the lines of what Alvaro was saying before, I think this > > > definitely needs to be split up into a series of patches. The commit > > > message for v10 describes it doing three pretty separate things, and I > > > think that argues for splitting it into a series of three patches. I'd > > > argue for this ordering: > > > > > > 0001 Refactoring existing amcheck btree checking functions to optionally > > > return corruption information rather than ereport'ing it. This is > > > used by the new pg_amcheck command line tool for reporting back to > > > the caller. > > > > > > 0002 Adding new function verify_heapam for checking a heap relation and > > > associated toast relation, if any, to contrib/amcheck. > > > > > > 0003 Adding new contrib module pg_amcheck, which is a command line > > > interface for running amcheck's verifications against tables and > > > indexes. > > > > > > It's too hard to review things like this when it's all mixed together. > > > > The v11 patch series is broken up as you suggest. > > > > > +++ b/contrib/amcheck/t/skipping.pl > > > > > > The name of this file is inconsistent with the tree's usual > > > convention, which is all stuff like 001_whatever.pl, except for > > > src/test/modules/brin, which randomly decided to use two digits > > > instead of three. There's no precedent for a test file with no leading > > > numeric digits. Also, what does "skipping" even have to do with what > > > the test is checking? Maybe it's intended to refer to the new error > > > handling "skipping" the actual error in favor of just reporting it > > > without stopping, but that's not really what the word "skipping" > > > normally means. Finally, it seems a bit over-engineered: do we really > > > need 183 test cases to check that detecting a problem doesn't lead to > > > an abort? Like, if that's the purpose of the test, I'd expect it to > > > check one corrupt relation and one non-corrupt relation, each with and > > > without the no-error behavior. And that's about it. Or maybe it's > > > talking about skipping pages during the checks, because those pages > > > are all-visible or all-frozen? It's not very clear to me what's going > > > on here. > > > > The "skipping" did originally refer to testing verify_heapam()'s option to skip all-visible or all-frozen blocks. Ihave renamed it 001_verify_heapam.pl, since it tests that function. > > > > > > > > + TransactionId nextKnownValidXid; > > > + TransactionId oldestValidXid; > > > > > > Please add explanatory comments indicating what these are intended to > > > mean. > > > > Done. > > > > > For most of the the structure members, the brief comments > > > already present seem sufficient; but here, more explanation looks > > > necessary and less is provided. The "Values for returning tuples" > > > could possibly also use some more detail. > > > > Ok, I've expanded the comments for these. 
> > > > > +#define HEAPCHECK_RELATION_COLS 8 > > > > > > I think this should really be at the top of the file someplace. > > > Sometimes people have adopted this style when the #define is only used > > > within the function that contains it, but that's not the case here. > > > > Done. > > > > > > > > + ereport(ERROR, > > > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > > > + errmsg("unrecognized parameter for 'skip': %s", skip), > > > + errhint("please choose from 'all visible', 'all frozen', " > > > + "or NULL"))); > > > > > > I think it would be better if we had three string values selecting the > > > different behaviors, and made the parameter NOT NULL but with a > > > default. It seems like that would be easier to understand. Right now, > > > I can tell that my options for what to skip are "all visible", "all > > > frozen", and, uh, some other thing that I don't know what it is. I'm > > > gonna guess the third option is to skip nothing, but it seems best to > > > make that explicit. Also, should we maybe consider spelling this > > > 'all-visible' and 'all-frozen' with dashes, instead of using spaces? > > > Spaces in an option value seems a little icky to me somehow. > > > > I've made the options 'all-visible', 'all-frozen', and 'none'. It defaults to 'none'. I did not mark the function as strict, as I think NULL is a reasonable value (and the default) for startblock and endblock. > > > > > + int64 startblock = -1; > > > + int64 endblock = -1; > > > ... > > > + if (!PG_ARGISNULL(3)) > > > + startblock = PG_GETARG_INT64(3); > > > + if (!PG_ARGISNULL(4)) > > > + endblock = PG_GETARG_INT64(4); > > > ... > > > + if (startblock < 0) > > > + startblock = 0; > > > + if (endblock < 0 || endblock > ctx.nblocks) > > > + endblock = ctx.nblocks; > > > + > > > + for (ctx.blkno = startblock; ctx.blkno < endblock; ctx.blkno++) > > > > > > So, the user can specify a negative value explicitly and it will be > > > treated as the default, and an endblock value that's larger than the > > > relation size will be treated as the relation size. The way pg_prewarm > > > does the corresponding checks seems superior: null indicates the > > > default value, and any non-null value must be within range or you get > > > an error. Also, you seem to be treating endblock as the first block > > > that should not be checked, whereas pg_prewarm takes what seems to me > > > to be the more natural interpretation: the end block is the last block > > > that IS checked. If you do it this way, then someone who specifies the > > > same start and end block will check no blocks -- silently, I think. > > > > Under that regime, for relations with one block of data, (startblock=0, endblock=0) means "check the zero'th block", and for relations with no blocks of data, specifying any non-null (startblock,endblock) pair raises an exception. I don't like that too much, but I'm happy to defer to precedent. Since you say pg_prewarm works this way (I did not check), I have changed verify_heapam to do likewise. > > > > > + if (skip_all_frozen || skip_all_visible) > > > > > > Since you can't skip all frozen without skipping all visible, this > > > test could be simplified. Or you could introduce a three-valued enum > > > and test that skip_pages != SKIP_PAGES_NONE, which might be even > > > better. > > > > It works now with a three-valued enum. 
> > > > > + /* We must unlock the page from the prior iteration, if any */ > > > + Assert(ctx.blkno == InvalidBlockNumber || ctx.buffer != InvalidBuffer); > > > > > > I don't understand this assertion, and I don't understand the comment, > > > either. I think ctx.blkno can never be equal to InvalidBlockNumber > > > because we never set it to anything outside the range of 0..(endblock > > > - 1), and I think ctx.buffer must always be unequal to InvalidBuffer > > > because we just initialized it by calling ReadBufferExtended(). So I > > > think this assertion would still pass if we wrote && rather than ||. > > > But even then, I don't know what that has to do with the comment or > > > why it even makes sense to have an assertion for that in the first > > > place. > > > > Yes, it is vestigial. Removed. > > > > > + /* > > > + * Open the relation. We use ShareUpdateExclusive to prevent concurrent > > > + * vacuums from changing the relfrozenxid, relminmxid, or advancing the > > > + * global oldestXid to be newer than those. This protection > > > saves us from > > > + * having to reacquire the locks and recheck those minimums for every > > > + * tuple, which would be expensive. > > > + */ > > > + ctx.rel = relation_open(relid, ShareUpdateExclusiveLock); > > > > > > I don't think we'd need to recheck for every tuple, would we? Just for > > > cases where there's an apparent violation of the rules. > > > > It's a bit fuzzy what an "apparent violation" might be if both ends of the range of valid xids may be moving, and arbitrarilymuch. It's also not clear how often to recheck, since you'd be dealing with a race condition no matter how oftenyou check. Perhaps the comments shouldn't mention how often you'd have to recheck, since there is no really defensiblechoice for that. I removed the offending sentence. > > > > > I guess that > > > could still be expensive if there's a lot of them, but needing > > > ShareUpdateExclusiveLock rather than only AccessShareLock is a little > > > unfortunate. > > > > I welcome strategies that would allow for taking a lesser lock. > > > > > It's also unclear to me why this concerns itself with relfrozenxid and > > > the cluster-wide oldestXid value but not with datfrozenxid. It seems > > > like if we're going to sanity-check the relfrozenxid against the > > > cluster-wide value, we ought to also check it against the > > > database-wide value. Checking neither would also seem like a plausible > > > choice. But it seems very strange to only check against the > > > cluster-wide value. > > > > If the relation has a normal relfrozenxid, then the oldest valid xid we can encounter in the table is relfrozenxid. Otherwise, each row needs to be compared against some other minimum xid value. > > > > Logically, that other minimum xid value should be the oldest valid xid for the database, which must logically be at leastas old as any valid row in the table and no older than the oldest valid xid for the cluster. > > > > Unfortunately, if the comments in commands/vacuum.c circa line 1572 can be believed, and if I am reading them correctly,the stored value for the oldest valid xid in the database has been known to be corrupted by bugs in pg_upgrade. This is awful. If I compare the xid of a row in a table against the oldest xid value for the database, and thexid of the row is older, what can I do? I don't have a principled basis for determining which one of them is wrong. 
> > > > The logic in verify_heapam is conservative; it makes no guarantees about finding and reporting all corruption, but ifit does report a row as corrupt, you can bank on that, bugs in verify_heapam itself not withstanding. I think this isa good choice; a tool with only false negatives is much more useful than one with both false positives and false negatives. > > > > I have added a comment about my reasoning to verify_heapam.c. I'm happy to be convinced of a better strategy for handlingthis situation. > > > > > > > > + StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber, > > > + "InvalidOffsetNumber > > > increments to FirstOffsetNumber"); > > > > > > If you are going to rely on this property, I agree that it is good to > > > check it. But it would be better to NOT rely on this property, and I > > > suspect the code can be written quite cleanly without relying on it. > > > And actually, that's what you did, because you first set ctx.offnum = > > > InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in > > > the loop initializer. So AFAICS the first initializer, and the static > > > assert, are pointless. > > > > Ah, right you are. Removed. > > > > > > > > + if (ItemIdIsRedirected(ctx.itemid)) > > > + { > > > + uint16 redirect = ItemIdGetRedirect(ctx.itemid); > > > + if (redirect <= SizeOfPageHeaderData > > > || redirect >= ph->pd_lower) > > > ... > > > + if ((redirect - SizeOfPageHeaderData) > > > % sizeof(uint16)) > > > > > > I think that ItemIdGetRedirect() returns an offset, not a byte > > > position. So the expectation that I would have is that it would be any > > > integer >= 0 and <= maxoff. Am I confused? > > > > I think you are right about it returning an offset, which should be between FirstOffsetNumber and maxoff, inclusive. I have updated the checks. > > > > > BTW, it seems like it might > > > be good to complain if the item to which it points is LP_UNUSED... > > > AFAIK that shouldn't happen. > > > > Thanks for mentioning that. It now checks for that. > > > > > + errmsg("\"%s\" is not a heap AM", > > > > > > I think the correct wording would be just "is not a heap." The "heap > > > AM" is the thing in pg_am, not a specific table. > > > > Fixed. > > > > > +confess(HeapCheckContext * ctx, char *msg) > > > +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx) > > > +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx) > > > > > > This is what happens when you pgindent without adding all the right > > > things to typedefs.list first ... or when you don't pgindent and have > > > odd ideas about how to indent things. > > > > Hmm. I don't see the three lines of code you are quoting. Which patch is that from? > > > > > > > > + /* > > > + * In principle, there is nothing to prevent a scan over a large, highly > > > + * corrupted table from using workmem worth of memory building up the > > > + * tuplestore. Don't leak the msg argument memory. > > > + */ > > > + pfree(msg); > > > > > > Maybe change the second sentence to something like: "That should be > > > OK, else the user can lower work_mem, but we'd better not leak any > > > additional memory." > > > > It may be a little wordy, but I went with > > > > /* > > * In principle, there is nothing to prevent a scan over a large, highly > > * corrupted table from using workmem worth of memory building up the > > * tuplestore. That's ok, but if we also leak the msg argument memory > > * until the end of the query, we could exceed workmem by more than a > > * trivial amount. 
Therefore, free the msg argument each time we are > > * called rather than waiting for our current memory context to be freed. > > */ > > > > > +/* > > > + * check_tuphdr_xids > > > + * > > > + * Determine whether tuples are visible for verification. Similar to > > > + * HeapTupleSatisfiesVacuum, but with critical differences. > > > + * > > > + * 1) Does not touch hint bits. It seems imprudent to write hint bits > > > + * to a table during a corruption check. > > > + * 2) Only makes a boolean determination of whether verification should > > > + * see the tuple, rather than doing extra work for vacuum-related > > > + * categorization. > > > + * > > > + * The caller should already have checked that xmin and xmax are not out of > > > + * bounds for the relation. > > > + */ > > > > > > First, check_tuphdr_xids() doesn't seem like a very good name. If you > > > have a function with that name and, like this one, it returns Boolean, > > > what does true mean? What does false mean? Kinda hard to tell. And > > > also, check the tuple header XIDs *for what*? If you called it, say, > > > tuple_is_visible(), that would be self-evident. > > > > Changed. > > > > > Second, consider that we hold at least AccessShareLock on the relation > > > - actually, ATM we hold ShareUpdateExclusiveLock. Either way, there > > > cannot be a concurrent modification to the tuple descriptor in > > > progress. Therefore, I think that only a HEAPTUPLE_DEAD tuple is > > > potentially using a non-current schema. If the tuple is > > > HEAPTUPLE_INSERT_IN_PROGRESS, there's either no ADD COLUMN in the > > > inserting transaction, or that transaction committed before we got our > > > lock. Similarly if it's HEAPTUPLE_DELETE_IN_PROGRESS or > > > HEAPTUPLE_RECENTLY_DEAD, the original inserter must've committed > > > before we got our lock. Or if it's both inserted and deleted in the > > > same transaction, say, then that transaction committed before we got > > > our lock or else contains no relevant DDL. IOW, I think you can check > > > everything but dead tuples here. > > > > Ok, I have changed tuple_is_visible to return true rather than false for those other cases. > > > > > Capitalization and punctuation for messages complaining about problems > > > need to be consistent. verify_heapam() has "Invalid redirect line > > > pointer offset %u out of bounds" which starts with a capital letter, > > > but check_tuphdr_xids() has "heap tuple with XMAX_IS_MULTI is neither > > > LOCKED_ONLY nor has a valid xmax" which does not. I vote for lower > > > case, but in any event it should be the same. > > > > I standardized on all lowercase text, though I left embedded symbols and constants such as LOCKED_ONLY alone. > > > > > Also, > > > check_tuphdr_xids() has "tuple xvac = %u invalid" which is either a > > > debugging leftover or a very unclear complaint. > > > > Right. That has been changed to "old-style VACUUM FULL transaction ID %u is invalid in this relation". > > > > > I think some real work > > > needs to be put into the phrasing of these messages so that it's more > > > clear exactly what is going on and why it's bad. For example the first > > > example in this paragraph is clearly a problem of some kind, but it's > > > not very clear exactly what is happening: is %u the offset of the > > > invalid line redirect or the value to which it points? I don't think > > > the phrasing is very grammatical, which makes it hard to tell which is > > > meant, and I actually think it would be a good idea to include both > > > things. 
> > > > Beware that every row returned from amcheck has more fields than just the error message. > > > > blkno OUT bigint, > > offnum OUT integer, > > lp_off OUT smallint, > > lp_flags OUT smallint, > > lp_len OUT smallint, > > attnum OUT integer, > > chunk OUT integer, > > msg OUT text > > > > Rather than including blkno, offnum, lp_off, lp_flags, lp_len, attnum, or chunk in the message, it would be better toremove these things from messages that include them. For the specific message under consideration, I've converted thetext to "line pointer redirection to item at offset number %u is outside valid bounds %u .. %u". That avoids duplicatingthe offset information of the referring item, while reporting to offset of the referred item. > > > > > Project policy is generally against splitting a string across multiple > > > lines to fit within 80 characters. We like to fit within 80 > > > characters, but we like to be able to grep for strings more, and > > > breaking them up like this makes that harder. > > > > Thanks for clarifying the project policy. I joined these message strings back together. In v11-0001 and v11-0002 patches, there are still a few more errmsg that need to be joined. e.g: + /* check to see if caller supports us returning a tuplestore */ + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("set-valued function called in context that cannot " + "accept a set"))); + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_SYNTAX_ERROR), + errmsg("materialize mode required, but it is not allowed " + "in this context"))); > > > > > + confess(ctx, > > > + pstrdup("corrupt toast chunk va_header")); > > > > > > This is another message that I don't think is very clear. There's two > > > elements to that. One is that the phrasing is not very good, and the > > > other is that there are no % escapes > > > > Changed to "corrupt extended toast chunk with sequence number %d has invalid varlena header %0x". I think all the otherinformation about where the corruption was found is already present in the other returned columns. > > > > > What's somebody going to do when > > > they see this message? First, they're probably going to have to look > > > at the code to figure out in which circumstances it gets generated; > > > that's a sign that the message isn't phrased clearly enough. That will > > > tell them that an unexpected bit pattern has been found, but not what > > > that unexpected bit pattern actually was. So then, they're going to > > > have to try to find the relevant va_header by some other means and > > > fish out the relevant bit so that they can see what actually went > > > wrong. > > > > Right. > > > > > > > > + * Checks the current attribute as tracked in ctx for corruption. Records > > > + * any corruption found in ctx->corruption. > > > + * > > > + * > > > > > > Extra blank line. > > > > Fixed. > > > > > + Form_pg_attribute thisatt = TupleDescAttr(RelationGetDescr(ctx->rel), > > > + > > > ctx->attnum); > > > > > > Maybe you could avoid the line wrap by declaring this without > > > initializing it, and then initializing it as a separate statement. > > > > Yes, I like that better. I did not need to do the same with infomask, but it looks better to me to break the declarationand initialization for both, so I did that. 
> > > > > > > > + confess(ctx, psprintf("t_hoff + offset > lp_len (%u + %u > %u)", > > > + > > > ctx->tuphdr->t_hoff, ctx->offset, > > > + ctx->lp_len)); > > > > > > Uggh! This isn't even remotely an English sentence. I don't think > > > formulas are the way to go here, but I like the idea of formulas in > > > some places and written-out messages in others even less. I guess the > > > complaint here in English is something like "tuple attribute %d should > > > start at offset %u, but tuple length is only %u" or something of that > > > sort. Also, it seems like this complaint really ought to have been > > > reported on the *preceding* loop iteration, either complaining that > > > (1) the fixed length attribute is more than the number of remaining > > > bytes in the tuple or (2) the varlena header for the tuple specifies > > > an excessively high length. It seems like you're blaming the wrong > > > attribute for the problem. > > > > Yeah, and it wouldn't complain if the final attribute of a tuple was overlong, as there wouldn't be a next attributeto blame it on. I've changed it to report as you suggest, although it also still complains if the first attributestarts outside the bounds of the tuple. The two error messages now read as "tuple attribute should start at offset%u, but tuple length is only %u" and "tuple attribute of length %u ends at offset %u, but tuple length is only %u". > > > > > BTW, the header comments for this function (check_tuple_attribute) > > > neglect to document the meaning of the return value. > > > > Fixed. > > > > > + confess(ctx, psprintf("tuple xmax = %u > > > precedes relation " > > > + > > > "relfrozenxid = %u", > > > > > > This is another example of these messages needing work. The > > > corresponding message from heap_prepare_freeze_tuple() is "found > > > update xid %u from before relfrozenxid %u". That's better, because we > > > don't normally include equals signs in our messages like this, and > > > also because "relation relfrozenxid" is redundant. I think this should > > > say something like "tuple xmax %u precedes relfrozenxid %u". > > > > > > + confess(ctx, psprintf("tuple xmax = %u is in > > > the future", > > > + xmax)); > > > > > > And then this could be something like "tuple xmax %u follows > > > last-assigned xid %u". That would be more symmetric and more > > > informative. > > > > Both of these have been changed. > > > > > + if (SizeofHeapTupleHeader + BITMAPLEN(ctx->natts) > > > > ctx->tuphdr->t_hoff) > > > > > > I think we should be able to predict the exact value of t_hoff and > > > complain if it isn't precisely equal to the expected value. Or is that > > > not possible for some reason? > > > > That is possible, and I've updated the error message to match. There are cases where you can't know if the HEAP_HASNULLbit is wrong or if the t_hoff value is wrong, but I've changed the code to just compute the length based on theHEAP_HASNULL setting and use that as the expected value, and complain when the actual value does not match the expected. That sidesteps the problem of not knowing exactly which value to blame. > > > > > Is there some place that's checking that lp_len >= > > > SizeOfHeapTupleHeader before check_tuple() goes and starts poking into > > > the header? If not, there should be. > > > > Good catch. check_tuple() now does that before reading the header. 
> > > > > +$node->command_ok( > > > > > > + [ > > > + 'pg_amcheck', '-p', $port, 'postgres' > > > + ], > > > + 'pg_amcheck all schemas and tables implicitly'); > > > + > > > +$node->command_ok( > > > + [ > > > + 'pg_amcheck', '-i', '-p', $port, 'postgres' > > > + ], > > > + 'pg_amcheck all schemas, tables and indexes'); > > > > > > I haven't really looked through the btree-checking and pg_amcheck > > > parts of this much yet, but this caught my eye. Why would the default > > > be to check tables but not indexes? I think the default ought to be to > > > check everything we know how to check. > > > > I have changed the default to match your expectations. > > > > > > > > — > > Mark Dilger > > EnterpriseDB: http://www.enterprisedb.com > > The Enterprise PostgreSQL Company > > > > > >
> On Jul 20, 2020, at 11:50 PM, Amul Sul <sulamul@gmail.com> wrote: > > On Tue, Jul 21, 2020 at 10:58 AM Amul Sul <sulamul@gmail.com> wrote: >> >> Hi Mark, >> >> I think new structures should be listed in src/tools/pgindent/typedefs.list, >> otherwise, pgindent might disturb its indentation. >> <snip> > > In v11-0001 and v11-0002 patches, there are still a few more errmsg that need to > be joined. > > e.g: > > + /* check to see if caller supports us returning a tuplestore */ > + if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo)) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("set-valued function called in context that cannot " > + "accept a set"))); > + if (!(rsinfo->allowedModes & SFRM_Materialize)) > + ereport(ERROR, > + (errcode(ERRCODE_SYNTAX_ERROR), > + errmsg("materialize mode required, but it is not allowed " > + "in this context"))); Thanks for the review! I believe these v12 patches resolve the two issues you raised. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > [....] > > > > + StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber, > > + "InvalidOffsetNumber > > increments to FirstOffsetNumber"); > > > > If you are going to rely on this property, I agree that it is good to > > check it. But it would be better to NOT rely on this property, and I > > suspect the code can be written quite cleanly without relying on it. > > And actually, that's what you did, because you first set ctx.offnum = > > InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in > > the loop initializer. So AFAICS the first initializer, and the static > > assert, are pointless. > > Ah, right you are. Removed. > I can see the same assert and the unnecessary assignment in v12-0002, is that the same thing that is supposed to be removed, or am I missing something? > [....] > > +confess(HeapCheckContext * ctx, char *msg) > > +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx) > > +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx) > > > > This is what happens when you pgindent without adding all the right > > things to typedefs.list first ... or when you don't pgindent and have > > odd ideas about how to indent things. > > Hmm. I don't see the three lines of code you are quoting. Which patch is that from? > I think it was the same thing related to my previous suggestion to list new structures to typedefs.list. V12 has listed new structures but I think there are still some more adjustments needed in the code e.g. see space between HeapCheckContext and * (asterisk) that need to be fixed. I am not sure if the pgindent will do that or not. Here are a few more minor comments for the v12-0002 patch & some of them apply to other patches as well: #include "utils/snapmgr.h" - +#include "amcheck.h" Doesn't seem to be at the correct place -- need to be in sorted order. + if (!PG_ARGISNULL(3)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("starting block " INT64_FORMAT + " is out of bounds for relation with no blocks", + PG_GETARG_INT64(3)))); + if (!PG_ARGISNULL(4)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("ending block " INT64_FORMAT + " is out of bounds for relation with no blocks", + PG_GETARG_INT64(4)))); I think these errmsg() strings also should be in one line. + if (fatal) + { + if (ctx.toast_indexes) + toast_close_indexes(ctx.toast_indexes, ctx.num_toast_indexes, + ShareUpdateExclusiveLock); + if (ctx.toastrel) + table_close(ctx.toastrel, ShareUpdateExclusiveLock); Toast index and rel closing block style is not the same as at the ending of verify_heapam(). + /* If we get this far, we know the relation has at least one block */ + startblock = PG_ARGISNULL(3) ? 0 : PG_GETARG_INT64(3); + endblock = PG_ARGISNULL(4) ? ((int64) ctx.nblocks) - 1 : PG_GETARG_INT64(4); + if (startblock < 0 || endblock >= ctx.nblocks || startblock > endblock) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("block range " INT64_FORMAT " .. " INT64_FORMAT + " is out of bounds for relation with block count %u", + startblock, endblock, ctx.nblocks))); + ... ... + if (startblock < 0) + startblock = 0; + if (endblock < 0 || endblock > ctx.nblocks) + endblock = ctx.nblocks; Other than endblock < 0 case, do we really need that? I think due to the above error check the rest of the cases will not reach this place. 
+ confess(ctx, psprintf( + "tuple xmax %u follows last assigned xid %u", + xmax, ctx->nextKnownValidXid)); + fatal = true; + } + } + + /* Check for tuple header corruption */ + if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader) + { + confess(ctx, + psprintf("tuple's header size is %u bytes which is less than the %u byte minimum valid header size", + ctx->tuphdr->t_hoff, + (unsigned) SizeofHeapTupleHeader)); confess() call has two different code styles, first one where psprintf()'s only argument got its own line and second style where psprintf has its own line with the argument. I think the 2nd style is what we do follow & correct, not the former. + if (rel->rd_rel->relam != HEAP_TABLE_AM_OID) + ereport(ERROR, + (errcode(ERRCODE_WRONG_OBJECT_TYPE), + errmsg("\"%s\" is not a heap", + RelationGetRelationName(rel)))); Like elsewhere, can we have errmsg as "only heap AM is supported" and error code is ERRCODE_FEATURE_NOT_SUPPORTED ? That all, for now, apologize for multiple review emails. Regards, Amul
> On Jul 26, 2020, at 9:27 PM, Amul Sul <sulamul@gmail.com> wrote: > > On Tue, Jul 21, 2020 at 2:32 AM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >> [....] >>> >>> + StaticAssertStmt(InvalidOffsetNumber + 1 == FirstOffsetNumber, >>> + "InvalidOffsetNumber >>> increments to FirstOffsetNumber"); >>> >>> If you are going to rely on this property, I agree that it is good to >>> check it. But it would be better to NOT rely on this property, and I >>> suspect the code can be written quite cleanly without relying on it. >>> And actually, that's what you did, because you first set ctx.offnum = >>> InvalidOffsetNumber but then just after that you set ctx.offnum = 0 in >>> the loop initializer. So AFAICS the first initializer, and the static >>> assert, are pointless. >> >> Ah, right you are. Removed. >> > > I can see the same assert and the unnecessary assignment in v12-0002, is that > the same thing that is supposed to be removed, or am I missing something? That's the same thing. I removed it, but obviously I somehow removed the removal prior to making the patch. My best guess is that I reverted some set of changes that unintentionally included this one. > >> [....] >>> +confess(HeapCheckContext * ctx, char *msg) >>> +TransactionIdValidInRel(TransactionId xid, HeapCheckContext * ctx) >>> +check_tuphdr_xids(HeapTupleHeader tuphdr, HeapCheckContext * ctx) >>> >>> This is what happens when you pgindent without adding all the right >>> things to typedefs.list first ... or when you don't pgindent and have >>> odd ideas about how to indent things. >> >> Hmm. I don't see the three lines of code you are quoting. Which patch is that from? >> > > I think it was the same thing related to my previous suggestion to list new > structures to typedefs.list. V12 has listed new structures but I think there > are still some more adjustments needed in the code e.g. see space between > HeapCheckContext and * (asterisk) that need to be fixed. I am not sure if the > pgindent will do that or not. Hmm. I'm not seeing an example of HeapCheckContext with wrong spacing. Can you provide a file and line number? There was a problem with enum SkipPages. I've added that to the typedefs.list and rerun pgindent. While looking at that, I noticed that the function and variable naming conventions in this patch were irregular, with names like TransactionIdValidInRel (init-caps) and tuple_is_visible (underscores), so I spent some time cleaning that up for v13. > Here are a few more minor comments for the v12-0002 patch & some of them > apply to other patches as well: > > #include "utils/snapmgr.h" > - > +#include "amcheck.h" > > Doesn't seem to be at the correct place -- need to be in sorted order. Fixed. > + if (!PG_ARGISNULL(3)) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("starting block " INT64_FORMAT > + " is out of bounds for relation with no blocks", > + PG_GETARG_INT64(3)))); > + if (!PG_ARGISNULL(4)) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("ending block " INT64_FORMAT > + " is out of bounds for relation with no blocks", > + PG_GETARG_INT64(4)))); > > I think these errmsg() strings also should be in one line. I chose not to do so, because the INT64_FORMAT bit breaks up the text even if placed all on one line. I don't feel strongly about that, though, so I'll join them for v13. 
> + if (fatal) > + { > + if (ctx.toast_indexes) > + toast_close_indexes(ctx.toast_indexes, ctx.num_toast_indexes, > + ShareUpdateExclusiveLock); > + if (ctx.toastrel) > + table_close(ctx.toastrel, ShareUpdateExclusiveLock); > > Toast index and rel closing block style is not the same as at the ending of > verify_heapam(). I've harmonized the two. Thanks for noticing. > + /* If we get this far, we know the relation has at least one block */ > + startblock = PG_ARGISNULL(3) ? 0 : PG_GETARG_INT64(3); > + endblock = PG_ARGISNULL(4) ? ((int64) ctx.nblocks) - 1 : PG_GETARG_INT64(4); > + if (startblock < 0 || endblock >= ctx.nblocks || startblock > endblock) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("block range " INT64_FORMAT " .. " INT64_FORMAT > + " is out of bounds for relation with block count %u", > + startblock, endblock, ctx.nblocks))); > + > ... > ... > + if (startblock < 0) > + startblock = 0; > + if (endblock < 0 || endblock > ctx.nblocks) > + endblock = ctx.nblocks; > > Other than endblock < 0 case This case does not need special checking, either. The combination of checking that startblock >= 0 and that startblock <=endblock already handles it. > , do we really need that? I think due to the above > error check the rest of the cases will not reach this place. We don't need any of that. Removed in v13. > + confess(ctx, psprintf( > + "tuple xmax %u follows last assigned xid %u", > + xmax, ctx->nextKnownValidXid)); > + fatal = true; > + } > + } > + > + /* Check for tuple header corruption */ > + if (ctx->tuphdr->t_hoff < SizeofHeapTupleHeader) > + { > + confess(ctx, > + psprintf("tuple's header size is %u bytes which is less than the %u > byte minimum valid header size", > + ctx->tuphdr->t_hoff, > + (unsigned) SizeofHeapTupleHeader)); > > confess() call has two different code styles, first one where psprintf()'s only > argument got its own line and second style where psprintf has its own line with > the argument. I think the 2nd style is what we do follow & correct, not the > former. Ok, standardized in v13. > + if (rel->rd_rel->relam != HEAP_TABLE_AM_OID) > + ereport(ERROR, > + (errcode(ERRCODE_WRONG_OBJECT_TYPE), > + errmsg("\"%s\" is not a heap", > + RelationGetRelationName(rel)))); > > Like elsewhere, can we have errmsg as "only heap AM is supported" and error > code is ERRCODE_FEATURE_NOT_SUPPORTED ? I'm indifferent about that change. Done for v13. > That all, for now, apologize for multiple review emails. Not at all! I appreciate all the reviews. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Mon, Jul 20, 2020 at 5:02 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > I've made the options 'all-visible', 'all-frozen', and 'none'. It defaults to 'none'. That looks nice. > > I guess that > > could still be expensive if there's a lot of them, but needing > > ShareUpdateExclusiveLock rather than only AccessShareLock is a little > > unfortunate. > > I welcome strategies that would allow for taking a lesser lock. I guess I'm not seeing why you need any particular strategy here. Say that at the beginning you note the starting relfrozenxid of the table -- I think I would lean toward just ignoring datfrozenxid and the cluster-wide value completely. You also note the current value of the transaction ID counter. Those are the two ends of the acceptable range. Let's first consider the oldest acceptable XID, bounded by relfrozenxid. If you see a value that is older than the relfrozenxid value that you noted at the start, it is definitely invalid. If you see a newer value, it could still be older than the table's current relfrozenxid, but that doesn't seem very worrisome. If the user vacuumed the table while they were running this tool, they can always run the tool again afterward if they wish. Forcing the vacuum to wait by taking ShareUpdateExclusiveLock doesn't actually solve anything anyway: you STILL won't notice any problems the vacuum introduces, and in fact you are now GUARANTEED not to notice them, plus now the vacuum happens later. Now let's consider the newest acceptable XID, bounded by the value of the transaction ID counter. Any time you see a newer XID than the last value of the transaction ID counter that you observed, you go observe it again. If the value from the table still looks invalid, then you complain about it. Either way, you remember the new observation and check future tuples against that value. I think the patch is already doing this anyway; if it weren't, you'd need an even stronger lock, one sufficient to prevent any insert/update/delete activity on the table altogether. Maybe I'm just being dense here -- exactly what problem are you worried about? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
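A rough sketch of the two-bound validation strategy described above; the field names are illustrative rather than taken from the patch, and the next-XID value is assumed to have been captured once when the scan starts:

    /*
     * Sketch: validate an observed XID against the range noted when the scan
     * started.  Anything older than the noted relfrozenxid is invalid;
     * anything at or beyond the noted next-XID triggers one re-read of the
     * transaction ID counter before we complain, since new XIDs are being
     * assigned concurrently.
     */
    static bool
    xid_in_acceptable_range(TransactionId xid, HeapCheckContext *ctx)
    {
        if (TransactionIdPrecedes(xid, ctx->relfrozenxid))
            return false;       /* older than relfrozenxid: corrupt */

        if (!TransactionIdPrecedes(xid, ctx->next_xid))
        {
            /* possibly a freshly assigned XID; re-observe the counter */
            ctx->next_xid = XidFromFullTransactionId(ReadNextFullTransactionId());
            if (!TransactionIdPrecedes(xid, ctx->next_xid))
                return false;   /* still newer than any assigned XID */
        }
        return true;
    }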
On Mon, Jul 27, 2020 at 1:02 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > Not at all! I appreciate all the reviews. Reviewing 0002, reading through verify_heapam.c: +typedef enum SkipPages +{ + SKIP_ALL_FROZEN_PAGES, + SKIP_ALL_VISIBLE_PAGES, + SKIP_PAGES_NONE +} SkipPages; This looks inconsistent. Maybe just start them all with SKIP_PAGES_. + if (PG_ARGISNULL(0)) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("missing required parameter for 'rel'"))); This doesn't look much like other error messages in the code. Do something like git grep -A4 PG_ARGISNULL | grep -A3 ereport and study the comparables. + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("unrecognized parameter for 'skip': %s", skip), + errhint("please choose from 'all-visible', 'all-frozen', or 'none'"))); Same problem. Check pg_prewarm's handling of the prewarm type, or EXPLAIN's handling of the FORMAT option, or similar examples. Read the message style guidelines concerning punctuation of hint and detail messages. + * Bugs in pg_upgrade are reported (see commands/vacuum.c circa line 1572) + * to have sometimes rendered the oldest xid value for a database invalid. + * It seems unwise to report rows as corrupt for failing to be newer than + * a value which itself may be corrupt. We instead use the oldest xid for + * the entire cluster, which must be at least as old as the oldest xid for + * our database. This kind of reference to another comment will not age well; line numbers and files change a lot. But I think the right thing to do here is just rely on relfrozenxid and relminmxid. If the table is inconsistent with those, then something needs fixing. datfrozenxid and the cluster-wide value can look out for themselves. The corruption detector shouldn't be trying to work around any bugs in setting relfrozenxid itself; such problems are arguably precisely what we're here to find. +/* + * confess + * + * Return a message about corruption, including information + * about where in the relation the corruption was found. + * + * The msg argument is pfree'd by this function. + */ +static void +confess(HeapCheckContext *ctx, char *msg) Contrary to what the comments say, the function doesn't return a message about corruption or anything else. It returns void. I don't really like the name, either. I get that it's probably inspired by Perl, but I think it should be given a less-clever name like report_corruption() or something. + * corrupted table from using workmem worth of memory building up the This kind of thing destroys grep-ability. If you're going to refer to work_mem, you gotta spell it the same way we do everywhere else. + * Helper function to construct the TupleDesc needed by verify_heapam. Instead of saying it's the TupleDesc somebody needs, how about saying that it's the TupleDesc that we'll use to report problems that we find while scanning the heap, or something like that? + * Given a TransactionId, attempt to interpret it as a valid + * FullTransactionId, neither in the future nor overlong in + * the past. Stores the inferred FullTransactionId in *fxid. It really doesn't, because there's no such thing as 'fxid' referenced anywhere here. You should really make the effort to proofread your patches before posting, and adjust comments and so on as you go. Otherwise reviewing takes longer, and if you keep introducing new stuff like this as you fix other stuff, you can fail to ever produce a committable patch. + * Determine whether tuples are visible for verification. 
Similar to + * HeapTupleSatisfiesVacuum, but with critical differences. Not accurate, because it also reports problems, which is not mentioned anywhere in the function header comment that purports to be a detailed description of what the function does. + else if (TransactionIdIsCurrentTransactionId(raw_xmin)) + return true; /* insert or delete in progress */ + else if (TransactionIdIsInProgress(raw_xmin)) + return true; /* HEAPTUPLE_INSERT_IN_PROGRESS */ + else if (!TransactionIdDidCommit(raw_xmin)) + { + return false; /* HEAPTUPLE_DEAD */ + } One of these cases is not punctuated like the others. + pstrdup("heap tuple with XMAX_IS_MULTI is neither LOCKED_ONLY nor has a valid xmax")); 1. I don't think that's very grammatical. 2. Why abbreviate HEAP_XMAX_IS_MULTI to XMAX_IS_MULTI and HEAP_XMAX_IS_LOCKED_ONLY to LOCKED_ONLY? I don't even think you should be referencing C constant names here at all, and if you are I don't think you should abbreviate, and if you do abbreviate I don't think you should omit different numbers of words depending on which constant it is. I wonder what the intended division of responsibility is here, exactly. It seems like you've ended up with some sanity checks in check_tuple() before tuple_is_visible() is called, and others in tuple_is_visible() proper. As far as I can see the comments don't really discuss the logic behind the split, but there's clearly a close relationship between the two sets of checks, even to the point where you have "heap tuple with XMAX_IS_MULTI is neither LOCKED_ONLY nor has a valid xmax" in tuple_is_visible() and "tuple xmax marked incompatibly as keys updated and locked only" in check_tuple(). Now, those are not the same check, but they seem like closely related things, so it's not ideal that they happen in different functions with differently-formatted messages to report problems and no explanation of why it's different. I think it might make sense here to see whether you could either move more stuff out of tuple_is_visible(), so that it really just checks whether the tuple is visible, or move more stuff into it, so that it has the job not only of checking whether we should continue with checks on the tuple contents but also complaining about any other visibility problems. Or if neither of those make sense then there should be a stronger attempt to rationalize in the comments what checks are going where and for what reason, and also a stronger attempt to rationalize the message wording. + curchunk = DatumGetInt32(fastgetattr(toasttup, 2, + ctx->toast_rel->rd_att, &isnull)); Should we be worrying about the possibility of fastgetattr crapping out if the TOAST tuple is corrupted? + if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len) + { + confess(ctx, + psprintf("tuple attribute should start at offset %u, but tuple length is only %u", + ctx->tuphdr->t_hoff + ctx->offset, ctx->lp_len)); + return false; + } + + /* Skip null values */ + if (infomask & HEAP_HASNULL && att_isnull(ctx->attnum, ctx->tuphdr->t_bits)) + return true; + + /* Skip non-varlena values, but update offset first */ + if (thisatt->attlen != -1) + { + ctx->offset = att_align_nominal(ctx->offset, thisatt->attalign); + ctx->offset = att_addlength_pointer(ctx->offset, thisatt->attlen, + tp + ctx->offset); + return true; + } This looks like it's not going to complain about a fixed-length attribute that overruns the tuple length. There's code further down that handles that case for a varlena attribute, but there's nothing comparable for the fixed-length case. 
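For illustration, the fixed-length branch could verify that the attribute fits before advancing, along these lines (a sketch reusing the field names from the quoted hunk, not proposed patch text):

    /* Skip non-varlena values, but verify that they fit inside the tuple */
    if (thisatt->attlen != -1)
    {
        ctx->offset = att_align_nominal(ctx->offset, thisatt->attalign);
        if (ctx->tuphdr->t_hoff + ctx->offset + thisatt->attlen > ctx->lp_len)
        {
            confess(ctx,
                    psprintf("fixed-length attribute of length %d ends beyond the tuple length %u",
                             thisatt->attlen, ctx->lp_len));
            return false;
        }
        ctx->offset = att_addlength_pointer(ctx->offset, thisatt->attlen,
                                            tp + ctx->offset);
        return true;
    }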
+ confess(ctx, + psprintf("%s toast at offset %u is unexpected", + va_tag == VARTAG_INDIRECT ? "indirect" : + va_tag == VARTAG_EXPANDED_RO ? "expanded" : + va_tag == VARTAG_EXPANDED_RW ? "expanded" : + "unexpected", + ctx->tuphdr->t_hoff + ctx->offset)); I suggest "unexpected TOAST tag %d", without trying to convert to a string. Such a conversion will likely fail in the case of genuine corruption, and isn't meaningful even if it works. Again, let's try to standardize terminology here: most of the messages in this function are now of the form "tuple attribute %d has some problem" or "attribute %d has some problem", but some have neither. Since we're separately returning attnum I don't see why it should be in the message, and if we weren't separately returning attnum then it ought to be in the message the same way all the time, rather than sometimes writing "attribute" and other times "tuple attribute". + /* Check relminmxid against mxid, if any */ + xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr); + if (infomask & HEAP_XMAX_IS_MULTI && + MultiXactIdPrecedes(xmax, ctx->relminmxid)) + { + confess(ctx, + psprintf("tuple xmax %u precedes relminmxid %u", + xmax, ctx->relminmxid)); + fatal = true; + } There are checks that an XID is neither too old nor too new, and presumably something similar could be done for MultiXactIds, but here you only check one end of the range. Seems like you should check both. + /* Check xmin against relfrozenxid */ + xmin = HeapTupleHeaderGetXmin(ctx->tuphdr); + if (TransactionIdIsNormal(ctx->relfrozenxid) && + TransactionIdIsNormal(xmin)) + { + if (TransactionIdPrecedes(xmin, ctx->relfrozenxid)) + { + confess(ctx, + psprintf("tuple xmin %u precedes relfrozenxid %u", + xmin, ctx->relfrozenxid)); + fatal = true; + } + else if (!xid_valid_in_rel(xmin, ctx)) + { + confess(ctx, + psprintf("tuple xmin %u follows last assigned xid %u", + xmin, ctx->next_valid_xid)); + fatal = true; + } + } Here you do check both ends of the range, but the comment claims otherwise. Again, please proof-read for this kind of stuff. + /* Check xmax against relfrozenxid */ Ditto here. + psprintf("tuple's header size is %u bytes which is less than the %u byte minimum valid header size", I suggest: tuple data begins at byte %u, but the tuple header must be at least %u bytes + psprintf("tuple's %u byte header size exceeds the %u byte length of the entire tuple", I suggest: tuple data begins at byte %u, but the entire tuple length is only %u bytes + psprintf("tuple's user data offset %u not maximally aligned to %u", I suggest: tuple data begins at byte %u, but that is not maximally aligned Or: tuple data begins at byte %u, which is not a multiple of %u That makes the messages look much more similar to each other grammatically and is more consistent about calling things by the same names. + psprintf("tuple with null values has user data offset %u rather than the expected offset %u", + psprintf("tuple without null values has user data offset %u rather than the expected offset %u", I suggest merging these: tuple data offset %u, but expected offset %u (%u attributes, %s) where %s is either "has nulls" or "no nulls" In fact, aren't several of the above checks redundant with this one? Like, why check for a value less than SizeofHeapTupleHeader or that's not properly aligned first? Just check this straightaway and call it good. + * If we get this far, the tuple is visible to us, so it must not be + * incompatible with our relDesc. 
The natts field could be legitimately + * shorter than rel's natts, but it cannot be longer than rel's natts. This is yet another case where you didn't update the comments. tuple_is_visible() now checks whether the tuple is visible to anyone, not whether it's visible to us, but the comment doesn't agree. In some sense I think this comment is redundant with the previous one anyway, because that one already talks about the tuple being visible. Maybe just write: The tuple is visible, so it must be compatible with the current version of the relation descriptor. It might have fewer columns than are present in the relation descriptor, but it cannot have more. + psprintf("tuple has %u attributes in relation with only %u attributes", + ctx->natts, + RelationGetDescr(ctx->rel)->natts)); I suggest: tuple has %u attributes, but relation has only %u attributes + /* + * Iterate over the attributes looking for broken toast values. This + * roughly follows the logic of heap_deform_tuple, except that it doesn't + * bother building up isnull[] and values[] arrays, since nobody wants + * them, and it unrolls anything that might trip over an Assert when + * processing corrupt data. + */ + ctx->offset = 0; + for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++) + { + if (!check_tuple_attribute(ctx)) + break; + } I think this comment is too wordy. This text belongs in the header comment of check_tuple_attribute(), not at the place where it gets called. Otherwise, as you update what check_tuple_attribute() does, you have to remember to come find this comment and fix it to match, and you might forget to do that. In fact... looks like that already happened, because check_tuple_attribute() now checks more than broken TOAST attributes. Seems like you could just simplify this down to something like "Now check each attribute." Also, you could lose the extra braces. - bt_index_check | relname | relpages + bt_index_check | relname | relpages Don't include unrelated changes in the patch. I'm not really sure that the list of fields you're displaying for each reported problem really makes sense. I think the theory here should be that we want to report the information that the user needs to localize the problem but not everything that they could find out from inspecting the page, and not things that are too specific to particular classes of errors. So I would vote for keeping blkno, offnum, and attnum, but I would lose lp_flags, lp_len, and chunk. lp_off feels like it's a more arguable case: technically, it's a locator for the problem, because it gives you the byte offset within the page, but normally we reference tuples by TID, i.e. (blkno, offset), not byte offset. On balance I'd be inclined to omit it. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
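(To be concrete about the MultiXactId range comment above: what I have in mind is shaped roughly like this, untested, where next_valid_mxid would be a new field caching the most recently assigned MultiXactId, analogous to the next_valid_xid field in the quoted code:

    /* Check xmax against both ends of the valid multixact range */
    if (infomask & HEAP_XMAX_IS_MULTI)
    {
        MultiXactId mxid = HeapTupleHeaderGetRawXmax(ctx->tuphdr);

        if (MultiXactIdPrecedes(mxid, ctx->relminmxid))
        {
            confess(ctx,
                    psprintf("tuple xmax %u precedes relminmxid %u",
                             mxid, ctx->relminmxid));
            fatal = true;
        }
        else if (MultiXactIdPrecedes(ctx->next_valid_mxid, mxid))
        {
            confess(ctx,
                    psprintf("tuple xmax %u follows last assigned multixact %u",
                             mxid, ctx->next_valid_mxid));
            fatal = true;
        }
    }

That keeps the multixact checks symmetrical with the xid checks, which already look at both ends of the range.)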
> On Jul 29, 2020, at 12:52 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jul 20, 2020 at 5:02 PM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >> I've made the options 'all-visible', 'all-frozen', and 'none'. It defaults to 'none'. > > That looks nice. > >>> I guess that >>> could still be expensive if there's a lot of them, but needing >>> ShareUpdateExclusiveLock rather than only AccessShareLock is a little >>> unfortunate. >> >> I welcome strategies that would allow for taking a lesser lock. > > I guess I'm not seeing why you need any particular strategy here. Say > that at the beginning you note the starting relfrozenxid of the table > -- I think I would lean toward just ignoring datfrozenxid and the > cluster-wide value completely. You also note the current value of the > transaction ID counter. Those are the two ends of the acceptable > range. > > Let's first consider the oldest acceptable XID, bounded by > relfrozenxid. If you see a value that is older than the relfrozenxid > value that you noted at the start, it is definitely invalid. If you > see a newer value, it could still be older than the table's current > relfrozenxid, but that doesn't seem very worrisome. If the user > vacuumed the table while they were running this tool, they can always > run the tool again afterward if they wish. Forcing the vacuum to wait > by taking ShareUpdateExclusiveLock doesn't actually solve anything > anyway: you STILL won't notice any problems the vacuum introduces, and > in fact you are now GUARANTEED not to notice them, plus now the vacuum > happens later. > > Now let's consider the newest acceptable XID, bounded by the value of > the transaction ID counter. Any time you see a newer XID than the last > value of the transaction ID counter that you observed, you go observe > it again. If the value from the table still looks invalid, then you > complain about it. Either way, you remember the new observation and > check future tuples against that value. I think the patch is already > doing this anyway; if it weren't, you'd need an even stronger lock, > one sufficient to prevent any insert/update/delete activity on the > table altogether. > > Maybe I'm just being dense here -- exactly what problem are you worried about? Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax committed via TransactionIdDidCommit. I am worried about concurrent truncation of clog entries causing I/O errors on SLRU lookup when performing that check. The three strategies I had for dealing with that were taking the XactTruncationLock (formerly known as CLogTruncationLock, for those reading this thread from the beginning), locking out vacuum, and the idea upthread from Andres about setting PROC_IN_VACUUM and such. Maybe I'm being dense and don't need to worry about this. But I haven't convinced myself of that, yet. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-07-30 13:18:01 -0700, Mark Dilger wrote: > Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax committed via TransactionIdDidCommit. I am worried about concurrent truncation of clog entries causing I/O errors on SLRU lookup when performing that check. The three strategies I had for dealing with that were taking the XactTruncationLock (formerly known as CLogTruncationLock, for those reading this thread from the beginning), locking out vacuum, and the idea upthread from Andres about setting PROC_IN_VACUUM and such. Maybe I'm being dense and don't need to worry about this. But I haven't convinced myself of that, yet. I think it's not at all ok to look in the procarray or clog for xids that are older than what you're announcing you may read. IOW I don't think it's OK to just ignore the problem, or try to work around it by holding XactTruncationLock. Greetings, Andres Freund
On Thu, Jul 30, 2020 at 4:18 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > Maybe I'm just being dense here -- exactly what problem are you worried about? > > Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax committed via TransactionIdDidCommit. I am worried about concurrent truncation of clog entries causing I/O errors on SLRU lookup when performing that check. The three strategies I had for dealing with that were taking the XactTruncationLock (formerly known as CLogTruncationLock, for those reading this thread from the beginning), locking out vacuum, and the idea upthread from Andres about setting PROC_IN_VACUUM and such. Maybe I'm being dense and don't need to worry about this. But I haven't convinced myself of that, yet. I don't get it. If you've already checked that the XIDs are >= relfrozenxid and <= ReadNewFullTransactionId(), then this shouldn't be a problem. It could be, if CLOG is hosed, which is possible, because if the table is corrupted, why shouldn't CLOG also be corrupted? But I'm not sure that's what your concern is here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Jul 30, 2020, at 2:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jul 30, 2020 at 4:18 PM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >>> Maybe I'm just being dense here -- exactly what problem are you worried about? >> >> Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax committed via TransactionIdDidCommit. I am worried about concurrent truncation of clog entries causing I/O errors on SLRU lookup when performing that check. The three strategies I had for dealing with that were taking the XactTruncationLock (formerly known as CLogTruncationLock, for those reading this thread from the beginning), locking out vacuum, and the idea upthread from Andres about setting PROC_IN_VACUUM and such. Maybe I'm being dense and don't need to worry about this. But I haven't convinced myself of that, yet. > > I don't get it. If you've already checked that the XIDs are >= > relfrozenxid and <= ReadNewFullTransactionId(), then this shouldn't be > a problem. It could be, if CLOG is hosed, which is possible, because > if the table is corrupted, why shouldn't CLOG also be corrupted? But > I'm not sure that's what your concern is here. No, that wasn't my concern. I was thinking about CLOG entries disappearing during the scan as a consequence of concurrent vacuums, and the effect that would have on the validity of the cached [relfrozenxid..next_valid_xid] range. In the absence of corruption, I don't immediately see how this would cause any problems. But for a corrupt table, I'm less certain how it would play out. The kind of scenario I'm worried about may not be possible in practice. I think it would depend on how vacuum behaves when scanning a corrupt table that is corrupt in some way that vacuum doesn't notice, and whether vacuum could finish scanning the table with the false belief that it has frozen all tuples with xids less than some cutoff. I thought it would be safer if that kind of thing were not happening during verify_heapam's scan of the table. Even if a careful analysis proved it was not an issue with the current coding of vacuum, I don't think there is any coding convention requiring future versions of vacuum to be hardened against corruption, so I don't see how I can rely on vacuum not causing such problems. I don't think this is necessarily a too-rare-to-care-about type concern, either. If corruption across multiple tables prevents autovacuum from succeeding, and the DBA doesn't get involved in scanning tables for corruption until the lack of successful vacuums impacts the production system, I imagine you could end up with vacuums repeatedly happening (or trying to happen) around the time the DBA is trying to fix tables, or perhaps drop them, or whatever, using verify_heapam for guidance on which tables are corrupted. Anyway, that's what I was thinking. I was imagining that calling TransactionIdDidCommit might keep crashing the backend while the DBA is trying to find and fix corruption, and that could get really annoying. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Jul 30, 2020, at 1:47 PM, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2020-07-30 13:18:01 -0700, Mark Dilger wrote: >> Per tuple, tuple_is_visible() potentially checks whether the xmin or xmax committed via TransactionIdDidCommit. I am worried about concurrent truncation of clog entries causing I/O errors on SLRU lookup when performing that check. The three strategies I had for dealing with that were taking the XactTruncationLock (formerly known as CLogTruncationLock, for those reading this thread from the beginning), locking out vacuum, and the idea upthread from Andres about setting PROC_IN_VACUUM and such. Maybe I'm being dense and don't need to worry about this. But I haven't convinced myself of that, yet. > > I think it's not at all ok to look in the procarray or clog for xids > that are older than what you're announcing you may read. IOW I don't > think it's OK to just ignore the problem, or try to work around it by > holding XactTruncationLock. The current state of the patch is that concurrent vacuums are kept out of the table being checked by means of taking a ShareUpdateExclusive lock on the table being checked. In response to Robert's review, I was contemplating whether that was necessary, but you raise the interesting question of whether it is even sufficient. The logic in verify_heapam is currently relying on the ShareUpdateExclusive lock to prevent any of the xids in the range relfrozenxid..nextFullXid from being invalid arguments to TransactionIdDidCommit. Ignoring whether that is a good choice vis-a-vis performance, is that even a valid strategy? It sounds like you are saying it is not. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 30, 2020 at 6:10 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > No, that wasn't my concern. I was thinking about CLOG entries disappearing during the scan as a consequence of concurrent vacuums, and the effect that would have on the validity of the cached [relfrozenxid..next_valid_xid] range. In the absence of corruption, I don't immediately see how this would cause any problems. But for a corrupt table, I'm less certain how it would play out. Oh, hmm. I wasn't thinking about that problem. I think the only way this can happen is if we read a page and then, before we try to look up the XID, vacuum zooms past, finishes the whole table, and truncates clog. But if that's possible, then it seems like it would be an issue for SELECT as well, and it apparently isn't, or we would've done something about it by now. I think the reason it's not possible is because of the locking rules described in src/backend/storage/buffer/README, which require that you hold a buffer lock until you've determined that the tuple is visible. Since you hold a share lock on the buffer, a VACUUM that hasn't already processed that buffer can't freeze the tuples in it; it would need an exclusive lock on the buffer to do that. Therefore it can't finish and truncate clog either. Now, you raise the question of whether this is still true if the table is corrupt, but I don't really see why that makes any difference. VACUUM is supposed to freeze each page it encounters, to the extent that such freezing is necessary, and with Andres's changes, it's supposed to ERROR out if things are messed up. We can postulate a bug in that logic, but inserting a VACUUM-blocking lock into this tool to guard against a hypothetical vacuum bug seems strange to me. Why would the right solution not be to fix such a bug if and when we find that there is one? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Jul 30, 2020, at 5:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jul 30, 2020 at 6:10 PM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >> No, that wasn't my concern. I was thinking about CLOG entries disappearing during the scan as a consequence of concurrent vacuums, and the effect that would have on the validity of the cached [relfrozenxid..next_valid_xid] range. In the absence of corruption, I don't immediately see how this would cause any problems. But for a corrupt table, I'm less certain how it would play out. > > Oh, hmm. I wasn't thinking about that problem. I think the only way > this can happen is if we read a page and then, before we try to look > up the XID, vacuum zooms past, finishes the whole table, and truncates > clog. But if that's possible, then it seems like it would be an issue > for SELECT as well, and it apparently isn't, or we would've done > something about it by now. I think the reason it's not possible is > because of the locking rules described in > src/backend/storage/buffer/README, which require that you hold a > buffer lock until you've determined that the tuple is visible. Since > you hold a share lock on the buffer, a VACUUM that hasn't already > processed that buffer can't freeze the tuples in it; it would need an > exclusive lock on the buffer to do that. Therefore it can't finish and > truncate clog either. > > Now, you raise the question of whether this is still true if the table > is corrupt, but I don't really see why that makes any difference. > VACUUM is supposed to freeze each page it encounters, to the extent > that such freezing is necessary, and with Andres's changes, it's > supposed to ERROR out if things are messed up. We can postulate a bug > in that logic, but inserting a VACUUM-blocking lock into this tool to > guard against a hypothetical vacuum bug seems strange to me. Why would > the right solution not be to fix such a bug if and when we find that > there is one? Since I can't think of a plausible concrete example of corruption which would elicit the problem I was worrying about, I'll withdraw the argument. But that leaves me wondering about a comment that Andres made upthread: > On Apr 20, 2020, at 12:42 PM, Andres Freund <andres@anarazel.de> wrote: > I don't think random interspersed uses of CLogTruncationLock are a good > idea. If you move to only checking visibility after tuple fits into > [relfrozenxid, nextXid), then you don't need to take any locks here, as > long as a lock against vacuum is taken (which I think this should do > anyway). — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 30, 2020 at 9:38 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > On Jul 30, 2020, at 5:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jul 30, 2020 at 6:10 PM Mark Dilger > Since I can't think of a plausible concrete example of corruption which would elicit the problem I was worrying about, I'll withdraw the argument. But that leaves me wondering about a comment that Andres made upthread: > > > On Apr 20, 2020, at 12:42 PM, Andres Freund <andres@anarazel.de> wrote: > > > I don't think random interspersed uses of CLogTruncationLock are a good > > idea. If you move to only checking visibility after tuple fits into > > [relfrozenxid, nextXid), then you don't need to take any locks here, as > > long as a lock against vacuum is taken (which I think this should do > > anyway). The version of the patch I'm looking at doesn't seem to mention CLogTruncationLock at all, so I'm confused about the comment. But what is it that you are wondering about exactly? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Jul 31, 2020, at 5:02 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jul 30, 2020 at 9:38 PM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >>> On Jul 30, 2020, at 5:53 PM, Robert Haas <robertmhaas@gmail.com> wrote: >>> On Thu, Jul 30, 2020 at 6:10 PM Mark Dilger >> Since I can't think of a plausible concrete example of corruption which would elicit the problem I was worrying about, I'll withdraw the argument. But that leaves me wondering about a comment that Andres made upthread: >> >>> On Apr 20, 2020, at 12:42 PM, Andres Freund <andres@anarazel.de> wrote: >> >>> I don't think random interspersed uses of CLogTruncationLock are a good >>> idea. If you move to only checking visibility after tuple fits into >>> [relfrozenxid, nextXid), then you don't need to take any locks here, as >>> long as a lock against vacuum is taken (which I think this should do >>> anyway). > > The version of the patch I'm looking at doesn't seem to mention > CLogTruncationLock at all, so I'm confused about the comment. But what > is it that you are wondering about exactly? In earlier versions of the patch, I was guarding (perhaps unnecessarily) against clog truncation, (perhaps incorrectly) by taking the CLogTruncationLock (aka XactTruncationLock). I thought Andres was arguing that such locks were not necessary "as long as a lock against vacuum is taken". That's what motivated me to remove the clog locking business and put in the ShareUpdateExclusive lock. I don't want to remove the ShareUpdateExclusive lock from the patch without perhaps a clarification from Andres on the subject. His recent reply upthread seems to still support the idea that some kind of protection is required: > I think it's not at all ok to look in the procarray or clog for xids > that are older than what you're announcing you may read. IOW I don't > think it's OK to just ignore the problem, or try to work around it by > holding XactTruncationLock. I don't understand that paragraph fully, in particular the part about "than what you're announcing you may read", since the cached value of relfrozenxid is not announced; we're just assuming that as long as vacuum cannot advance it during our scan, that we should be safe checking whether xids newer than that value (and not in the future) were committed. Andres? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-07-31 08:51:50 -0700, Mark Dilger wrote: > In earlier versions of the patch, I was guarding (perhaps > unnecessarily) against clog truncation, (perhaps incorrectly) by > taking the CLogTruncationLock (aka XactTruncationLock). I thought > Andres was arguing that such locks were not necessary "as long as a > lock against vacuum is taken". That's what motivated me to remove the > clog locking business and put in the ShareUpdateExclusive lock. I > don't want to remove the ShareUpdateExclusive lock from the patch > without perhaps a clarification from Andres on the subject. His > recent reply upthread seems to still support the idea that some kind > of protection is required: I'm not sure what I was thinking "back then", but right now I'd argue that the best lock against vacuum isn't a SUE, but announcing the correct ->xmin, so you can be sure that clog entries won't be yanked out from under you. Potentially with the right flag sets to avoid old enough tuples being pruned. > > I think it's not at all ok to look in the procarray or clog for xids > > that are older than what you're announcing you may read. IOW I don't > > think it's OK to just ignore the problem, or try to work around it by > > holding XactTruncationLock. > > I don't understand that paragraph fully, in particular the part about > "than what you're announcing you may read", since the cached value of > relfrozenxid is not announced; we're just assuming that as long as > vacuum cannot advance it during our scan, that we should be safe > checking whether xids newer than that value (and not in the future) > were committed. With 'announcing' I mean using the normal mechanism for avoiding the clog being truncated for values one might look up. Which is announcing the oldest xid one may look up in PGXACT->xmin. Greetings, Andres Freund
On Fri, Jul 31, 2020 at 12:33 PM Andres Freund <andres@anarazel.de> wrote: > I'm not sure what I was thinking "back then", but right now I'd argue > that the best lock against vacuum isn't a SUE, but announcing the > correct ->xmin, so you can be sure that clog entries won't be yanked out > from under you. Potentially with the right flag sets to avoid old enough > tuples eing pruned. Suppose we don't even do anything special in terms of advertising xmin. What can go wrong? To have a problem, we've got to be running concurrently with a vacuum that truncates clog. The clog truncation must happen before our XID lookups, but vacuum has to remove the XIDs from the heap before it can truncate. So we have to observe the XIDs before vacuum removes them, but then vacuum has to truncate before we look them up. But since we observe them and look them up while holding a ShareLock on the buffer, this seems impossible. What's the flaw in this argument? If we do need to do something special in terms of advertising xmin, how would you do it? Normally it happens by registering a snapshot, but here all we would have is an XID; specifically, the value of relfrozenxid that we observed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-07-31 12:42:51 -0400, Robert Haas wrote: > On Fri, Jul 31, 2020 at 12:33 PM Andres Freund <andres@anarazel.de> wrote: > > I'm not sure what I was thinking "back then", but right now I'd argue > > that the best lock against vacuum isn't a SUE, but announcing the > > correct ->xmin, so you can be sure that clog entries won't be yanked out > > from under you. Potentially with the right flag sets to avoid old enough > > tuples eing pruned. > > Suppose we don't even do anything special in terms of advertising > xmin. What can go wrong? To have a problem, we've got to be running > concurrently with a vacuum that truncates clog. The clog truncation > must happen before our XID lookups, but vacuum has to remove the XIDs > from the heap before it can truncate. So we have to observe the XIDs > before vacuum removes them, but then vacuum has to truncate before we > look them up. But since we observe them and look them up while holding > a ShareLock on the buffer, this seems impossible. What's the flaw in > this argument? The page could have been wrongly marked all-frozen. There could be interactions between heap and toast table that are checked. Other bugs could apply, like a broken hot chain or such. > If we do need to do something special in terms of advertising xmin, > how would you do it? Normally it happens by registering a snapshot, > but here all we would have is an XID; specifically, the value of > relfrozenxid that we observed. An appropriate procarray or snapmgr function would probably suffice? Greetings, Andres Freund
On Fri, Jul 31, 2020 at 3:05 PM Andres Freund <andres@anarazel.de> wrote: > The page could have been wrongly marked all-frozen. There could be > interactions between heap and toast table that are checked. Other bugs > could apply, like a broken hot chain or such. OK, at least the first two of these do sound like problems. Not sure about the third one. > > If we do need to do something special in terms of advertising xmin, > > how would you do it? Normally it happens by registering a snapshot, > > but here all we would have is an XID; specifically, the value of > > relfrozenxid that we observed. > > An appropriate procarray or snapmgr function would probably suffice? Not sure; I guess that'll need some investigation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Jul 30, 2020, at 10:59 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > + curchunk = DatumGetInt32(fastgetattr(toasttup, 2, > + ctx->toast_rel->rd_att, &isnull)); > > Should we be worrying about the possibility of fastgetattr crapping > out if the TOAST tuple is corrupted? I think we should, but I'm not sure we should be worrying about it at this location. If the toast index is corrupt, systable_getnext_ordered could trip over the index corruption in the process of retrieving the toast tuple, so checking the toast tuple only helps if the toast index does not cause a crash first. I think the toast index should be checked before this point, ala verify_nbtree, so that we don't need to worry about that here. It might also make more sense to verify the toast table ala verify_heapam prior to here, so we don't have to worry about that here either. But that raises questions about whose responsibility this all is. If verify_heapam checks the toast table and toast index before the main table, that takes care of it, but makes a mess of the idea of verify_heapam taking a start and end block, since verifying the toast index is an all or nothing proposition, not something to be done in incremental pieces. If we leave verify_heapam as it is, then it is up to the caller to check the toast before the main relation, which is more flexible, but is more complicated and requires the user to remember to do it. We could split the difference by having verify_heapam do nothing about toast, leaving it up to the caller, but make pg_amcheck handle it by default, making it easier for users to not think about the issue. Users who want to do incremental checking could still keep track of the chunks that have already been checked, not just for the main relation, but for the toast relation, too, and give start and end block arguments to verify_heapam for the toast table check and then again for the main table check. That doesn't fix the question of incrementally checking the index, though. Looking at it a slightly different way, I think what is being checked at the point in the code you mention is the logical structure of the toasted value related to the current main table tuple, not the lower level tuple structure of the toast table. We already have a function for checking a heap, namely verify_heapam, and we (or the caller, really) should be using that. The clean way to do things is verify_heapam(toast_rel), verify_btreeam(toast_idx), verify_heapam(main_rel), and then depending on how fast and loose you want to be, you can use the start and end block arguments, which are inherently a bit half-baked, given the lack of any way to be sure you check precisely the right range of blocks, and also you can be fast and loose about skipping the index check or not, as you see fit. Thoughts? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Jul 27, 2020 at 10:02 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > I'm indifferent about that change. Done for v13. Moving on with verification of the same index in the event of B-Tree index corruption is a categorical mistake. verify_nbtree.c was simply not designed to work that way. You were determined to avoid allowing any behavior that can result in a backend crash in the event of corruption, but this design will defeat various measures I took to avoid crashing with corrupt data (e.g. in commit a9ce839a313). What's the point in not just giving up on the index (though not necessarily the table or other indexes) at the first sign of trouble, anyway? It makes sense for the heap structure, but not for indexes. -- Peter Geoghegan
On Thu, Jul 30, 2020 at 10:59 AM Robert Haas <robertmhaas@gmail.com> wrote: > I don't really like the name, either. I get that it's probably > inspired by Perl, but I think it should be given a less-clever name > like report_corruption() or something. +1 -- confess() is an awful name for this. -- Peter Geoghegan
> On Aug 2, 2020, at 8:59 PM, Peter Geoghegan <pg@bowt.ie> wrote: > > What's the point in not just giving up on the index (though not > necessarily the table or other indexes) at the first sign of trouble, > anyway? It makes sense for the heap structure, but not for indexes. The case that came to mind was an index broken by a glibc update with breaking changes to the collation sort order underlying the index. If the breaking change has already been live in production for quite some time before a DBA notices, they might want to quantify how broken the index has been for the last however many days, not just drop and recreate the index. I'm happy to drop that from the patch, though. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Aug 2, 2020, at 9:13 PM, Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Jul 30, 2020 at 10:59 AM Robert Haas <robertmhaas@gmail.com> wrote: >> I don't really like the name, either. I get that it's probably >> inspired by Perl, but I think it should be given a less-clever name >> like report_corruption() or something. > > +1 -- confess() is an awful name for this. I was trying to limit unnecessary whitespace changes. s/ereport/econfess/ leaves the function name nearly the same length such that the following lines of indented error text don't usually get moved by pgindent. Given the unpopularity of the name, it's not worth it, so I'll go with Robert's report_corruption, instead. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Aug 3, 2020 at 12:00 AM Peter Geoghegan <pg@bowt.ie> wrote: > Moving on with verification of the same index in the event of B-Tree > index corruption is a categorical mistake. verify_nbtree.c was simply > not designed to work that way. > > You were determined to avoid allowing any behavior that can result in > a backend crash in the event of corruption, but this design will > defeat various measures I took to avoid crashing with corrupt data > (e.g. in commit a9ce839a313). > > What's the point in not just giving up on the index (though not > necessarily the table or other indexes) at the first sign of trouble, > anyway? It makes sense for the heap structure, but not for indexes. I agree that there's a serious design problem with Mark's patch in this regard, but I disagree that the effort is pointless on its own terms. You're basically postulating that users don't care how corrupt their index is: whether there's one problem or one million problems, it's all the same. If the user presents an index with one million problems and we tell them about one of them, we've done our job and can go home. This doesn't match my experience. When an EDB customer reports corruption, typically one of the first things I want to understand is how widespread the problem is. This same issue came up on the thread about relfrozenxid/relminmxid corruption. If you've got a table with one or two rows where tuple.xmin < relfrozenxid, that's a different kind of problem than if 50% of the tuples in the table have tuple.xmin < relfrozenxid; the latter might well indicate that relfrozenxid value itself is garbage, while the former indicates that a few tuples slipped through the cracks somehow. If you're contemplating a recovery strategy like "nuke the affected tuples from orbit," you really need to understand which of those cases you've got. Granted, this is a bit less important with indexes, because in most cases you're just going to REINDEX. But, even there, the question is not entirely academic. For instance, consider the case of a user whose database crashes and then fails to restart because WAL replay fails. Typically, there is little option here but to run pg_resetwal. At this point, you know that there is some damage, but you don't know how bad it is. If there was little system activity at the time of the crash, there may be only a handful of problems with the database. If there was a heavy OLTP workload running at the time of the crash, with a long checkpoint interval, the problems may be widespread. If the user has done this repeatedly before bothering to contact support, which is more common than you might suppose, the damage may be extremely widespread. Now, you could argue (and not unreasonably) that in any case after something like this happens even once, the user ought to dump and restore to get back to a known good state. However, when the cluster is 10TB in size and there's a $100,000 financial loss for every hour of downtime, the question naturally arises of how urgent that dump and restore is. Can we wait until our next maintenance window? Can we at least wait until off hours? Being able to tell the user whether they've got a tiny bit of corruption or a whole truckload of corruption can enable them to make better decisions in such cases, or at least more educated ones. Now, again, just replacing ereport(ERROR, ...) with something else that does not abort the rest of the checks is clearly not OK. I don't endorse that approach, or anything like it. 
But neither do I accept the argument that it would be useless to report all the errors even if we could do so safely. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Aug 3, 2020 at 11:02 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > I was trying to limit unnecessary whitespace changes. s/ereport/econfess/ leaves the function name nearly the same length such that the following lines of indented error text don't usually get moved by pgindent. Given the unpopularity of the name, it's not worth it, so I'll go with Robert's report_corruption, instead. Yeah, that's not really a good reason for something like that. I think what you should do is drop the nbtree portion of this for now; the length of the name then doesn't even matter at all, because all the code in which this is used will be new code. Even if we were churning existing code, mechanical stuff like this isn't really a huge problem most of the time, but there's no need for that here. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Aug 3, 2020 at 8:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > I agree that there's a serious design problem with Mark's patch in > this regard, but I disagree that the effort is pointless on its own > terms. You're basically postulating that users don't care how corrupt > their index is: whether there's one problem or one million problems, > it's all the same. If the user presents an index with one million > problems and we tell them about one of them, we've done our job and > can go home. It's not so much that I think that users won't care about whether any given index is a bit corrupt or very corrupt. It's more like I don't think that it's worth the eye-watering complexity, especially without a real concrete goal in mind. "Counting all the errors, not just the first" sounds like a tractable goal for the heap/table structure, but it's just not like that with indexes. If you really wanted to do this, you'd have to describe a practical scenario under which it made sense to soldier on, where we'd definitely be able to count the number of problems in a meaningful way, without much risk of either massively overcounting or undercounting inconsistencies. Consider how the search in verify_nbtree.c actually works at a high level. If you thoroughly corrupted one B-Tree leaf page (let's say you replaced it with an all-zero page image), all pages to the right of the page would be fundamentally inaccessible to the left-to-right level search that is coordinated within bt_check_level_from_leftmost(). And yet, most real index scans can still be expected to work. How do you know to skip past that one corrupt leaf page (by going back to the parent to get the next sibling leaf page) during index verification? That's what it would take to do this in the general case, I guess. More fundamentally, I wonder how many inconsistencies one should imagine that this index has, before we even get into talking about the implementation. -- Peter Geoghegan
On Mon, Aug 3, 2020 at 1:16 PM Peter Geoghegan <pg@bowt.ie> wrote: > If you really wanted to do this, > you'd have to describe a practical scenario under which it made sense > to soldier on, where we'd definitely be able to count the number of > problems in a meaningful way, without much risk of either massively > overcounting or undecounting inconsistencies. I completely agree. You have to have a careful plan to make this sort of thing work - you want to skip checking the things that are dependent on the part already determined to be bad, without skipping everything. You need a strategy for where and how to restart checking, first bypassing whatever needs to be skipped. > Consider how the search in verify_ntree.c actually works at a high > level. If you thoroughly corrupted one B-Tree leaf page (let's say you > replaced it with an all-zero page image), all pages to the right of > the page would be fundamentally inaccessible to the left-to-right > level search that is coordinated within > bt_check_level_from_leftmost(). And yet, most real index scans can > still be expected to work. How do you know to skip past that one > corrupt leaf page (by going back to the parent to get the next sibling > leaf page) during index verification? That's what it would take to do > this in the general case, I guess. In that particular example, you would want the function that verifies that page to return some indicator. If it finds that two keys in the page are out-of-order, it tells the caller that it can still follow the right-link. But if it finds that the whole page is garbage, then it tells the caller that it doesn't have a valid right-link and the caller's got to do something else, like give up on the rest of the checks or (better) try to recover a pointer to the next page from the parent. > More fundamentally, I wonder how > many inconsistencies one should imagine that this index has, before we > even get into talking about the implementation. I think we should try not to imagine anything in particular. Just to be clear, I am not trying to knock what you have; I know it was a lot of work to create and it's a huge improvement over having nothing. But in my mind, a perfect tool would do just what a human being would do if investigating manually: assume initially that you know nothing - the index might be totally fine, mildly corrupted in a very localized way, completely hosed, or anything in between. And it would systematically try to track that down by traversing the usable pointers that it has until it runs out of things to do. It does not seem impossible to build a tool that would allow us to take a big index and overwrite a random subset of pages with garbage data and have the tool tell us about all the bad pages that are still reachable from the root by any path. If you really wanted to go crazy with it, you could even try to find the bad pages that are not reachable from the root, by doing a pass after the fact over all the pages that you didn't otherwise reach. It would be a lot of work to build something like that and maybe not the best use of time, but if I got to wave tools into existence using my magic wand, I think that would be the gold standard. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
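(In code terms, the indicator could be as simple as something like this, with names made up purely for illustration:

    typedef enum
    {
        BTPAGE_CHECK_OK,        /* page sane enough; keep following the right-link */
        BTPAGE_CHECK_BROKEN     /* page too damaged; re-find the next page via the
                                 * parent, or give up on the rest of this level */
    } BtPageCheckResult;

The page-level checker returns one of these, and the level-walking caller decides whether to keep going, climb back up to the parent, or stop.)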
On Tue, Aug 4, 2020 at 7:59 AM Robert Haas <robertmhaas@gmail.com> wrote: > I think we should try not to imagine anything in particular. Just to > be clear, I am not trying to knock what you have; I know it was a lot > of work to create and it's a huge improvement over having nothing. But > in my mind, a perfect tool would do just what a human being would do > if investigating manually: assume initially that you know nothing - > the index might be totally fine, mildly corrupted in a very localized > way, completely hosed, or anything in between. And it would > systematically try to track that down by traversing the usable > pointers that it has until it runs out of things to do. It does not > seem impossible to build a tool that would allow us to take a big > index and overwrite a random subset of pages with garbage data and > have the tool tell us about all the bad pages that are still reachable > from the root by any path. If you really wanted to go crazy with it, > you could even try to find the bad pages that are not reachable from > the root, by doing a pass after the fact over all the pages that you > didn't otherwise reach. It would be a lot of work to build something > like that and maybe not the best use of time, but if I got to wave > tools into existence using my magic wand, I think that would be the > gold standard. I guess that might be true. With indexes you tend to have redundancy in how relationships among pages are described. So you have siblings whose pointers must be in agreement (left points to right, right points to left), and it's not clear which one you should trust when they don't agree. It's not like simple heuristics get you all that far. I really can't think of a good one, and detecting corruption should mean detecting truly exceptional cases. I guess you could build a model based on Bayesian methods, or something like that. But that is very complicated, and only used when you actually have corruption -- which is presumably extremely rare in reality. That's very unappealing as a project. I have always believed that the big problem is not "known unknowns". Rather, I think that the problem is "unknown unknowns". I accept that you have a point, especially when it comes to heap checking, but even there the most important consideration should be to make corruption detection thorough and cheap. The vast vast majority of databases do not have any corruption at any given time. You're not searching for a needle in a haystack; you're searching for a needle in many many haystacks within a field filled with haystacks, which taken together probably contain no needles at all. (OTOH, once you find one needle all bets are off, and you could very well go on to find a huge number of them.) -- Peter Geoghegan
On Fri, Jul 31, 2020 at 12:33 PM Andres Freund <andres@anarazel.de> wrote: > I'm not sure what I was thinking "back then", but right now I'd argue > that the best lock against vacuum isn't a SUE, but announcing the > correct ->xmin, so you can be sure that clog entries won't be yanked out > from under you. Potentially with the right flag sets to avoid old enough > tuples eing pruned. I was just thinking about this some more (and talking it over with Mark) and I think this might actually be a really bad idea. One problem with it is that it means that the oldest-xmin value can go backward, which is something that I think has caused us some problems before. There are some other cases where it can happen, and I'm not sure that there's any necessarily fatal problem with doing it in this case, but it would definitely be a shame if this contrib module broke something for core in a way that was hard to fix. But let's leave that aside and suppose that there is no fatal problem there. Essentially what we're talking about here is advertising the table's relfrozenxid as our xmin. How old is that likely to be? Maybe pretty old. The default value of vacuum_freeze_table_age is 150 million transactions, and that's just the trigger to start vacuuming; the actual value of age(relfrozenxid) could easily be higher than that. But even if it's only a fraction of that, it's still pretty bad. Advertising an xmin half that old (75 million transactions) is equivalent to keeping a snapshot open for an amount of time equal to however long it takes you to burn through 75 million XIDs. For instance, if you burn 10 million XIDs/hour, that's the equivalent of keeping a snapshot open for 7.5 hours. In other words, it's quite likely that doing this is going to make VACUUM (and HOT pruning) drastically less effective throughout the entire database cluster. To me, this seems a lot worse than just taking ShareUpdateExclusiveLock on the table. After all, ShareUpdateExclusiveLock will prevent VACUUM from running on that table, but it only affects that one table rather than the whole cluster, and it "only" stops VACUUM from running, which is still better than having it do lots of I/O but not clean anything up. I think I see another problem with this approach, too: it's racey. If some other process has entered vac_update_datfrozenxid() and has gotten past the calls to GetOldestXmin() and GetOldestMultiXactId(), and we then advertise an older xmin (and I guess also oldestMXact) it can still go on to update datfrozenxid/datminmxid and then truncate the SLRUs. Even holding XactTruncationLock is insufficient to protect against this race condition, and there doesn't seem to be any other obvious approach, either. So I would like to back up a minute and lay out the possible solutions as I understand them. The specific problem here I'm talking about here is: how do we keep from looking up an XID or MXID whose information might have been truncated away from the relevant SLRU? 1. Take a ShareUpdateExclusiveLock on the table. This prevents VACUUM from running concurrently on this table (which sucks), but that for sure guarantees that the table's relfrozenxid and relminmxid can't advance, which precludes a concurrent CLOG truncation. 2. Advertise an older xmin and minimum MXID. See above. 3. Acquire XactTruncationLock for each lookup, like pg_xact_status(). 
One downside here is a lot of extra lock acquisitions, but we can mitigate that to some degree by caching the results of lookups, and by not doing it for XIDs that are newer than our advertised xmin (which must be OK) or at least as old as the newest XID we previously discovered to be unsafe to look up (because those must not be OK either). The problem case is a table with lots of different XIDs that are all new enough to look up but older than our xmin, e.g. a table populated using many single-row inserts. But even if we hit this case, how bad is it really? I don't think XactTruncationLock is particularly hot, so maybe it just doesn't matter very much. We could contend against other sessions checking other tables, or against widespread use of pg_xact_status(), but I think that's about it. Another downside of this approach is that I'm not sure it does anything to help us with the MXID case; fixing that might require building some new infrastructure similar to XactTruncationLock but for MXIDs. 4. Provide entrypoints for looking up XIDs that fail gently instead of throwing errors. I've got my doubts about how practical this is; if it's easy, why didn't we do that instead of inventing XactTruncationLock? Maybe there are other options here, too? At the moment, I'm thinking that (2) and (4) are just bad and so we ought to either do (3) if it doesn't suck too much for performance (which I don't quite see why it should, but it might) or else fall back on (1). (1) doesn't feel clever enough but it might be better to be not clever enough than to be too clever. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
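(For concreteness, option (3) as I'm imagining it is a tiny wrapper along these lines, completely untested, with the context struct and cache fields invented for the example:

    /*
     * Look up commit status while holding XactTruncationLock so that clog
     * cannot be truncated out from under us, caching the most recent answer.
     */
    static bool
    xid_committed_safe(HeapCheckContext *ctx, TransactionId xid)
    {
        bool        committed;

        if (TransactionIdEquals(xid, ctx->cached_xid))
            return ctx->cached_committed;

        LWLockAcquire(XactTruncationLock, LW_SHARED);
        committed = TransactionIdDidCommit(xid);
        LWLockRelease(XactTruncationLock);

        ctx->cached_xid = xid;
        ctx->cached_committed = committed;
        return committed;
    }

A real version would also want to recheck the XID against the oldest clog XID while holding the lock, the way pg_xact_status() does, and report anything older as unsafe to look up rather than consulting clog; and as noted, the MXID side would need analogous infrastructure.)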
On Tue, Aug 4, 2020 at 12:00 PM Peter Geoghegan <pg@bowt.ie> wrote: > With indexes you tend to have redundancy in how relationships among > pages are described. So you have siblings whose pointers must be in > agreement (left points to right, right points to left), and it's not > clear which one you should trust when they don't agree. It's not like > simple heuristics get you all that far. I really can't think of a good > one, and detecting corruption should mean detecting truly exceptional > cases. I guess you could build a model based on Bayesian methods, or > something like that. But that is very complicated, and only used when > you actually have corruption -- which is presumably extremely rare in > reality. That's very unappealing as a project. I think it might be possible to distinguish between different types of corruption and to separate, at least to some degree, the checking associated with each type. I think one can imagine something that checks the structure of a btree without regard to the contents. That is, it cares that left and right links are consistent with each other and with downlinks from the parent level. So it checks things like the left link of the page to which my right link points is pointing back to me, and that's also the page to which my parent's next downlink points. It could also verify that there's a proper tree structure, where every page has a well-defined tree level. So you assign the root page level 1, and each time you traverse a downlink you assign that page a level one larger. If you ever try to assign to a page a level unequal to the level previously assigned to it, you report that as a problem. You can check, too, that if a page does not have a left or right link, it's actually the last page at that level according what you saw at the parent, grandparent, etc. levels. Finally, you can check that all of the max-level pages you can find are leaf pages, and the others are all internal pages. All of this structural stuff can be verified without caring a whit about what keys you've got or what they mean or whether there's even a heap associated with this index. Now a second type of checking, which can also be done without regard to keys, is checking that the TIDs in the index point to TIDs that are on heap pages that actually exist, and that the corresponding items are not unused, nor are they tuples which are not the root of a HOT chain. Passing a check of this type doesn't prove that the index and heap are consistent, but failing it proves that they are inconsistent. This kind of check can be done on every leaf index page you can find by any means even if it fails the structural checks described above. Failure of these checks on one page does not preclude checking the same invariants for other pages. Let's call this kind of thing "basic index-heap sanity checking." A third type of checking is to verify the relationship between the index keys within and across the index pages: are the keys actually in order within a page, and are they in order across pages? The first part of this can be checked individually for each page pretty much no matter what other problems we may have; we only have to abandon this checking for a particular page if it's total garbage and we cannot identify any index items on the page at all. The second part, though, has the problem you mention. I think the solution is to skip the second part of the check for any pages that failed related structural checks. 
For example, if my right sibling thinks that I am not its left sibling, or my right sibling and I agree that we are siblings but do not agree on who our parent is, or if that parent does not agree that we have the same sibling relationship that we think we have, then we should report that problem and forget about issuing any complaints about the relationship between my key space and that sibling's key space. The internal consistency of each page with respect to key ordering can still be verified, though, and it's possible that my key space can be validly compared to the key space of my other sibling, if the structural checks pass on that side. A fourth type of checking is to verify the index key against the keys in the heap tuples to which they point, but only for index tuples that passed the basic index-heap sanity checking and where the tuples have not been pruned. This can be sensibly done even if the structural checks or index-ordering checks have failed. I don't mean to suggest that one would implement all of these things as separate phases; that would be crazy expensive, and what if things changed by the time you visit the page? Rather, the checks likely ought to be interleaved, just keeping track internally of which things need to be skipped because prerequisite checks have already failed. Aside from providing a way to usefully continue after errors, this would also be useful in certain scenarios where you want to know what kind of corruption you have. For example, suppose that I start getting wrong answers from index lookups on a particular index. Upon investigation, it turns out that my last glibc update changed my OS collation definitions for the collation I'm using, and therefore it is to be expected that some of my keys may appear to be out of order with respect to the new definitions. Now what I really want to know before running REINDEX is that this is the only problem I have. It would be amazing if I could run the tool and have it give me a list of problems so that I could confirm that I have only index-ordering problems, not any other kind, and even more amazing if it could tell me the specific keys that were affected so that I could understand exactly how the sorting behavior changed. If I were to discover that my index also has structural problems or inconsistencies with the heap, then I'd know that it couldn't be right to blame it only the collation update; something else has gone wrong. I'm speaking here with fairly limited knowledge of the details of how all this actually works and, again, I'm not trying to suggest that you or anyone is obligated to do any work on this, or that it would be easy to accomplish or worth the time it took. I'm just trying to sketch out what I see as maybe being theoretically possible, and why I think it would be useful if it did. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
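(A toy sketch of the level-assignment bookkeeping described above, where page_is_leaf() and page_downlinks() are invented stand-ins for the real page-reading code:

    /*
     * Assign a depth to every page reachable from the root, starting with
     * level 1 for the root itself.  levels[] is indexed by block number and
     * starts out all zeroes, meaning "not yet visited".
     */
    static void
    assign_level(BlockNumber blkno, int level, int *levels, BlockNumber nblocks)
    {
        BlockNumber *children;
        int         nchildren;
        int         i;

        if (blkno >= nblocks)
            return;             /* downlink points past the end of the index: report corruption */
        if (levels[blkno] == level)
            return;             /* already visited at this depth */
        if (levels[blkno] != 0)
            return;             /* reachable at two different depths: report corruption */
        levels[blkno] = level;

        if (page_is_leaf(blkno))
            return;
        children = page_downlinks(blkno, &nchildren);
        for (i = 0; i < nchildren; i++)
            assign_level(children[i], level + 1, levels, nblocks);
    }

Obviously the real thing would report the problems through the normal reporting function instead of silently returning, and would avoid unbounded recursion, but that's the shape of the structural check I have in mind.)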
On Tue, Aug 4, 2020 at 9:44 AM Robert Haas <robertmhaas@gmail.com> wrote: > I think it might be possible to distinguish between different types of > corruption and to separate, at least to some degree, the checking > associated with each type. I think one can imagine something that > checks the structure of a btree without regard to the contents. That > is, it cares that left and right links are consistent with each other > and with downlinks from the parent level. So it checks things like the > left link of the page to which my right link points is pointing back > to me, and that's also the page to which my parent's next downlink > points. I think that this kind of phased approach to B-Tree verification is possible, more or less, but hard to justify. And it seems impossible to do with only an AccessShareLock. It's not clear that what you describe is much better than just checking a bunch of indexes and seeing what patterns emerge. For example, the involvement of collated text might be a common factor across indexes. That kind of pattern is the first thing that I look for, and often the only thing. It also serves to give me an idea of how messed up things are. There are not that many meaningful degrees of messed-up with indexes in my experience. The first error really does tell you most of what you need to know about any given corrupt index. Kind of like how you can bucket the number of cockroaches in your home into perhaps three meaningful buckets: 0 cockroaches, at least 1 cockroach, and lots of cockroaches. (Even there, if you really care about the distinction between the second and third bucket, something has gone terribly wrong -- so even three buckets seems like a lot to me.) FWIW, current DEBUG1 + DEBUG2 output for amcheck shows you quite a lot of details about the tree structure. It's a handy way of getting a sense of what's going on at a high level. For example, if index corruption is found very early on, that strongly suggests that it's pretty pervasive. > Now a second type of checking, which can also be done without regard > to keys, is checking that the TIDs in the index point to TIDs that are > on heap pages that actually exist, and that the corresponding items > are not unused, nor are they tuples which are not the root of a HOT > chain. Passing a check of this type doesn't prove that the index and > heap are consistent, but failing it proves that they are inconsistent. > This kind of check can be done on every leaf index page you can find > by any means even if it fails the structural checks described above. > Failure of these checks on one page does not preclude checking the > same invariants for other pages. Let's call this kind of thing "basic > index-heap sanity checking." One real weakness in the current code is our inability to detect index tuples that are in the correct order and so on, but point to the wrong thing -- we can detect that if it manifests itself as the absence of an index tuple that should be in the index (when you use heapallindexed verification), but we cannot *reliably* detect the presence of an index tuple that shouldn't be in the index at all (though in practice it probably mostly gets caught). The checks on the tree structure itself are excellent with bt_index_parent_check() following Alexander's commit d114cc53 (which I thought was really excellent work). But we still have that one remaining blind spot in verify_nbtree.c, even when you opt in to every possible type of verification (i.e. bt_index_parent_check() with all options). 
I'd much rather fix that, or help with the new heap checker stuff. > A fourth type of checking is to verify the index key against the keys > in the heap tuples to which they point, but only for index tuples that > passed the basic index-heap sanity checking and where the tuples have > not been pruned. This can be sensibly done even if the structural > checks or index-ordering checks have failed. That's going to require the equivalent of a merge join, which is terribly expensive relative to such a small benefit. > Aside from providing a way to usefully continue after errors, this > would also be useful in certain scenarios where you want to know what > kind of corruption you have. For example, suppose that I start getting > wrong answers from index lookups on a particular index. Upon > investigation, it turns out that my last glibc update changed my OS > collation definitions for the collation I'm using, and therefore it is > to be expected that some of my keys may appear to be out of order with > respect to the new definitions. Now what I really want to know before > running REINDEX is that this is the only problem I have. It would be > amazing if I could run the tool and have it give me a list of problems > so that I could confirm that I have only index-ordering problems, not > any other kind, and even more amazing if it could tell me the specific > keys that were affected so that I could understand exactly how the > sorting behavior changed. This detail seems really hard. There are probably lots of cases where the sorting behavior changed but it just didn't affect you, given the data you had -- it just so happened that you didn't have exactly the wrong kind of diacritic mark or whatever. After all, revisions to how strings in a given natural language are supposed to sort are likely to be relatively rare and relatively obscure (even among people that speak the language in question). Also, the task of figuring out if the tuple to the left or the right is in the wrong order seems kind of daunting. Meanwhile, a simple smoke test covering many indexes probably gives you a fairly meaningful idea of the extent of the damage, without requiring that we do any hard engineering work. > I'm speaking here with fairly limited knowledge of the details of how > all this actually works and, again, I'm not trying to suggest that you > or anyone is obligated to do any work on this, or that it would be > easy to accomplish or worth the time it took. I'm just trying to > sketch out what I see as maybe being theoretically possible, and why I > think it would be useful if it did. I don't think that your relatively limited knowledge of the B-Tree code is an issue here -- your intuitions seem pretty reasonable. I appreciate your perspective here. Corruption detection presents us with some odd qualitative questions of the kind that are just awkward to discuss. Discouraging perspectives that don't quite match my own would be quite counterproductive. That having been said, I suspect that this is a huge task for a small benefit. It's exceptionally hard to test because you have lots of non-trivial code that only gets used in circumstances that by definition should never happen. If users really needed to recover the data in the index then maybe it would happen -- but they don't. The biggest problem that amcheck currently has is that it isn't used enough, because it isn't positioned as a general purpose tool at all. I'm hoping that the work from Mark helps with that. -- Peter Geoghegan
On Tue, Aug 4, 2020 at 9:06 PM Peter Geoghegan <pg@bowt.ie> wrote: > of messed-up with indexes in my experience. The first error really > does tell you most of what you need to know about any given corrupt > index. Kind of like how you can bucket the number of cockroaches in > your home into perhaps three meaningful buckets: 0 cockroaches, at > least 1 cockroach, and lots of cockroaches. (Even there, if you really > care about the distinction between the second and third bucket, > something has gone terribly wrong -- so even three buckets seems like > a lot to me.) Not sure I agree with this. As a homeowner, the distinction between 0 and 1 is less significant to me than the distinction between a few (preferably in places where I'll never see them) and whole lot. I agree with you to an extent though: all I really care about is whether I have too few to worry about, enough that I'd better try to take care of it somehow, or so many that I need a professional exterminator. If, however, I were a professional exterminator, I would be unhappy with just knowing that there are few problems or many. I suspect I would want to know something about where the problems were, and get a more nuanced indication of just how bad things are in each location. FWIW, pg_catcheck is an example of an existing tool (designed by me and written partially by me) that uses the kind of model I'm talking about. It does a single SELECT * FROM pg_<whatever> on each catalog table - so that it doesn't get confused if your system catalog indexes are messed up - and then performs a bunch of cross-checks on the tuples it gets back and tells you about all the messed up stuff. If it can't get data from all your catalog tables it performs whichever checks are valid given what data it was able to get. As a professional exterminator of catalog corruption, I find it quite helpful. If someone sends me the output from a database cluster, I can tell right away whether they are just fine, in a little bit of trouble, or in a whole lot of trouble; I can speculate pretty well about what kind of thing might've happened to cause the problem; and I can recommend steps to straighten things out. > FWIW, current DEBUG1 + DEBUG2 output for amcheck shows you quite a lot > of details about the tree structure. It's a handy way of getting a > sense of what's going on at a high level. For example, if index > corruption is found very early on, that strongly suggests that it's > pretty pervasive. Interesting. > > A fourth type of checking is to verify the index key against the keys > > in the heap tuples to which they point, but only for index tuples that > > passed the basic index-heap sanity checking and where the tuples have > > not been pruned. This can be sensibly done even if the structural > > checks or index-ordering checks have failed. > > That's going to require the equivalent of a merge join, which is > terribly expensive relative to such a small benefit. I think it depends on how big your data is. If you've got a 2TB table and 512GB of RAM, it's pretty impractical no matter the algorithm. But for small tables even a naive nested loop will suffice. > Meanwhile, a simple smoke test covering many indexes probably gives > you a fairly meaningful idea of the extent of the damage, without > requiring that we do any hard engineering work. 
In my experience, when EDB customers complain about corruption-related problems, the two most common patterns are: (1) my whole system is messed up and (2) I have one or a few specific objects which are messed up and everything else is fine. The first category is often something like inability to start the database, or scary messages in the log file complaining about, say, checkpoints failing. The second category is the one I'm worried about here. The people who are in this category generally already know which things are broken; they've figured that out through trial and error. Sometimes they miss some problems, but more frequently, in my experience, their understanding of what problems they have is accurate. Now that category of users can be further decomposed into two groups: the people who don't care what happened and just want to barrel through it, and the people who do care what happened and want to know what happened, why it happened, whether it's a bug, etc. The first group are unproblematic: tell them to REINDEX (or restore from backup, or whatever) and you're done. The second group is a lot harder. It is in general difficult to speculate about how something that is now wrong got that way given knowledge only of the present state of affairs. But good tooling makes it easier to speculate intelligently. To take a classic example, there's a great difference between a checksum failure caused by the checksum being incorrect on an otherwise-valid page; a checksum failure on a page the first half of which appears valid and the second half of which looks like it might be some other database page; and a checksum failure on a page whose contents appear to be taken from a Microsoft Word document. I'm not saying we ever want a tool which tries to figure that sort of thing out in an automated way; there's no substitute for human intelligence (yet, anyway). But, the more the tools we do have localize the problems to particular pages or tuples and describe them accurately, the easier it is to do manual investigation as follow-up, when it's necessary. > That having been said, I suspect that this is a huge task for a small > benefit. It's exceptionally hard to test because you have lots of > non-trivial code that only gets used in circumstances that by > definition should never happen. If users really needed to recover the > data in the index then maybe it would happen -- but they don't. Yep, that's a very key difference as compared to the heap. > The biggest problem that amcheck currently has is that it isn't used > enough, because it isn't positioned as a general purpose tool at all. > I'm hoping that the work from Mark helps with that. Agreed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
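(As a concrete illustration of the pg_catcheck-style model Robert describes above -- this is only a sketch of the kind of cross-check such a tool performs, not pg_catcheck's actual code -- one check might look for pg_attribute rows whose parent pg_class row is missing:)

-- sequential scans of the catalogs, no reliance on catalog indexes
SELECT a.attrelid, a.attname
FROM pg_catalog.pg_attribute a
LEFT JOIN pg_catalog.pg_class c ON c.oid = a.attrelid
WHERE c.oid IS NULL;

Run over plain scans like this, a battery of such queries can degrade gracefully: whichever catalogs remain readable still get whatever checks only depend on them.
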
On Wed, Aug 5, 2020 at 7:09 AM Robert Haas <robertmhaas@gmail.com> wrote: > Not sure I agree with this. As a homeowner, the distinction between 0 > and 1 is less significant to me than the distinction between a few > (preferably in places where I'll never see them) and whole lot. I > agree with you to an extent though: all I really care about is whether > I have too few to worry about, enough that I'd better try to take care > of it somehow, or so many that I need a professional exterminator. If, > however, I were a professional exterminator, I would be unhappy with > just knowing that there are few problems or many. I suspect I would > want to know something about where the problems were, and get a more > nuanced indication of just how bad things are in each location. Right, but the professional exterminator can be expected to use expert level tools, where a great deal of technical sophistication is required to interpret what's going on sensibly. An amateur can only use them to determine if something is wrong at all, which is usually not how they add value. (I think that my analogy is slightly flawed in that it hinged upon everybody hating cockroaches as much as I do, which is more than the ordinary amount.) > FWIW, pg_catcheck is an example of an existing tool (designed by me > and written partially by me) that uses the kind of model I'm talking > about. It does a single SELECT * FROM pg_<whatever> on each catalog > table - so that it doesn't get confused if your system catalog indexes > are messed up - and then performs a bunch of cross-checks on the > tuples it gets back and tells you about all the messed up stuff. If it > can't get data from all your catalog tables it performs whichever > checks are valid given what data it was able to get. As a professional > exterminator of catalog corruption, I find it quite helpful. I myself seem to have had quite different experiences with corruption, presumably because it happened at product companies like Heroku. I tended to find software bugs (e.g. the one fixed by commit 008c4135) that were rare and novel by casting a wide net over a large number of relatively homogenous databases. Whereas your experiences tend to involve large support customers with more opportunity for operator error. Both perspectives are important. > The second group is a lot harder. It is in general difficult to > speculate about how something that is now wrong got that way given > knowledge only of the present state of affairs. But good tooling makes > it easier to speculate intelligently. To take a classic example, > there's a great difference between a checksum failure caused by the > checksum being incorrect on an otherwise-valid page; a checksum > failure on a page the first half of which appears valid and the second > half of which looks like it might be some other database page; and a > checksum failure on a page whose contents appear to be taken from a > Microsoft Word document. I'm not saying we ever want a tool which > tries to figure that sort of thing out in an automated way; there's no > substitute for human intelligence (yet, anyway). I wrote my own expert level tool, pg_hexedit. I have to admit that the level of interest in that tool doesn't seem to be all that great, though I myself have used it to investigate corruption to great effect. But I suppose there is no way to know how it's being used. -- Peter Geoghegan
On Wed, Aug 5, 2020 at 4:36 PM Peter Geoghegan <pg@bowt.ie> wrote: > Right, but the professional exterminator can be expected to use expert > level tools, where a great deal of technical sophistication is > required to interpret what's going on sensibly. An amatuer can only > use them to determine if something is wrong at all, which is usually > not how they add value. Quite true. > I myself seem to have had quite different experiences with corruption, > presumably because it happened at product companies like Heroku. I > tended to find software bugs (e.g. the one fixed by commit 008c4135) > that were rare and novel by casting a wide net over a large number of > relatively homogenous databases. Whereas your experiences tend to > involve large support customers with more opportunity for operator > error. Both perspectives are important. I concur. > I wrote my own expert level tool, pg_hexedit. I have to admit that the > level of interest in that tool doesn't seem to be all that great, > though I myself have used it to investigate corruption to great > effect. But I suppose there is no way to know how it's being used. I admit not to having tried pg_hexedit, but I doubt that it would help me very much outside of my own development work. The problem is that in a typical case I am trying to help someone in a professional capacity without access to their machines, and without knowledge of their environment or data. Moreover, sometimes the person I'm trying to help is an unreliable narrator. I can ask people to run tools they have and send the output, and then I can look at that output and tell them what to do next. But it has to be a tool they have (or they can easily get) and it can't involve any complicated if-then stuff. Something like "if the page is totally garbled then do X but if it looks mostly OK then do Y" is radically out of reach. They have no clue about that. Hence my interest in tools that automate as much of the investigation as may be practical. We're probably beating this topic to death at this point; I don't think we are really in any sort of meaningful disagreement, and the next steps in this particular case seem clear enough. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jul 30, 2020 at 11:29 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jul 27, 2020 at 1:02 PM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: > > Not at all! I appreciate all the reviews. > > Reviewing 0002, reading through verify_heapam.c: > > +typedef enum SkipPages > +{ > + SKIP_ALL_FROZEN_PAGES, > + SKIP_ALL_VISIBLE_PAGES, > + SKIP_PAGES_NONE > +} SkipPages; > > This looks inconsistent. Maybe just start them all with SKIP_PAGES_. > > + if (PG_ARGISNULL(0)) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("missing required parameter for 'rel'"))); > > This doesn't look much like other error messages in the code. Do > something like git grep -A4 PG_ARGISNULL | grep -A3 ereport and study > the comparables. > > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("unrecognized parameter for 'skip': %s", skip), > + errhint("please choose from 'all-visible', 'all-frozen', or 'none'"))); > > Same problem. Check pg_prewarm's handling of the prewarm type, or > EXPLAIN's handling of the FORMAT option, or similar examples. Read the > message style guidelines concerning punctuation of hint and detail > messages. > > + * Bugs in pg_upgrade are reported (see commands/vacuum.c circa line 1572) > + * to have sometimes rendered the oldest xid value for a database invalid. > + * It seems unwise to report rows as corrupt for failing to be newer than > + * a value which itself may be corrupt. We instead use the oldest xid for > + * the entire cluster, which must be at least as old as the oldest xid for > + * our database. > > This kind of reference to another comment will not age well; line > numbers and files change a lot. But I think the right thing to do here > is just rely on relfrozenxid and relminmxid. If the table is > inconsistent with those, then something needs fixing. datfrozenxid and > the cluster-wide value can look out for themselves. The corruption > detector shouldn't be trying to work around any bugs in setting > relfrozenxid itself; such problems are arguably precisely what we're > here to find. > > +/* > + * confess > + * > + * Return a message about corruption, including information > + * about where in the relation the corruption was found. > + * > + * The msg argument is pfree'd by this function. > + */ > +static void > +confess(HeapCheckContext *ctx, char *msg) > > Contrary to what the comments say, the function doesn't return a > message about corruption or anything else. It returns void. > > I don't really like the name, either. I get that it's probably > inspired by Perl, but I think it should be given a less-clever name > like report_corruption() or something. > > + * corrupted table from using workmem worth of memory building up the > > This kind of thing destroys grep-ability. If you're going to refer to > work_mem, you gotta spell it the same way we do everywhere else. > > + * Helper function to construct the TupleDesc needed by verify_heapam. > > Instead of saying it's the TupleDesc somebody needs, how about saying > that it's the TupleDesc that we'll use to report problems that we find > while scanning the heap, or something like that? > > + * Given a TransactionId, attempt to interpret it as a valid > + * FullTransactionId, neither in the future nor overlong in > + * the past. Stores the inferred FullTransactionId in *fxid. > > It really doesn't, because there's no such thing as 'fxid' referenced > anywhere here. 
You should really make the effort to proofread your > patches before posting, and adjust comments and so on as you go. > Otherwise reviewing takes longer, and if you keep introducing new > stuff like this as you fix other stuff, you can fail to ever produce a > committable patch. > > + * Determine whether tuples are visible for verification. Similar to > + * HeapTupleSatisfiesVacuum, but with critical differences. > > Not accurate, because it also reports problems, which is not mentioned > anywhere in the function header comment that purports to be a detailed > description of what the function does. > > + else if (TransactionIdIsCurrentTransactionId(raw_xmin)) > + return true; /* insert or delete in progress */ > + else if (TransactionIdIsInProgress(raw_xmin)) > + return true; /* HEAPTUPLE_INSERT_IN_PROGRESS */ > + else if (!TransactionIdDidCommit(raw_xmin)) > + { > + return false; /* HEAPTUPLE_DEAD */ > + } > > One of these cases is not punctuated like the others. > > + pstrdup("heap tuple with XMAX_IS_MULTI is neither LOCKED_ONLY nor > has a valid xmax")); > > 1. I don't think that's very grammatical. > > 2. Why abbreviate HEAP_XMAX_IS_MULTI to XMAX_IS_MULTI and > HEAP_XMAX_IS_LOCKED_ONLY to LOCKED_ONLY? I don't even think you should > be referencing C constant names here at all, and if you are I don't > think you should abbreviate, and if you do abbreviate I don't think > you should omit different numbers of words depending on which constant > it is. > > I wonder what the intended division of responsibility is here, > exactly. It seems like you've ended up with some sanity checks in > check_tuple() before tuple_is_visible() is called, and others in > tuple_is_visible() proper. As far as I can see the comments don't > really discuss the logic behind the split, but there's clearly a close > relationship between the two sets of checks, even to the point where > you have "heap tuple with XMAX_IS_MULTI is neither LOCKED_ONLY nor has > a valid xmax" in tuple_is_visible() and "tuple xmax marked > incompatibly as keys updated and locked only" in check_tuple(). Now, > those are not the same check, but they seem like closely related > things, so it's not ideal that they happen in different functions with > differently-formatted messages to report problems and no explanation > of why it's different. > > I think it might make sense here to see whether you could either move > more stuff out of tuple_is_visible(), so that it really just checks > whether the tuple is visible, or move more stuff into it, so that it > has the job not only of checking whether we should continue with > checks on the tuple contents but also complaining about any other > visibility problems. Or if neither of those make sense then there > should be a stronger attempt to rationalize in the comments what > checks are going where and for what reason, and also a stronger > attempt to rationalize the message wording. > > + curchunk = DatumGetInt32(fastgetattr(toasttup, 2, > + ctx->toast_rel->rd_att, &isnull)); > > Should we be worrying about the possibility of fastgetattr crapping > out if the TOAST tuple is corrupted? 
> > + if (ctx->tuphdr->t_hoff + ctx->offset > ctx->lp_len) > + { > + confess(ctx, > + psprintf("tuple attribute should start at offset %u, but tuple > length is only %u", > + ctx->tuphdr->t_hoff + ctx->offset, ctx->lp_len)); > + return false; > + } > + > + /* Skip null values */ > + if (infomask & HEAP_HASNULL && att_isnull(ctx->attnum, ctx->tuphdr->t_bits)) > + return true; > + > + /* Skip non-varlena values, but update offset first */ > + if (thisatt->attlen != -1) > + { > + ctx->offset = att_align_nominal(ctx->offset, thisatt->attalign); > + ctx->offset = att_addlength_pointer(ctx->offset, thisatt->attlen, > + tp + ctx->offset); > + return true; > + } > > This looks like it's not going to complain about a fixed-length > attribute that overruns the tuple length. There's code further down > that handles that case for a varlena attribute, but there's nothing > comparable for the fixed-length case. > > + confess(ctx, > + psprintf("%s toast at offset %u is unexpected", > + va_tag == VARTAG_INDIRECT ? "indirect" : > + va_tag == VARTAG_EXPANDED_RO ? "expanded" : > + va_tag == VARTAG_EXPANDED_RW ? "expanded" : > + "unexpected", > + ctx->tuphdr->t_hoff + ctx->offset)); > > I suggest "unexpected TOAST tag %d", without trying to convert to a > string. Such a conversion will likely fail in the case of genuine > corruption, and isn't meaningful even if it works. > > Again, let's try to standardize terminology here: most of the messages > in this function are now of the form "tuple attribute %d has some > problem" or "attribute %d has some problem", but some have neither. > Since we're separately returning attnum I don't see why it should be > in the message, and if we weren't separately returning attnum then it > ought to be in the message the same way all the time, rather than > sometimes writing "attribute" and other times "tuple attribute". > > + /* Check relminmxid against mxid, if any */ > + xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr); > + if (infomask & HEAP_XMAX_IS_MULTI && > + MultiXactIdPrecedes(xmax, ctx->relminmxid)) > + { > + confess(ctx, > + psprintf("tuple xmax %u precedes relminmxid %u", > + xmax, ctx->relminmxid)); > + fatal = true; > + } > > There are checks that an XID is neither too old nor too new, and > presumably something similar could be done for MultiXactIds, but here > you only check one end of the range. Seems like you should check both. > > + /* Check xmin against relfrozenxid */ > + xmin = HeapTupleHeaderGetXmin(ctx->tuphdr); > + if (TransactionIdIsNormal(ctx->relfrozenxid) && > + TransactionIdIsNormal(xmin)) > + { > + if (TransactionIdPrecedes(xmin, ctx->relfrozenxid)) > + { > + confess(ctx, > + psprintf("tuple xmin %u precedes relfrozenxid %u", > + xmin, ctx->relfrozenxid)); > + fatal = true; > + } > + else if (!xid_valid_in_rel(xmin, ctx)) > + { > + confess(ctx, > + psprintf("tuple xmin %u follows last assigned xid %u", > + xmin, ctx->next_valid_xid)); > + fatal = true; > + } > + } > > Here you do check both ends of the range, but the comment claims > otherwise. Again, please proof-read for this kind of stuff. > > + /* Check xmax against relfrozenxid */ > > Ditto here. 
> > + psprintf("tuple's header size is %u bytes which is less than the %u > byte minimum valid header size", > > I suggest: tuple data begins at byte %u, but the tuple header must be > at least %u bytes > > + psprintf("tuple's %u byte header size exceeds the %u byte length of > the entire tuple", > > I suggest: tuple data begins at byte %u, but the entire tuple length > is only %u bytes > > + psprintf("tuple's user data offset %u not maximally aligned to %u", > > I suggest: tuple data begins at byte %u, but that is not maximally aligned > Or: tuple data begins at byte %u, which is not a multiple of %u > > That makes the messages look much more similar to each other > grammatically and is more consistent about calling things by the same > names. > > + psprintf("tuple with null values has user data offset %u rather than > the expected offset %u", > + psprintf("tuple without null values has user data offset %u rather > than the expected offset %u", > > I suggest merging these: tuple data offset %u, but expected offset %u > (%u attributes, %s) > where %s is either "has nulls" or "no nulls" > > In fact, aren't several of the above checks redundant with this one? > Like, why check for a value less than SizeofHeapTupleHeader or that's > not properly aligned first? Just check this straightaway and call it > good. > > + * If we get this far, the tuple is visible to us, so it must not be > + * incompatible with our relDesc. The natts field could be legitimately > + * shorter than rel's natts, but it cannot be longer than rel's natts. > > This is yet another case where you didn't update the comments. > tuple_is_visible() now checks whether the tuple is visible to anyone, > not whether it's visible to us, but the comment doesn't agree. In some > sense I think this comment is redundant with the previous one anyway, > because that one already talks about the tuple being visible. Maybe > just write: The tuple is visible, so it must be compatible with the > current version of the relation descriptor. It might have fewer > columns than are present in the relation descriptor, but it cannot > have more. > > + psprintf("tuple has %u attributes in relation with only %u attributes", > + ctx->natts, > + RelationGetDescr(ctx->rel)->natts)); > > I suggest: tuple has %u attributes, but relation has only %u attributes > > + /* > + * Iterate over the attributes looking for broken toast values. This > + * roughly follows the logic of heap_deform_tuple, except that it doesn't > + * bother building up isnull[] and values[] arrays, since nobody wants > + * them, and it unrolls anything that might trip over an Assert when > + * processing corrupt data. > + */ > + ctx->offset = 0; > + for (ctx->attnum = 0; ctx->attnum < ctx->natts; ctx->attnum++) > + { > + if (!check_tuple_attribute(ctx)) > + break; > + } > > I think this comment is too wordy. This text belongs in the header > comment of check_tuple_attribute(), not at the place where it gets > called. Otherwise, as you update what check_tuple_attribute() does, > you have to remember to come find this comment and fix it to match, > and you might forget to do that. In fact... looks like that already > happened, because check_tuple_attribute() now checks more than broken > TOAST attributes. Seems like you could just simplify this down to > something like "Now check each attribute." Also, you could lose the > extra braces. > > - bt_index_check | relname | relpages > + bt_index_check | relname | relpages > > Don't include unrelated changes in the patch. 
> > I'm not really sure that the list of fields you're displaying for each > reported problem really makes sense. I think the theory here should be > that we want to report the information that the user needs to localize > the problem but not everything that they could find out from > inspecting the page, and not things that are too specific to > particular classes of errors. So I would vote for keeping blkno, > offnum, and attnum, but I would lose lp_flags, lp_len, and chunk. > lp_off feels like it's a more arguable case: technically, it's a > locator for the problem, because it gives you the byte offset within > the page, but normally we reference tuples by TID, i.e. (blkno, > offset), not byte offset. On balance I'd be inclined to omit it. > > -- In addition to this, I found a few more things while reading v13 patch are as below: Patch v13-0001: - +#include "amcheck.h" Not in correct order. +typedef struct BtreeCheckContext +{ + TupleDesc tupdesc; + Tuplestorestate *tupstore; + bool is_corrupt; + bool on_error_stop; +} BtreeCheckContext; Unnecessary spaces/tabs between } and BtreeCheckContext. static void bt_index_check_internal(Oid indrelid, bool parentcheck, - bool heapallindexed, bool rootdescend); + bool heapallindexed, bool rootdescend, + BtreeCheckContext * ctx); Unnecessary space between * and ctx. The same changes needed for other places as well. --- Patch v13-0002: +-- partitioned tables (the parent ones) don't have visibility maps +create table test_partitioned (a int, b text default repeat('x', 5000)) + partition by list (a); +-- these should all fail +select * from verify_heapam('test_partitioned', + on_error_stop := false, + skip := NULL, + startblock := NULL, + endblock := NULL); +ERROR: "test_partitioned" is not a table, materialized view, or TOAST table +create table test_partition partition of test_partitioned for values in (1); +create index test_index on test_partition (a); Can't we make it work? If the input is partitioned, I think we could collect all its leaf partitions and process them one by one. Thoughts? + ctx->chunkno++; Instead of incrementing in check_toast_tuple(), I think incrementing should happen at the caller -- just after check_toast_tuple() call. --- Patch v13-0003: + resetPQExpBuffer(query); + destroyPQExpBuffer(query); resetPQExpBuffer() will be unnecessary if the next call is destroyPQExpBuffer(). + appendPQExpBuffer(query, + "SELECT c.relname, v.blkno, v.offnum, v.lp_off, " + "v.lp_flags, v.lp_len, v.attnum, v.chunk, v.msg" + "\nFROM verify_heapam(rel := %u, on_error_stop := %s, " + "skip := %s, startblock := %s, endblock := %s) v, " + "pg_class c" + "\nWHERE c.oid = %u", + tbloid, stop, skip, settings.startblock, + settings.endblock, tbloid pg_class should be schema-qualified like elsewhere. IIUC, pg_class is meant to get relname only, instead, we could use '%u'::pg_catalog.regclass in the target list for the relname. Thoughts? Also I think we should skip '\n' from the query string (see appendPQExpBuffer() in pg_dump.c) + appendPQExpBuffer(query, + "SELECT i.indexrelid" + "\nFROM pg_catalog.pg_index i, pg_catalog.pg_class c" + "\nWHERE i.indexrelid = c.oid" + "\n AND c.relam = %u" + "\n AND i.indrelid = %u", + BTREE_AM_OID, tbloid); + + ExecuteSqlStatement("RESET search_path"); + res = ExecuteSqlQuery(query->data, PGRES_TUPLES_OK); + PQclear(ExecuteSqlQueryForSingleRow(ALWAYS_SECURE_SEARCH_PATH_SQL)); I don't think we need the search_path query. The main query doesn't have any dependencies on it. 
Same is in check_indexes(), check_index (), expand_table_name_patterns() & get_table_check_list(). Correct me if I am missing something. + output = PageOutput(lines + 2, NULL); + for (lineno = 0; usage_text[lineno]; lineno++) + fprintf(output, "%s\n", usage_text[lineno]); + fprintf(output, "Report bugs to <%s>.\n", PACKAGE_BUGREPORT); + fprintf(output, "%s home page: <%s>\n", PACKAGE_NAME, PACKAGE_URL); I am not sure why we want PageOutput() if the second argument is always going to be NULL? Can't we directly use printf() instead of PageOutput() + fprintf() ? e.g. usage() function in pg_basebackup.c. Regards, Amul
> On Aug 16, 2020, at 9:37 PM, Amul Sul <sulamul@gmail.com> wrote: > > In addition to this, I found a few more things while reading v13 patch are as > below: > > Patch v13-0001: > > - > +#include "amcheck.h" > > Not in correct order. Fixed. > +typedef struct BtreeCheckContext > +{ > + TupleDesc tupdesc; > + Tuplestorestate *tupstore; > + bool is_corrupt; > + bool on_error_stop; > +} BtreeCheckContext; > > Unnecessary spaces/tabs between } and BtreeCheckContext. This refers to a change in verify_nbtree.c that has been removed. Per discussions with Peter and Robert, I have simply withdrawn that portion of the patch. > static void bt_index_check_internal(Oid indrelid, bool parentcheck, > - bool heapallindexed, bool rootdescend); > + bool heapallindexed, bool rootdescend, > + BtreeCheckContext * ctx); > > Unnecessary space between * and ctx. The same changes needed for other places as > well. Same as above. The changes to verify_nbtree.c have been withdrawn. > --- > > Patch v13-0002: > > +-- partitioned tables (the parent ones) don't have visibility maps > +create table test_partitioned (a int, b text default repeat('x', 5000)) > + partition by list (a); > +-- these should all fail > +select * from verify_heapam('test_partitioned', > + on_error_stop := false, > + skip := NULL, > + startblock := NULL, > + endblock := NULL); > +ERROR: "test_partitioned" is not a table, materialized view, or TOAST table > +create table test_partition partition of test_partitioned for values in (1); > +create index test_index on test_partition (a); > > Can't we make it work? If the input is partitioned, I think we could > collect all its leaf partitions and process them one by one. Thoughts? I was following the example from pg_visibility. I haven't thought about your proposal enough to have much opinion as yet, except that if we do this for pg_amcheck we should do likewise to pg_visibility, for consistency of the user interface. > + ctx->chunkno++; > > Instead of incrementing in check_toast_tuple(), I think incrementing should > happen at the caller -- just after check_toast_tuple() call. I agree. > --- > > Patch v13-0003: > > + resetPQExpBuffer(query); > + destroyPQExpBuffer(query); > > resetPQExpBuffer() will be unnecessary if the next call is destroyPQExpBuffer(). Thanks. I removed it in cases where destroyPQExpBuffer is obviously the very next call. > + appendPQExpBuffer(query, > + "SELECT c.relname, v.blkno, v.offnum, v.lp_off, " > + "v.lp_flags, v.lp_len, v.attnum, v.chunk, v.msg" > + "\nFROM verify_heapam(rel := %u, on_error_stop := %s, " > + "skip := %s, startblock := %s, endblock := %s) v, " > + "pg_class c" > + "\nWHERE c.oid = %u", > + tbloid, stop, skip, settings.startblock, > + settings.endblock, tbloid > > pg_class should be schema-qualified like elsewhere. Agreed, and changed. > IIUC, pg_class is meant to > get relname only, instead, we could use '%u'::pg_catalog.regclass in the target > list for the relname. Thoughts? get_table_check_list() creates the list of all tables to be checked, which check_tables() then iterates over, calling check_table() for each one. I think some verification that the table still exists is in order. Using '%u'::pg_catalog.regclass for a table that has since been dropped would pass in the old table Oid and draw an error of the 'ERROR: could not open relation with OID 36311' variety, whereas the current coding will just skip the dropped table. > Also I think we should skip '\n' from the query string (see appendPQExpBuffer() > in pg_dump.c) I'm not sure I understand. 
pg_dump.c uses "\n" in query strings it passes to appendPQExpBuffer(), in a manner very similar to what this patch does. > + appendPQExpBuffer(query, > + "SELECT i.indexrelid" > + "\nFROM pg_catalog.pg_index i, pg_catalog.pg_class c" > + "\nWHERE i.indexrelid = c.oid" > + "\n AND c.relam = %u" > + "\n AND i.indrelid = %u", > + BTREE_AM_OID, tbloid); > + > + ExecuteSqlStatement("RESET search_path"); > + res = ExecuteSqlQuery(query->data, PGRES_TUPLES_OK); > + PQclear(ExecuteSqlQueryForSingleRow(ALWAYS_SECURE_SEARCH_PATH_SQL)); > > I don't think we need the search_path query. The main query doesn't have any > dependencies on it. Same is in check_indexes(), check_index (), > expand_table_name_patterns() & get_table_check_list(). > Correct me if I am missing something. Right. > + output = PageOutput(lines + 2, NULL); > + for (lineno = 0; usage_text[lineno]; lineno++) > + fprintf(output, "%s\n", usage_text[lineno]); > + fprintf(output, "Report bugs to <%s>.\n", PACKAGE_BUGREPORT); > + fprintf(output, "%s home page: <%s>\n", PACKAGE_NAME, PACKAGE_URL); > > I am not sure why we want PageOutput() if the second argument is always going to > be NULL? Can't we directly use printf() instead of PageOutput() + fprintf() ? > e.g. usage() function in pg_basebackup.c. Done. Please find attached the next version of the patch. In addition to your review comments (above), I have made changes in response to Peter and Robert's review comments upthread. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Thu, Aug 20, 2020 at 8:00 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > > > > On Aug 16, 2020, at 9:37 PM, Amul Sul <sulamul@gmail.com> wrote: > > > > In addition to this, I found a few more things while reading v13 patch are as > > below: > > > > Patch v13-0001: > > > > - > > +#include "amcheck.h" > > > > Not in correct order. > > Fixed. > > > +typedef struct BtreeCheckContext > > +{ > > + TupleDesc tupdesc; > > + Tuplestorestate *tupstore; > > + bool is_corrupt; > > + bool on_error_stop; > > +} BtreeCheckContext; > > > > Unnecessary spaces/tabs between } and BtreeCheckContext. > > This refers to a change in verify_nbtree.c that has been removed. Per discussions with Peter and Robert, I have simplywithdrawn that portion of the patch. > > > static void bt_index_check_internal(Oid indrelid, bool parentcheck, > > - bool heapallindexed, bool rootdescend); > > + bool heapallindexed, bool rootdescend, > > + BtreeCheckContext * ctx); > > > > Unnecessary space between * and ctx. The same changes needed for other places as > > well. > > Same as above. The changes to verify_nbtree.c have been withdrawn. > > > --- > > > > Patch v13-0002: > > > > +-- partitioned tables (the parent ones) don't have visibility maps > > +create table test_partitioned (a int, b text default repeat('x', 5000)) > > + partition by list (a); > > +-- these should all fail > > +select * from verify_heapam('test_partitioned', > > + on_error_stop := false, > > + skip := NULL, > > + startblock := NULL, > > + endblock := NULL); > > +ERROR: "test_partitioned" is not a table, materialized view, or TOAST table > > +create table test_partition partition of test_partitioned for values in (1); > > +create index test_index on test_partition (a); > > > > Can't we make it work? If the input is partitioned, I think we could > > collect all its leaf partitions and process them one by one. Thoughts? > > I was following the example from pg_visibility. I haven't thought about your proposal enough to have much opinion as yet,except that if we do this for pg_amcheck we should do likewise to pg_visibility, for consistency of the user interface. > pg_visibility does exist from before the declarative partitioning came in, I think it's time to improve that as well. > > + ctx->chunkno++; > > > > Instead of incrementing in check_toast_tuple(), I think incrementing should > > happen at the caller -- just after check_toast_tuple() call. > > I agree. > > > --- > > > > Patch v13-0003: > > > > + resetPQExpBuffer(query); > > + destroyPQExpBuffer(query); > > > > resetPQExpBuffer() will be unnecessary if the next call is destroyPQExpBuffer(). > > Thanks. I removed it in cases where destroyPQExpBuffer is obviously the very next call. > > > + appendPQExpBuffer(query, > > + "SELECT c.relname, v.blkno, v.offnum, v.lp_off, " > > + "v.lp_flags, v.lp_len, v.attnum, v.chunk, v.msg" > > + "\nFROM verify_heapam(rel := %u, on_error_stop := %s, " > > + "skip := %s, startblock := %s, endblock := %s) v, " > > + "pg_class c" > > + "\nWHERE c.oid = %u", > > + tbloid, stop, skip, settings.startblock, > > + settings.endblock, tbloid > > > > pg_class should be schema-qualified like elsewhere. > > Agreed, and changed. > > > IIUC, pg_class is meant to > > get relname only, instead, we could use '%u'::pg_catalog.regclass in the target > > list for the relname. Thoughts? > > get_table_check_list() creates the list of all tables to be checked, which check_tables() then iterates over, calling check_table()for each one. 
I think some verification that the table still exists is in order. Using '%u'::pg_catalog.regclassfor a table that has since been dropped would pass in the old table Oid and draw an error of the'ERROR: could not open relation with OID 36311' variety, whereas the current coding will just skip the dropped table. > > > Also I think we should skip '\n' from the query string (see appendPQExpBuffer() > > in pg_dump.c) > > I'm not sure I understand. pg_dump.c uses "\n" in query strings it passes to appendPQExpBuffer(), in a manner very similarto what this patch does. > I see there is a mix of styles, I was referring to dumpDatabase() from pg_dump.c which doesn't include '\n'. > > + appendPQExpBuffer(query, > > + "SELECT i.indexrelid" > > + "\nFROM pg_catalog.pg_index i, pg_catalog.pg_class c" > > + "\nWHERE i.indexrelid = c.oid" > > + "\n AND c.relam = %u" > > + "\n AND i.indrelid = %u", > > + BTREE_AM_OID, tbloid); > > + > > + ExecuteSqlStatement("RESET search_path"); > > + res = ExecuteSqlQuery(query->data, PGRES_TUPLES_OK); > > + PQclear(ExecuteSqlQueryForSingleRow(ALWAYS_SECURE_SEARCH_PATH_SQL)); > > > > I don't think we need the search_path query. The main query doesn't have any > > dependencies on it. Same is in check_indexes(), check_index (), > > expand_table_name_patterns() & get_table_check_list(). > > Correct me if I am missing something. > > Right. > > > + output = PageOutput(lines + 2, NULL); > > + for (lineno = 0; usage_text[lineno]; lineno++) > > + fprintf(output, "%s\n", usage_text[lineno]); > > + fprintf(output, "Report bugs to <%s>.\n", PACKAGE_BUGREPORT); > > + fprintf(output, "%s home page: <%s>\n", PACKAGE_NAME, PACKAGE_URL); > > > > I am not sure why we want PageOutput() if the second argument is always going to > > be NULL? Can't we directly use printf() instead of PageOutput() + fprintf() ? > > e.g. usage() function in pg_basebackup.c. > > Done. > > > Please find attached the next version of the patch. In addition to your review comments (above), I have made changes inresponse to Peter and Robert's review comments upthread. Thanks for the updated version, I'll have a look. Regards, Amul
Few comments for v14 version: v14-0001: verify_heapam.c: In function ‘verify_heapam’: verify_heapam.c:339:14: warning: variable ‘ph’ set but not used [-Wunused-but-set-variable] PageHeader ph; ^ verify_heapam.c: In function ‘check_toast_tuple’: verify_heapam.c:877:8: warning: variable ‘chunkdata’ set but not used [-Wunused-but-set-variable] char *chunkdata; I got these compilation warnings. +++ b/contrib/amcheck/amcheck.h @@ -0,0 +1,5 @@ +#include "postgres.h" + +Datum verify_heapam(PG_FUNCTION_ARGS); +Datum bt_index_check(PG_FUNCTION_ARGS); +Datum bt_index_parent_check(PG_FUNCTION_ARGS); Are the bt_index_* declarations needed? #include "access/htup_details.h" #include "access/xact.h" #include "catalog/pg_type.h" #include "catalog/storage_xlog.h" #include "storage/smgr.h" #include "utils/lsyscache.h" #include "utils/rel.h" #include "utils/snapmgr.h" #include "utils/syscache.h" These header file inclusions in verify_heapam.c can be omitted. Some of them might be implicitly added by other header files, or are no longer needed due to recent changes. + * on_error_stop: + * Whether to stop at the end of the first page for which errors are + * detected. Note that multiple rows may be returned. + * + * check_toast: + * Whether to check each toasted attribute against the toast table to + * verify that it can be found there. + * + * skip: + * What kinds of pages in the heap relation should be skipped. Valid + * options are "all-visible", "all-frozen", and "none". I think it would be good if the description also included what the default value will be otherwise. + /* + * Optionally open the toast relation, if any, also protected from + * concurrent vacuums. + */ Now that the lock has been changed to AccessShareLock, I think we need to rephrase this comment as well, since we are not really doing anything extra to explicitly protect against concurrent vacuums. +/* + * Return wehter a multitransaction ID is in the cached valid range. + */ Typo: s/wehter/whether v14-0002: +#define NOPAGER 0 Unused macro. + appendPQExpBuffer(querybuf, + "SELECT c.relname, v.blkno, v.offnum, v.attnum, v.msg" + "\nFROM public.verify_heapam(" + "\nrelation := %u," + "\non_error_stop := %s," + "\nskip := %s," + "\ncheck_toast := %s," + "\nstartblock := %s," + "\nendblock := %s) v, " + "\npg_catalog.pg_class c" + "\nWHERE c.oid = %u", + tbloid, stop, skip, toast, startblock, endblock, tbloid); [....] + appendPQExpBuffer(querybuf, + "SELECT public.bt_index_parent_check('%s'::regclass, %s, %s)", + idxoid, + settings.heapallindexed ? "true" : "false", + settings.rootdescend ? "true" : "false"); The assumption that the amcheck extension will always be installed in the public schema doesn't seem to be correct. This will not work if amcheck is installed somewhere else. Regards, Amul On Thu, Aug 20, 2020 at 5:17 PM Amul Sul <sulamul@gmail.com> wrote: > > On Thu, Aug 20, 2020 at 8:00 AM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: > > > > > > > > > On Aug 16, 2020, at 9:37 PM, Amul Sul <sulamul@gmail.com> wrote: > > > > > > In addition to this, I found a few more things while reading v13 patch are as > > > below: > > > > > > Patch v13-0001: > > > > > > - > > > +#include "amcheck.h" > > > > > > Not in correct order. > > > > Fixed. > > > > > +typedef struct BtreeCheckContext > > > +{ > > > + TupleDesc tupdesc; > > > + Tuplestorestate *tupstore; > > > + bool is_corrupt; > > > + bool on_error_stop; > > > +} BtreeCheckContext; > > > > > > Unnecessary spaces/tabs between } and BtreeCheckContext. > > > > This refers to a change in verify_nbtree.c that has been removed. 
Per discussions with Peter and Robert, I have simplywithdrawn that portion of the patch. > > > > > static void bt_index_check_internal(Oid indrelid, bool parentcheck, > > > - bool heapallindexed, bool rootdescend); > > > + bool heapallindexed, bool rootdescend, > > > + BtreeCheckContext * ctx); > > > > > > Unnecessary space between * and ctx. The same changes needed for other places as > > > well. > > > > Same as above. The changes to verify_nbtree.c have been withdrawn. > > > > > --- > > > > > > Patch v13-0002: > > > > > > +-- partitioned tables (the parent ones) don't have visibility maps > > > +create table test_partitioned (a int, b text default repeat('x', 5000)) > > > + partition by list (a); > > > +-- these should all fail > > > +select * from verify_heapam('test_partitioned', > > > + on_error_stop := false, > > > + skip := NULL, > > > + startblock := NULL, > > > + endblock := NULL); > > > +ERROR: "test_partitioned" is not a table, materialized view, or TOAST table > > > +create table test_partition partition of test_partitioned for values in (1); > > > +create index test_index on test_partition (a); > > > > > > Can't we make it work? If the input is partitioned, I think we could > > > collect all its leaf partitions and process them one by one. Thoughts? > > > > I was following the example from pg_visibility. I haven't thought about your proposal enough to have much opinion asyet, except that if we do this for pg_amcheck we should do likewise to pg_visibility, for consistency of the user interface. > > > > pg_visibility does exist from before the declarative partitioning came > in, I think it's time to improve that as well. > > > > + ctx->chunkno++; > > > > > > Instead of incrementing in check_toast_tuple(), I think incrementing should > > > happen at the caller -- just after check_toast_tuple() call. > > > > I agree. > > > > > --- > > > > > > Patch v13-0003: > > > > > > + resetPQExpBuffer(query); > > > + destroyPQExpBuffer(query); > > > > > > resetPQExpBuffer() will be unnecessary if the next call is destroyPQExpBuffer(). > > > > Thanks. I removed it in cases where destroyPQExpBuffer is obviously the very next call. > > > > > + appendPQExpBuffer(query, > > > + "SELECT c.relname, v.blkno, v.offnum, v.lp_off, " > > > + "v.lp_flags, v.lp_len, v.attnum, v.chunk, v.msg" > > > + "\nFROM verify_heapam(rel := %u, on_error_stop := %s, " > > > + "skip := %s, startblock := %s, endblock := %s) v, " > > > + "pg_class c" > > > + "\nWHERE c.oid = %u", > > > + tbloid, stop, skip, settings.startblock, > > > + settings.endblock, tbloid > > > > > > pg_class should be schema-qualified like elsewhere. > > > > Agreed, and changed. > > > > > IIUC, pg_class is meant to > > > get relname only, instead, we could use '%u'::pg_catalog.regclass in the target > > > list for the relname. Thoughts? > > > > get_table_check_list() creates the list of all tables to be checked, which check_tables() then iterates over, callingcheck_table() for each one. I think some verification that the table still exists is in order. Using '%u'::pg_catalog.regclassfor a table that has since been dropped would pass in the old table Oid and draw an error of the'ERROR: could not open relation with OID 36311' variety, whereas the current coding will just skip the dropped table. > > > > > Also I think we should skip '\n' from the query string (see appendPQExpBuffer() > > > in pg_dump.c) > > > > I'm not sure I understand. 
pg_dump.c uses "\n" in query strings it passes to appendPQExpBuffer(), in a manner very similarto what this patch does. > > > > I see there is a mix of styles, I was referring to dumpDatabase() from pg_dump.c > which doesn't include '\n'. > > > > + appendPQExpBuffer(query, > > > + "SELECT i.indexrelid" > > > + "\nFROM pg_catalog.pg_index i, pg_catalog.pg_class c" > > > + "\nWHERE i.indexrelid = c.oid" > > > + "\n AND c.relam = %u" > > > + "\n AND i.indrelid = %u", > > > + BTREE_AM_OID, tbloid); > > > + > > > + ExecuteSqlStatement("RESET search_path"); > > > + res = ExecuteSqlQuery(query->data, PGRES_TUPLES_OK); > > > + PQclear(ExecuteSqlQueryForSingleRow(ALWAYS_SECURE_SEARCH_PATH_SQL)); > > > > > > I don't think we need the search_path query. The main query doesn't have any > > > dependencies on it. Same is in check_indexes(), check_index (), > > > expand_table_name_patterns() & get_table_check_list(). > > > Correct me if I am missing something. > > > > Right. > > > > > + output = PageOutput(lines + 2, NULL); > > > + for (lineno = 0; usage_text[lineno]; lineno++) > > > + fprintf(output, "%s\n", usage_text[lineno]); > > > + fprintf(output, "Report bugs to <%s>.\n", PACKAGE_BUGREPORT); > > > + fprintf(output, "%s home page: <%s>\n", PACKAGE_NAME, PACKAGE_URL); > > > > > > I am not sure why we want PageOutput() if the second argument is always going to > > > be NULL? Can't we directly use printf() instead of PageOutput() + fprintf() ? > > > e.g. usage() function in pg_basebackup.c. > > > > Done. > > > > > > Please find attached the next version of the patch. In addition to your review comments (above), I have made changesin response to Peter and Robert's review comments upthread. > > Thanks for the updated version, I'll have a look. > > Regards, > Amul
> On Aug 24, 2020, at 2:48 AM, Amul Sul <sulamul@gmail.com> wrote: > > Few comments for v14 version: > > v14-0001: > > verify_heapam.c: In function ‘verify_heapam’: > verify_heapam.c:339:14: warning: variable ‘ph’ set but not used > [-Wunused-but-set-variable] > PageHeader ph; > ^ > verify_heapam.c: In function ‘check_toast_tuple’: > verify_heapam.c:877:8: warning: variable ‘chunkdata’ set but not used > [-Wunused-but-set-variable] > char *chunkdata; > > Got these compilation warnings Removed. > > > +++ b/contrib/amcheck/amcheck.h > @@ -0,0 +1,5 @@ > +#include "postgres.h" > + > +Datum verify_heapam(PG_FUNCTION_ARGS); > +Datum bt_index_check(PG_FUNCTION_ARGS); > +Datum bt_index_parent_check(PG_FUNCTION_ARGS); > > bt_index_* are needed? This entire header file is not needed. Removed. > #include "access/htup_details.h" > #include "access/xact.h" > #include "catalog/pg_type.h" > #include "catalog/storage_xlog.h" > #include "storage/smgr.h" > #include "utils/lsyscache.h" > #include "utils/rel.h" > #include "utils/snapmgr.h" > #include "utils/syscache.h" > > These header file inclusion to verify_heapam.c. can be omitted. Some of those > might be implicitly got added by other header files or no longer need due to > recent changes. Removed. > + * on_error_stop: > + * Whether to stop at the end of the first page for which errors are > + * detected. Note that multiple rows may be returned. > + * > + * check_toast: > + * Whether to check each toasted attribute against the toast table to > + * verify that it can be found there. > + * > + * skip: > + * What kinds of pages in the heap relation should be skipped. Valid > + * options are "all-visible", "all-frozen", and "none". > > I think it would be good if the description also includes what will be default > value otherwise. The defaults are defined in amcheck--1.2--1.3.sql, and I was concerned that documenting them in verify_heapam.c would create a hazard of the defaults and their documented values getting out of sync. The handling of null arguments in verify_heapam.c was, however, duplicating the defaults from the .sql file, so I've changed that to just ereport error on null. (I can't make the whole function strict, as some other arguments are allowed to be null.) I have not documented the defaults in either file, as they are quite self-evident in the .sql file. I've updated some tests that were passing null to get the default behavior to now either pass nothing or explicitly pass the argument they want. > > + /* > + * Optionally open the toast relation, if any, also protected from > + * concurrent vacuums. > + */ > > Now lock is changed to AccessShareLock, I think we need to rephrase this comment > as well since we are not really doing anything extra explicitly to protect from > the concurrent vacuum. Right. Comment changed. > +/* > + * Return wehter a multitransaction ID is in the cached valid range. > + */ > > Typo: s/wehter/whether Changed. > v14-0002: > > +#define NOPAGER 0 > > Unused macro. Removed. > + appendPQExpBuffer(querybuf, > + "SELECT c.relname, v.blkno, v.offnum, v.attnum, v.msg" > + "\nFROM public.verify_heapam(" > + "\nrelation := %u," > + "\non_error_stop := %s," > + "\nskip := %s," > + "\ncheck_toast := %s," > + "\nstartblock := %s," > + "\nendblock := %s) v, " > + "\npg_catalog.pg_class c" > + "\nWHERE c.oid = %u", > + tbloid, stop, skip, toast, startblock, endblock, tbloid); > [....] > + appendPQExpBuffer(querybuf, > + "SELECT public.bt_index_parent_check('%s'::regclass, %s, %s)", > + idxoid, > + settings.heapallindexed ? 
"true" : "false", > + settings.rootdescend ? "true" : "false"); > > The assumption that the amcheck extension will be always installed in the public > schema doesn't seem to be correct. This will not work if amcheck install > somewhere else. Right. I removed the schema qualification, leaving it up to the search path. Thanks for the review! — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
> On Aug 25, 2020, at 19:36, Mark Dilger <mark.dilger@enterprisedb.com> wrote: Hi Mark! Thanks for working on this important feature. I was experimenting a bit with our internal heapcheck and found out that it's not helping with truncated CLOG anyhow. Will your module be able to gather tid's of similar corruptions?

server/db M # select * from heap_check('pg_toast.pg_toast_4848601');
ERROR: 58P01: could not access status of transaction 636558742
DETAIL: Could not open file "pg_xact/025F": No such file or directory.
LOCATION: SlruReportIOError, slru.c:913
Time: 3439.915 ms (00:03.440)

Thanks! Best regards, Andrey Borodin.
On Fri, Aug 28, 2020 at 1:07 AM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > I was experimenting a bit with our internal heapcheck and found out that it's not helping with truncated CLOG anyhow. > Will your module be able to gather tid's of similar corruptions? > > server/db M # select * from heap_check('pg_toast.pg_toast_4848601'); > ERROR: 58P01: could not access status of transaction 636558742 > DETAIL: Could not open file "pg_xact/025F": No such file or directory. > LOCATION: SlruReportIOError, slru.c:913 > Time: 3439.915 ms (00:03.440)

This kind of thing gets really tricky. PostgreSQL uses errors in tons of places to report problems, and if you want to accumulate a list of errors and report them all rather than just letting the first one cancel the operation, you need special handling for each individual error you want to bypass. A tool like this naturally wants to use as much PostgreSQL infrastructure as possible, to avoid duplicating a ton of code and creating a bloated monstrosity, but all that code can throw errors. I think the code in its current form is trying to be resilient against problems on the table pages that it is actually checking, but it can't necessarily handle corruption in other parts of the system gracefully. For instance:

- CLOG could be truncated, as in your example
- the disk files could have had their permissions changed so that they can't be accessed
- the PageIsVerified() check might fail when pages are read
- the TOAST table's metadata in pg_class/pg_attribute/etc. could be corrupted
- ...or the files for those system catalogs could've had their permissions changed
- ...or they could contain invalid pages
- ...or their indexes could be messed up

I think there are probably a bunch more, and I don't think it's practical to allow this tool to continue after arbitrary stuff goes wrong. It'll be too much code and impossible to maintain. In the case you mention, I think we should view that as a problem with clog rather than a problem with the table, and thus out of scope. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Aug 27, 2020, at 10:07 PM, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > > >> On Aug 25, 2020, at 7:36 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > Hi Mark! > > Thanks for working on this important feature. > > I was experimenting a bit with our internal heapcheck and found out that it's not helping with truncated CLOG anyhow. > Will your module be able to gather TIDs of similar corruptions? > > server/db M # select * from heap_check('pg_toast.pg_toast_4848601'); > ERROR: 58P01: could not access status of transaction 636558742 > DETAIL: Could not open file "pg_xact/025F": No such file or directory. > LOCATION: SlruReportIOError, slru.c:913 > Time: 3439.915 ms (00:03.440) The design principle for verify_heapam.c is, if the rest of the system is not corrupt, corruption in the table being checked should not cause a crash during the table check. This is a very limited principle. Even corruption in the associated toast table or toast index could cause a crash. That is why checking against the toast table is optional, and false by default. Perhaps a more extensive effort could be made later. I think it is out of scope for this release cycle. It is a very interesting area for further research, though. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Aug 28, 2020, at 6:58 PM, Robert Haas <robertmhaas@gmail.com> wrote: > In the case > you mention, I think we should view that as a problem with clog rather > than a problem with the table, and thus out of scope. I don't think so. ISTM it's the same problem of xmax<relfrozenxid actually, just hidden behind detoasting. Our regular heap_check was checking xmin\xmax invariants for tables, but failed to recognise the problem in toast (while toast was accessible until CLOG truncation). Best regards, Andrey Borodin.
> On Aug 28, 2020, at 11:10 AM, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > > >> On Aug 28, 2020, at 6:58 PM, Robert Haas <robertmhaas@gmail.com> wrote: >> In the case >> you mention, I think we should view that as a problem with clog rather >> than a problem with the table, and thus out of scope. > > I don't think so. ISTM it's the same problem of xmax<relfrozenxid actually, just hidden behind detoasing. > Our regular heap_check was checking xmin\xmax invariants for tables, but failed to recognise the problem in toast (while toast was accessible until CLOG truncation). > > Best regards, Andrey Borodin. If you lock the relations involved, check the toast table first, the toast index second, and the main table third, do you still get the problem? Look at how pg_amcheck handles this and let me know if you still see a problem. There is the ever-present problem that external forces, like a rogue process deleting backend files, will strike at precisely the wrong moment, but barring that kind of concurrent corruption, I think the toast table being checked prior to the main table being checked solves some of the issues you are worried about. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Aug 28, 2020 at 2:10 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > I don't think so. ISTM It's the same problem of xmax<relfrozenxid actually, just hidden behind detoasing. > Our regular heap_check was checking xmin\xmax invariants for tables, but failed to recognise the problem in toast (whiletoast was accessible until CLOG truncation). The code can (and should, and I think does) refrain from looking up XIDs that are out of the range thought to be valid -- but how do you propose that it avoid looking up XIDs that ought to have clog data associated with them despite being >= relfrozenxid and < nextxid? TransactionIdDidCommit() does not have a suppress-errors flag, adding one would be quite invasive, yet we cannot safely perform a significant number of checks without knowing whether the inserting transaction committed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
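For illustration, the kind of range gating Robert describes might look roughly like the sketch below; the names relfrozenxid and next_xid stand in for whatever bounds the checker actually tracks, so this is not the patch's code:

    /*
     * Only consult clog for XIDs that fall inside the window the relation
     * metadata says should still have status data.  Even then, clog itself
     * may be damaged, which is the case that has no cheap remedy.
     */
    if (TransactionIdIsNormal(xid) &&
        !TransactionIdPrecedes(xid, relfrozenxid) &&
        TransactionIdPrecedes(xid, next_xid))
    {
        if (TransactionIdDidCommit(xid))
        {
            /* ... checks that require a committed inserting transaction ... */
        }
    }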
> On Aug 29, 2020, at 12:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Aug 28, 2020 at 2:10 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: >> I don't think so. ISTM it's the same problem of xmax<relfrozenxid actually, just hidden behind detoasing. >> Our regular heap_check was checking xmin\xmax invariants for tables, but failed to recognise the problem in toast (while toast was accessible until CLOG truncation). > > The code can (and should, and I think does) refrain from looking up > XIDs that are out of the range thought to be valid -- but how do you > propose that it avoid looking up XIDs that ought to have clog data > associated with them despite being >= relfrozenxid and < nextxid? > TransactionIdDidCommit() does not have a suppress-errors flag, adding > one would be quite invasive, yet we cannot safely perform a > significant number of checks without knowing whether the inserting > transaction committed. What you write seems completely correct to me. I agree that the CLOG thresholds lookup seems unnecessary. But I have a real corruption at hand (on a testing site). I have the heapcheck proposed here, and I have pg_surgery from the thread nearby. Yet I cannot fix the problem, because I cannot list affected tuples. These tools do not solve the problem neglected for long enough. It would be supercool if they could. This corruption, like caries, had 3 stages: 1. incorrect VM flag that the page does not need vacuum 2. xmin and xmax < relfrozenxid 3. CLOG truncated Stage 2 is curable with the proposed toolset, stage 3 is not. But they are not that different. Thanks! Best regards, Andrey Borodin.
> On Aug 29, 2020, at 3:27 AM, Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: > > > >> On Aug 29, 2020, at 12:56 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Fri, Aug 28, 2020 at 2:10 PM Andrey M. Borodin <x4mmm@yandex-team.ru> wrote: >>> I don't think so. ISTM it's the same problem of xmax<relfrozenxid actually, just hidden behind detoasing. >>> Our regular heap_check was checking xmin\xmax invariants for tables, but failed to recognise the problem in toast (while toast was accessible until CLOG truncation). >> >> The code can (and should, and I think does) refrain from looking up >> XIDs that are out of the range thought to be valid -- but how do you >> propose that it avoid looking up XIDs that ought to have clog data >> associated with them despite being >= relfrozenxid and < nextxid? >> TransactionIdDidCommit() does not have a suppress-errors flag, adding >> one would be quite invasive, yet we cannot safely perform a >> significant number of checks without knowing whether the inserting >> transaction committed. > > What you write seems completely correct to me. I agree that the CLOG thresholds lookup seems unnecessary. > > But I have a real corruption at hand (on a testing site). I have the heapcheck proposed here, and I have pg_surgery from the thread nearby. Yet I cannot fix the problem, because I cannot list affected tuples. These tools do not solve the problem neglected for long enough. It would be supercool if they could. > > This corruption, like caries, had 3 stages: > 1. incorrect VM flag that the page does not need vacuum > 2. xmin and xmax < relfrozenxid > 3. CLOG truncated > > Stage 2 is curable with the proposed toolset, stage 3 is not. But they are not that different. I had an earlier version of the verify_heapam patch that included a non-throwing interface to clog. Ultimately, I ripped that out. My reasoning was that a simpler patch submission was more likely to be acceptable to the community. If you want to submit a separate patch that creates a non-throwing version of the clog interface, and get the community to accept and commit it, I would seriously consider using that from verify_heapam. If it gets committed in time, I might even do so for this release cycle. But I don't want to make this patch dependent on that hypothetical patch getting written and accepted. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
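To make "a non-throwing version of the clog interface" concrete, its shape might be something along the lines of the sketch below; the type and function names here are purely hypothetical and are not taken from any posted patch:

    /* Hypothetical sketch only -- not an existing PostgreSQL API */
    typedef enum XidCommitStatus
    {
        XID_COMMITTED,
        XID_ABORTED,
        XID_IN_PROGRESS,
        XID_STATUS_UNKNOWN          /* clog page missing or unreadable */
    } XidCommitStatus;

    /*
     * Like TransactionIdDidCommit(), but reports lookup failure through the
     * status argument instead of raising an error.
     */
    extern bool TransactionIdGetStatusNoThrow(TransactionId xid,
                                              XidCommitStatus *status);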
On Tue, Aug 25, 2020 at 10:36 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > Thanks for the review! + msg OUT text + ) Looks like atypical formatting. +REVOKE ALL ON FUNCTION +verify_heapam(regclass, boolean, boolean, cstring, bigint, bigint) +FROM PUBLIC; This too. +-- Don't want this to be available to public Add "by default, but superusers can grant access" or so? I think there should be a call to pg_class_aclcheck() here, just like the one in pg_prewarm, so that if the superuser does choose to grant access, users given access can check tables they anyway have permission to access, but not others. Maybe put that in check_relation_relkind_and_relam() and rename it. Might want to look at the pg_surgery precedent, too. Oh, and that functions header comment is also wrong. I think that the way the checks on the block range are performed could be improved. Generally, we want to avoid reporting the same problem with a variety of different message strings, because it adds burden for translators and is potentially confusing for users. You've got two message strings that are only going to be used for empty relations and a third message string that is only going to be used for non-empty relations. What stops you from just ripping off the way that this is done in pg_prewarm, which requires only 2 messages? Then you'd be adding a net total of 0 new messages instead of 3, and in my view they would be clearer than your third message, "block range is out of bounds for relation with block count %u: " INT64_FORMAT " .. " INT64_FORMAT, which doesn't say very precisely what the problem is, and also falls afoul of our usual practice of avoiding the use of INT64_FORMAT in error messages that are subject to translation. I notice that pg_prewarm just silently does nothing if the start and end blocks are swapped, rather than generating an error. We could choose to do differently here, but I'm not sure why we should bother. + all_frozen = mapbits & VISIBILITYMAP_ALL_VISIBLE; + all_visible = mapbits & VISIBILITYMAP_ALL_FROZEN; + + if ((all_frozen && skip_option == SKIP_PAGES_ALL_FROZEN) || + (all_visible && skip_option == SKIP_PAGES_ALL_VISIBLE)) + { + continue; + } This isn't horrible style, but why not just get rid of the local variables? e.g. if (skip_option == SKIP_PAGES_ALL_FROZEN) { if ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0) continue; } else { ... } Typically no braces around a block containing only one line. + * table contains corrupt all frozen bits, a concurrent vacuum might skip the all-frozen? + * relfrozenxid beyond xid.) Reporting the xid as valid under such conditions + * seems acceptable, since if we had checked it earlier in our scan it would + * have truly been valid at that time, and we break no MVCC guarantees by + * failing to notice the concurrent change in its status. I agree with the first half of this sentence, but I don't know what MVCC guarantees have to do with anything. I'd just delete the second part, or make it a lot clearer. + * Some kinds of tuple header corruption make it unsafe to check the tuple + * attributes, for example when the tuple is foreshortened and such checks + * would read beyond the end of the line pointer (and perhaps the page). In I think of foreshortening mostly as an art term, though I guess it has other meanings. Maybe it would be clearer to say something like "Some kinds of corruption make it unsafe to check the tuple attributes, for example when the line pointer refers to a range of bytes outside the page"? 
+ * Other kinds of tuple header corruption do not bare on the question of bear + pstrdup(_("updating transaction ID marked incompatibly as keys updated and locked only"))); + pstrdup(_("updating transaction ID marked incompatibly as committed and as a multitransaction ID"))); "updating transaction ID" might scare somebody who thinks that you are telling them that you changed something. That's not what it means, but it might not be totally clear. Maybe: tuple is marked as only locked, but also claims key columns were updated multixact should not be marked committed + psprintf(_("data offset differs from expected: %u vs. %u (1 attribute, has nulls)"), For these, how about: tuple data should begin at byte %u, but actually begins at byte %u (1 attribute, has nulls) etc. + psprintf(_("old-style VACUUM FULL transaction ID is in the future: %u"), + psprintf(_("old-style VACUUM FULL transaction ID precedes freeze threshold: %u"), + psprintf(_("old-style VACUUM FULL transaction ID is invalid in this relation: %u"), old-style VACUUM FULL transaction ID %u is in the future old-style VACUUM FULL transaction ID %u precedes freeze threshold %u old-style VACUUM FULL transaction ID %u out of range %u..%u Doesn't the second of these overlap with the third? Similarly in other places, e.g. + psprintf(_("inserting transaction ID is in the future: %u"), I think this should change to: inserting transaction ID %u is in the future + else if (VARATT_IS_SHORT(chunk)) + /* + * could happen due to heap_form_tuple doing its thing + */ + chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT; Add braces here, since there are multiple lines. + psprintf(_("toast chunk sequence number not the expected sequence number: %u vs. %u"), toast chunk sequence number %u does not match expected sequence number %u There are more instances of this kind of thing. + psprintf(_("toasted attribute has unexpected TOAST tag: %u"), Remove colon. + psprintf(_("attribute ends at offset beyond total tuple length: %u vs. %u (attribute length %u)"), Let's try to specify the attribute number in the attribute messages where we can, e.g. + psprintf(_("attribute ends at offset beyond total tuple length: %u vs. %u (attribute length %u)"), How about: attribute %u with length %u should end at offset %u, but the tuple length is only %u + if (TransactionIdIsNormal(ctx->relfrozenxid) && + TransactionIdPrecedes(xmin, ctx->relfrozenxid)) + { + report_corruption(ctx, + /* translator: Both %u are transaction IDs. */ + psprintf(_("inserting transaction ID is from before freeze cutoff: %u vs. %u"), + xmin, ctx->relfrozenxid)); + fatal = true; + } + else if (!xid_valid_in_rel(xmin, ctx)) + { + report_corruption(ctx, + /* translator: %u is a transaction ID. */ + psprintf(_("inserting transaction ID is in the future: %u"), + xmin)); + fatal = true; + } This seems like good evidence that xid_valid_in_rel needs some rethinking. As far as I can see, every place where you call xid_valid_in_rel, you have checks beforehand that duplicate some of what it does, so that you can give a more accurate error message. That's not good. Either the message should be adjusted so that it covers all the cases "e.g. tuple xmin %u is outside acceptable range %u..%u" or we should just get rid of xid_valid_in_rel() and have separate error messages for each case, e.g. tuple xmin %u precedes relfrozenxid %u". I think it's OK to use terms like xmin and xmax in these messages, rather than inserting transaction ID etc. 
We have existing instances of that, and while someone might judge it user-unfriendly, I disagree. A person who is qualified to interpret this output must know what 'tuple xmin' means immediately, but whether they can understand that 'inserting transaction ID' means the same thing is questionable, I think. This is not a full review, but in general I think that this is getting pretty close to being committable. The error messages seem to still need some polishing and I wouldn't be surprised if there are a few more bugs lurking yet, but I think it's come a long way. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
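To illustrate the first option (a single consolidated range message), the xmin check might collapse to something like the sketch below, where next_xid is a placeholder for the checker's upper bound and special XIDs such as FrozenTransactionId are assumed to have been handled already; this is an illustration, not the patch's actual code:

    if (TransactionIdPrecedes(xmin, ctx->relfrozenxid) ||
        !TransactionIdPrecedes(xmin, next_xid))
    {
        report_corruption(ctx,
                          psprintf("tuple xmin %u is outside acceptable range %u..%u",
                                   xmin, ctx->relfrozenxid, next_xid));
        fatal = true;
    }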
> On Sep 21, 2020, at 2:09 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > I think there should be a call to pg_class_aclcheck() here, just like > the one in pg_prewarm, so that if the superuser does choose to grant > access, users given access can check tables they anyway have > permission to access, but not others. Maybe put that in > check_relation_relkind_and_relam() and rename it. Might want to look > at the pg_surgery precedent, too. In the presence of corruption, verify_heapam() reports to the user (in other words, leaks) metadata about the corrupted rows. Reasoning about the attack vectors this creates is hard, but a conservative approach is to assume that an attacker can cause corruption in order to benefit from the leakage, and make sure the leakage does not violate any reasonable security expectations. Basing the security decision on whether the user has access to read the table seems insufficient, as it ignores row level security. Perhaps that is ok if row level security is not enabled for the table or if the user has been granted BYPASSRLS. There is another problem, though. There is no grantable privilege to read dead rows. In the case of corruption, verify_heapam() may well report metadata about dead rows. pg_surgery also appears to leak information about dead rows. Owners of tables can probe whether supplied TIDs refer to dead rows. If a table containing sensitive information has rows deleted prior to ownership being transferred, the new owner of the table could probe each page of deleted data to determine something of the content that was there. Information about the number of deleted rows is already available through the pg_stat_* views, but those views don't give such a fine-grained approach to figuring out how large each deleted row was. For a table with fixed content options, the content can sometimes be completely inferred from the length of the row. (Consider a table with a single text column containing either "approved" or "denied".) But pg_surgery is understood to be a collection of sharp tools only to be used under fairly exceptional conditions. amcheck, on the other hand, is something that feels safer and more reasonable to use on a regular basis, perhaps from a cron job executed by a less trusted user. Forcing the user to be superuser makes it clearer that this feeling of safety is not justified. I am inclined to just restrict verify_heapam() to superusers and be done. What do you think? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Sep 22, 2020 at 10:55 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > I am inclined to just restrict verify_heapam() to superusers and be done. What do you think? The existing amcheck functions were designed to have execute privilege granted to non-superusers, though we never actually advertised that fact. Maybe now would be a good time to start doing so. -- Peter Geoghegan
On Tue, Sep 22, 2020 at 1:55 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > I am inclined to just restrict verify_heapam() to superusers and be done. What do you think? I think that's an old and largely failed approach. If you want to use pg_class_ownercheck here rather than pg_class_aclcheck or something like that, seems fair enough. But I don't think there should be an is-superuser check in the code, because we've been trying really hard to get rid of those in most places. And I also don't think there should be no secondary permissions check, because if somebody does grant execute permission on these functions, it's unlikely that they want the person getting that permission to be able to check every relation in the system even those on which they have no other privileges at all. But now I see that there's no secondary permission check in the verify_nbtree.c code. Is that intentional? Peter, what's the justification for that? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
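For reference, the pg_prewarm-style secondary check being suggested would look roughly like the sketch below (assuming an already-opened Relation named rel; this is an illustration, not the submitted patch):

    AclResult   aclresult;

    aclresult = pg_class_aclcheck(RelationGetRelid(rel), GetUserId(), ACL_SELECT);
    if (aclresult != ACLCHECK_OK)
        aclcheck_error(aclresult,
                       get_relkind_objtype(rel->rd_rel->relkind),
                       RelationGetRelationName(rel));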
On Tue, Sep 22, 2020 at 12:41 PM Robert Haas <robertmhaas@gmail.com> wrote: > But now I see that there's no secondary permission check in the > verify_nbtree.c code. Is that intentional? Peter, what's the > justification for that? As noted by comments in contrib/amcheck/sql/check_btree.sql (the verify_nbtree.c tests), this is intentional. Note that we explicitly test that a non-superuser role can perform verification following GRANT EXECUTE ON FUNCTION ... . As I mentioned earlier, this is supported (or at least it is supported in my interpretation of things). It just isn't documented anywhere outside the test itself. -- Peter Geoghegan
On Mon, Sep 21, 2020 at 2:09 PM Robert Haas <robertmhaas@gmail.com> wrote: > +REVOKE ALL ON FUNCTION > +verify_heapam(regclass, boolean, boolean, cstring, bigint, bigint) > +FROM PUBLIC; > > This too. Do we really want to use a cstring as an enum-like argument? I think that I see a bug at this point in check_tuple() (in v15-0001-Adding-function-verify_heapam-to-amcheck-module.patch): > + /* If xmax is a multixact, it should be within valid range */ > + xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr); > + if ((infomask & HEAP_XMAX_IS_MULTI) && !mxid_valid_in_rel(xmax, ctx)) > + { *** SNIP *** > + } > + > + /* If xmax is normal, it should be within valid range */ > + if (TransactionIdIsNormal(xmax)) > + { Why should it be okay to call TransactionIdIsNormal(xmax) at this point? It isn't certain that xmax is an XID at all (could be a MultiXactId, since you called HeapTupleHeaderGetRawXmax() to get the value in the first place). Don't you need to check "(infomask & HEAP_XMAX_IS_MULTI) == 0" here? This does look like it's shaping up. Thanks for working on it, Mark. -- Peter Geoghegan
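In other words, the structure Peter is suggesting amounts to something like the following sketch, with the branch bodies elided:

    xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr);
    if (infomask & HEAP_XMAX_IS_MULTI)
    {
        /* raw xmax is a MultiXactId -- validate it against the multixact range */
    }
    else if (TransactionIdIsNormal(xmax))
    {
        /* raw xmax is an ordinary transaction ID -- validate it against the XID range */
    }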
On Sat, Aug 29, 2020 at 10:48 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > I had an earlier version of the verify_heapam patch that included a non-throwing interface to clog. Ultimately, I rippedthat out. My reasoning was that a simpler patch submission was more likely to be acceptable to the community. Isn't some kind of pragmatic compromise possible? > But I don't want to make this patch dependent on that hypothetical patch getting written and accepted. Fair enough, but if you're alluding to what I said then about check_tuphdr_xids()/clog checking a while back then FWIW I didn't intend to block progress on clog/xact status verification at all. I just don't think that it is sensible to impose an iron clad guarantee about having no assertion failures with corrupt clog data -- that leads to far too much code duplication. But why should you need to provide an absolute guarantee of that? I for one would be fine with making the clog checks an optional extra, that rescinds the no crash guarantee that you're keen on -- just like with the TOAST checks that you have already in v15. It might make sense to review how often crashes occur with simulated corruption, and then to minimize the number of occurrences in the real world. Maybe we could tolerate a usually-no-crash interface to clog -- if it could still have assertion failures. Making a strong guarantee about assertions seems unnecessary. I don't see how verify_heapam will avoid raising an error during basic validation from PageIsVerified(), which will violate the guarantee about not throwing errors. I don't see that as a problem myself, but presumably you will. -- Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes: > On Mon, Sep 21, 2020 at 2:09 PM Robert Haas <robertmhaas@gmail.com> wrote: >> +REVOKE ALL ON FUNCTION >> +verify_heapam(regclass, boolean, boolean, cstring, bigint, bigint) >> +FROM PUBLIC; >> >> This too. > Do we really want to use a cstring as an enum-like argument? Ugh. We should not be using cstring as a SQL-exposed datatype unless there really is no alternative. Why wasn't this argument declared "text"? regards, tom lane
Greetings, * Peter Geoghegan (pg@bowt.ie) wrote: > On Tue, Sep 22, 2020 at 12:41 PM Robert Haas <robertmhaas@gmail.com> wrote: > > But now I see that there's no secondary permission check in the > > verify_nbtree.c code. Is that intentional? Peter, what's the > > justification for that? > > As noted by comments in contrib/amcheck/sql/check_btree.sql (the > verify_nbtree.c tests), this is intentional. Note that we explicitly > test that a non-superuser role can perform verification following > GRANT EXECUTE ON FUNCTION ... . > As I mentioned earlier, this is supported (or at least it is supported > in my interpretation of things). It just isn't documented anywhere > outside the test itself. Would certainly be good to document this but I tend to agree with the comments that ideally- a) it'd be nice for a relatively low-privileged user/process could run the tests in an ongoing manner b) we don't want to add more is-superuser checks c) users shouldn't really be given the ability to see rows they're not supposed to have access to In other places in the code, when an error is generated and the user doesn't have access to the underlying table or doesn't have BYPASSRLS, we don't include the details or the actual data in the error. Perhaps that approach would make sense here (or perhaps not, but it doesn't seem entirely crazy to me, anyway). In other words: a) keep the ability for someone who has EXECUTE on the function to be able to run the function against any relation b) when we detect an issue, perform a permissions check to see if the user calling the function has rights to read the rows of the table and, if RLS is enabled on the table, if they have BYPASSRLS c) if the user has appropriate privileges, log the detailed error, if not, return a generic error with a HINT that details weren't available due to lack of privileges on the relation I can appreciate the concerns regarding dead rows ending up being visible to someone who wouldn't normally be able to see them but I'd argue we could simply document that fact rather than try to build something to address it, for this particular case. If there's push back on that then I'd suggest we have a "can read dead rows" or some such capability that can be GRANT'd (in the form of a default role, I would think) which a user would also have to have in order to get detailed error reports from this function. Thanks, Stephen
Attachment
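A rough sketch of the "generic message unless privileged" idea from Stephen's list might look like the following; report_corruption and ctx mirror the patch's reporting interface, while relid and detailed_msg are placeholders, and the RLS/BYPASSRLS part of the check is omitted here:

    /* Illustrative sketch only */
    if (pg_class_aclcheck(relid, GetUserId(), ACL_SELECT) == ACLCHECK_OK)
        report_corruption(ctx, detailed_msg);
    else
        report_corruption(ctx,
                          pstrdup("corruption detected; details withheld due to "
                                  "insufficient privileges on the relation"));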
On Tue, Aug 25, 2020 at 07:36:53AM -0700, Mark Dilger wrote: > Removed. This patch is failing to compile on Windows: C:\projects\postgresql\src\include\fe_utils/print.h(18): fatal error C1083: Cannot open include file: 'libpq-fe.h': No such file or directory [C:\projects\postgresql\pg_amcheck.vcxproj] It looks like you forgot to tweak the scripts in src/tools/msvc/. -- Michael
Attachment
Robert, Peter, Andrey, Stephen, and Michael, Attached is a new version based in part on your review comments, quoted and responded to below as necessary. There remain a few open issues and/or things I did not implement: - This version follows Robert's suggestion of using pg_class_aclcheck() to check that the caller has permission to select from the table being checked. This is inconsistent with the btree checking logic, which does no such check. These two approaches should be reconciled, but there was apparently no agreement on this issue. - The public-facing documentation, currently live at https://www.postgresql.org/docs/13/amcheck.html, claims "amcheck functions may only be used by superusers." The docs on master still say the same. This patch replaces that language with alternate language explaining that execute permissions may be granted to non-superusers, along with a warning about the risk of data leakage. Perhaps some portion of that language in this patch should be back-patched? - Stephen's comments about restricting how much information goes into the returned corruption report depending on the permissions of the caller have not been implemented. I may implement some of this if doing so is consistent with whatever we decide to do for the aclcheck issue, above, though probably not. It seems overly complicated. - This version does not change clog handling, which leaves Andrey's concern unaddressed. Peter also showed some support for (or perhaps just a lack of opposition to) doing more of what Andrey suggests. I may come back to this issue, depending on time available and further feedback. Moving on to Michael's review.... > On Sep 28, 2020, at 10:56 PM, Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Aug 25, 2020 at 07:36:53AM -0700, Mark Dilger wrote: >> Removed. > > This patch is failing to compile on Windows: > C:\projects\postgresql\src\include\fe_utils/print.h(18): fatal error > C1083: Cannot open include file: 'libpq-fe.h': No such file or > directory [C:\projects\postgresql\pg_amcheck.vcxproj] > > It looks like you forgot to tweak the scripts in src/tools/msvc/. Fixed, I think. I have not tested on Windows. Moving on to Stephen's review.... > On Sep 23, 2020, at 6:46 AM, Stephen Frost <sfrost@snowman.net> wrote: > > Greetings, > > * Peter Geoghegan (pg@bowt.ie) wrote: >> On Tue, Sep 22, 2020 at 12:41 PM Robert Haas <robertmhaas@gmail.com> wrote: >>> But now I see that there's no secondary permission check in the >>> verify_nbtree.c code. Is that intentional? Peter, what's the >>> justification for that? >> >> As noted by comments in contrib/amcheck/sql/check_btree.sql (the >> verify_nbtree.c tests), this is intentional. Note that we explicitly >> test that a non-superuser role can perform verification following >> GRANT EXECUTE ON FUNCTION ... . > >> As I mentioned earlier, this is supported (or at least it is supported >> in my interpretation of things). It just isn't documented anywhere >> outside the test itself. > > Would certainly be good to document this but I tend to agree with the > comments that ideally- > > a) it'd be nice for a relatively low-privileged user/process could run > the tests in an ongoing manner > b) we don't want to add more is-superuser checks > c) users shouldn't really be given the ability to see rows they're not > supposed to have access to > > In other places in the code, when an error is generated and the user > doesn't have access to the underlying table or doesn't have BYPASSRLS, > we don't include the details or the actual data in the error. 
Perhaps > that approach would make sense here (or perhaps not, but it doesn't seem > entirely crazy to me, anyway). In other words: > > a) keep the ability for someone who has EXECUTE on the function to be > able to run the function against any relation > b) when we detect an issue, perform a permissions check to see if the > user calling the function has rights to read the rows of the table > and, if RLS is enabled on the table, if they have BYPASSRLS > c) if the user has appropriate privileges, log the detailed error, if > not, return a generic error with a HINT that details weren't > available due to lack of privileges on the relation > > I can appreciate the concerns regarding dead rows ending up being > visible to someone who wouldn't normally be able to see them but I'd > argue we could simply document that fact rather than try to build > something to address it, for this particular case. If there's push back > on that then I'd suggest we have a "can read dead rows" or some such > capability that can be GRANT'd (in the form of a default role, I would > think) which a user would also have to have in order to get detailed > error reports from this function. There wasn't enough agreement on the thread about how this should work, so I left this idea unimplemented. I'm a bit concerned that restricting the results for non-superusers would create a perverse incentive to use a superuser role to connect and check tables. On the other hand, there would not be any difference in the output in the common case that no corruption exists, so maybe the perverse incentive would not be too significant. Implementing the idea you outline would complicate the patch a fair amount, as we'd need to tailor all the reports in this way, and extend the tests to verify we're not leaking any information to non-superusers. I would prefer to find a simpler solution. Moving on to Robert's review.... > On Sep 21, 2020, at 2:09 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Aug 25, 2020 at 10:36 AM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >> Thanks for the review! > > + msg OUT text > + ) > > Looks like atypical formatting. > > +REVOKE ALL ON FUNCTION > +verify_heapam(regclass, boolean, boolean, cstring, bigint, bigint) > +FROM PUBLIC; > > This too. Changed in this next version. > +-- Don't want this to be available to public > > Add "by default, but superusers can grant access" or so? Hmm. I borrowed the verbiage from elsewhere. contrib/pg_buffercache/pg_buffercache--1.2.sql:-- Don't want these to be available to public. contrib/pg_freespacemap/pg_freespacemap--1.1.sql:-- Don't want these to be available to public. contrib/pg_visibility/pg_visibility--1.1.sql:-- Don't want these to be available to public. > I think there should be a call to pg_class_aclcheck() here, just like > the one in pg_prewarm, so that if the superuser does choose to grant > access, users given access can check tables they anyway have > permission to access, but not others. Maybe put that in > check_relation_relkind_and_relam() and rename it. Might want to look > at the pg_surgery precedent, too. I don't think there are any great options here, but for this next version I've done it with pg_class_aclcheck(). > Oh, and that functions header > comment is also wrong. Changed in this next version. > I think that the way the checks on the block range are performed could > be improved. 
Generally, we want to avoid reporting the same problem > with a variety of different message strings, because it adds burden > for translators and is potentially confusing for users. You've got two > message strings that are only going to be used for empty relations and > a third message string that is only going to be used for non-empty > relations. What stops you from just ripping off the way that this is > done in pg_prewarm, which requires only 2 messages? Then you'd be > adding a net total of 0 new messages instead of 3, and in my view they > would be clearer than your third message, "block range is out of > bounds for relation with block count %u: " INT64_FORMAT " .. " > INT64_FORMAT, which doesn't say very precisely what the problem is, > and also falls afoul of our usual practice of avoiding the use of > INT64_FORMAT in error messages that are subject to translation. I > notice that pg_prewarm just silently does nothing if the start and end > blocks are swapped, rather than generating an error. We could choose > to do differently here, but I'm not sure why we should bother. This next version borrows pg_prewarm's messages as you suggest, except that pg_prewarm embeds INT64_FORMAT in the message strings, which are replaced with %u in this next patch. Also, there is no good way to report an invalid block range for empty tables using these messages, so the patch now just exits early in such a case for invalid ranges without throwing an error. This is a little bit non-orthogonal with how invalid block ranges are handled on non-empty tables, but perhaps that's ok. > > + all_frozen = mapbits & VISIBILITYMAP_ALL_VISIBLE; > + all_visible = mapbits & VISIBILITYMAP_ALL_FROZEN; > + > + if ((all_frozen && skip_option == > SKIP_PAGES_ALL_FROZEN) || > + (all_visible && skip_option == > SKIP_PAGES_ALL_VISIBLE)) > + { > + continue; > + } > > This isn't horrible style, but why not just get rid of the local > variables? e.g. if (skip_option == SKIP_PAGES_ALL_FROZEN) { if > ((mapbits & VISIBILITYMAP_ALL_FROZEN) != 0) continue; } else { ... } > > Typically no braces around a block containing only one line. Changed in this next version. > + * table contains corrupt all frozen bits, a concurrent vacuum might skip the > > all-frozen? Changed in this next version. > + * relfrozenxid beyond xid.) Reporting the xid as valid under such conditions > + * seems acceptable, since if we had checked it earlier in our scan it would > + * have truly been valid at that time, and we break no MVCC guarantees by > + * failing to notice the concurrent change in its status. > > I agree with the first half of this sentence, but I don't know what > MVCC guarantees have to do with anything. I'd just delete the second > part, or make it a lot clearer. Changed in this next version to simply omit the MVCC related language. > > + * Some kinds of tuple header corruption make it unsafe to check the tuple > + * attributes, for example when the tuple is foreshortened and such checks > + * would read beyond the end of the line pointer (and perhaps the page). In > > I think of foreshortening mostly as an art term, though I guess it has > other meanings. Maybe it would be clearer to say something like "Some > kinds of corruption make it unsafe to check the tuple attributes, for > example when the line pointer refers to a range of bytes outside the > page"? > > + * Other kinds of tuple header corruption do not bare on the question of > > bear Changed. 
> + pstrdup(_("updating > transaction ID marked incompatibly as keys updated and locked > only"))); > + pstrdup(_("updating > transaction ID marked incompatibly as committed and as a > multitransaction ID"))); > > "updating transaction ID" might scare somebody who thinks that you are > telling them that you changed something. That's not what it means, but > it might not be totally clear. Maybe: > > tuple is marked as only locked, but also claims key columns were updated > multixact should not be marked committed Changed to use your verbiage. > + > psprintf(_("data offset differs from expected: %u vs. %u (1 attribute, > has nulls)"), > > For these, how about: > > tuple data should begin at byte %u, but actually begins at byte %u (1 > attribute, has nulls) > etc. Is it ok to embed interpolated values into the message string like that? I thought that made it harder for translators. I agree that your language is easier to understand, and have used it in this next version of the patch. Many of your comments that follow raise the same issue, but I'm using your verbiage anyway. > + > psprintf(_("old-style VACUUM FULL transaction ID is in the future: > %u"), > + > psprintf(_("old-style VACUUM FULL transaction ID precedes freeze > threshold: %u"), > + > psprintf(_("old-style VACUUM FULL transaction ID is invalid in this > relation: %u"), > > old-style VACUUM FULL transaction ID %u is in the future > old-style VACUUM FULL transaction ID %u precedes freeze threshold %u > old-style VACUUM FULL transaction ID %u out of range %u..%u > > Doesn't the second of these overlap with the third? Good point. If the second one reports, so will the third. I've changed it to use if/else if logic to avoid that, and to use your suggested verbiage. > > Similarly in other places, e.g. > > + > psprintf(_("inserting transaction ID is in the future: %u"), > > I think this should change to: inserting transaction ID %u is in the future Changed, along with similarly formatted messages. > > + else if (VARATT_IS_SHORT(chunk)) > + /* > + * could happen due to heap_form_tuple doing its thing > + */ > + chunksize = VARSIZE_SHORT(chunk) - VARHDRSZ_SHORT; > > Add braces here, since there are multiple lines. Changed. > > + psprintf(_("toast > chunk sequence number not the expected sequence number: %u vs. %u"), > > toast chunk sequence number %u does not match expected sequence number %u > > There are more instances of this kind of thing. Changed. > + > psprintf(_("toasted attribute has unexpected TOAST tag: %u"), > > Remove colon. Changed. > + > psprintf(_("attribute ends at offset beyond total tuple length: %u vs. %u (attribute length %u)"), > > Let's try to specify the attribute number in the attribute messages > where we can, e.g. > > + > psprintf(_("attribute ends at offset beyond total tuple length: %u vs. > %u (attribute length %u)"), > > How about: attribute %u with length %u should end at offset %u, but > the tuple length is only %u I had omitted the attribute numbers from the attribute corruption messages because attnum is one of the OUT parameters from verify_heapam. I'm including attnum in the message text for this next version, as you request. > + if (TransactionIdIsNormal(ctx->relfrozenxid) && > + TransactionIdPrecedes(xmin, ctx->relfrozenxid)) > + { > + report_corruption(ctx, > + /* > translator: Both %u are transaction IDs. */ > + psprintf(_("inserting transaction ID is from before freeze cutoff: %u > vs. 
%u"), > + > xmin, ctx->relfrozenxid)); > + fatal = true; > + } > + else if (!xid_valid_in_rel(xmin, ctx)) > + { > + report_corruption(ctx, > + /* > translator: %u is a transaction ID. */ > + > psprintf(_("inserting transaction ID is in the future: %u"), > + > xmin)); > + fatal = true; > + } > > This seems like good evidence that xid_valid_in_rel needs some > rethinking. As far as I can see, every place where you call > xid_valid_in_rel, you have checks beforehand that duplicate some of > what it does, so that you can give a more accurate error message. > That's not good. Either the message should be adjusted so that it > covers all the cases "e.g. tuple xmin %u is outside acceptable range > %u..%u" or we should just get rid of xid_valid_in_rel() and have > separate error messages for each case, e.g. tuple xmin %u precedes > relfrozenxid %u". This next version is refactored, removing the function xid_valid_in_rel entirely, and structuring get_xid_status differently. > I think it's OK to use terms like xmin and xmax in > these messages, rather than inserting transaction ID etc. We have > existing instances of that, and while someone might judge it > user-unfriendly, I disagree. A person who is qualified to interpret > this output must know what 'tuplex min' means immediately, but whether > they can understand that 'inserting transaction ID' means the same > thing is questionable, I think. Done. > This is not a full review, but in general I think that this is getting > pretty close to being committable. The error messages seem to still > need some polishing and I wouldn't be surprised if there are a few > more bugs lurking yet, but I think it's come a long way. This next version has some other message rewording. While testing, I found it odd to report an xid as out of bounds (inthe future, or before the freeze threshold, etc.), without mentioning the xid value against which it is being comparedunfavorably. We don't normally need to think about the epoch when comparing two xids against each other, as theymust both make sense relative to the current epoch; but for corruption, you can't assume the corrupt xid was writtenrelative to any particular epoch, and only the 32-bit xid value can be printed since the epoch is unknown. The otherxid value (freeze threshold, etc) can be printed with the epoch information, but printing the epoch+xid merely as xid8outdoes (in other words, as a UINT64) makes the messages thoroughly confusing. I went with the equivalent of sprintf("%u:%u",epoch, xid), which follows the precedent from pg_controldata.c, gistdesc.c, and elsewhere. Moving on to Peter's reviews.... > On Sep 22, 2020, at 4:18 PM, Peter Geoghegan <pg@bowt.ie> wrote: > > On Mon, Sep 21, 2020 at 2:09 PM Robert Haas <robertmhaas@gmail.com> wrote: >> +REVOKE ALL ON FUNCTION >> +verify_heapam(regclass, boolean, boolean, cstring, bigint, bigint) >> +FROM PUBLIC; >> >> This too. > > Do we really want to use a cstring as an enum-like argument? Perhaps not. This next version has that as text. 
> > I think that I see a bug at this point in check_tuple() (in > v15-0001-Adding-function-verify_heapam-to-amcheck-module.patch): > >> + /* If xmax is a multixact, it should be within valid range */ >> + xmax = HeapTupleHeaderGetRawXmax(ctx->tuphdr); >> + if ((infomask & HEAP_XMAX_IS_MULTI) && !mxid_valid_in_rel(xmax, ctx)) >> + { > > *** SNIP *** > >> + } >> + >> + /* If xmax is normal, it should be within valid range */ >> + if (TransactionIdIsNormal(xmax)) >> + { > > Why should it be okay to call TransactionIdIsNormal(xmax) at this > point? It isn't certain that xmax is an XID at all (could be a > MultiXactId, since you called HeapTupleHeaderGetRawXmax() to get the > value in the first place). Don't you need to check "(infomask & > HEAP_XMAX_IS_MULTI) == 0" here? I think you are right. This check you suggest is used in this next version. > On Sep 22, 2020, at 5:16 PM, Peter Geoghegan <pg@bowt.ie> wrote: > > On Sat, Aug 29, 2020 at 10:48 AM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >> I had an earlier version of the verify_heapam patch that included a non-throwing interface to clog. Ultimately, I ripped that out. My reasoning was that a simpler patch submission was more likely to be acceptable to the community. > > Isn't some kind of pragmatic compromise possible? > >> But I don't want to make this patch dependent on that hypothetical patch getting written and accepted. > > Fair enough, but if you're alluding to what I said then about > check_tuphdr_xids()/clog checking a while back then FWIW I didn't > intend to block progress on clog/xact status verification at all. I don't recall your comments factoring into my thinking on this specific issue, but rather a conversation I had off-list with Robert. The clog interface may be a hot enough code path that adding a flag for non-throwing behavior merely to support a contrib module might be resisted. If folks generally like such a change to the clog interface, I could consider adding that as a third patch in this set. > I > just don't think that it is sensible to impose an iron clad guarantee > about having no assertion failures with corrupt clog data -- that > leads to far too much code duplication. But why should you need to > provide an absolute guarantee of that? > > I for one would be fine with making the clog checks an optional extra, > that rescinds the no crash guarantee that you're keen on -- just like > with the TOAST checks that you have already in v15. It might make > sense to review how often crashes occur with simulated corruption, and > then to minimize the number of occurrences in the real world. Maybe we > could tolerate a usually-no-crash interface to clog -- if it could > still have assertion failures. Making a strong guarantee about > assertions seems unnecessary. > > I don't see how verify_heapam will avoid raising an error during basic > validation from PageIsVerified(), which will violate the guarantee > about not throwing errors. I don't see that as a problem myself, but > presumably you will. My concern is not so much that verify_heapam will stop with an error, but rather that it might trigger a panic that stops all backends. Stopping with an error merely because it hits corruption is not ideal, as I would rather it completed the scan and reported all corruptions found, but that's minor compared to the damage done if verify_heapam creates downtime in a production environment offering high availability guarantees. 
That statement might seem nuts, given that the corrupt table itself would be causing downtime, but that analysis depends on assumptions about table access patterns, and there is no a priori reason to think that corrupt pages are necessarily ever being accessed, or accessed in a way that causes crashes (rather than merely wrong results) outside verify_heapam scanning the whole table. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
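The epoch-qualified formatting described a few messages up (the "%u:%u" style following the pg_controldata precedent) can be sketched as below, assuming the FullTransactionId macros from access/transam.h; the variables are illustrative rather than taken from the patch:

    /* Sketch of epoch-qualified XID formatting */
    FullTransactionId next_fxid = ReadNextFullTransactionId();
    char       *s = psprintf("%u:%u",
                             EpochFromFullTransactionId(next_fxid),
                             XidFromFullTransactionId(next_fxid));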
> On Oct 5, 2020, at 5:24 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > - This version does not change clog handling, which leaves Andrey's concern unaddressed. Peter also showed some support for (or perhaps just a lack of opposition to) doing more of what Andrey suggests. I may come back to this issue, depending on time available and further feedback. Attached is a patch set that includes the clog handling as discussed. The 0001 and 0002 are effectively unchanged since version 16 posted yesterday, but this now includes 0003 which creates a non-throwing interface to clog, and 0004 which uses the non-throwing interface from within amcheck's heap checking functions. I think this is a pretty good sketch for discussion, though I am unsatisfied with the lack of regression test coverage of verify_heapam in the presence of clog truncation. I was hoping to have that as part of v17, but since it is taking a bit longer than I anticipated, I'll have to come back with that in a later patch. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
> On Oct 7, 2020, at 4:20 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > > >> On Oct 5, 2020, at 5:24 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: >> >> - This version does not change clog handling, which leaves Andrey's concern unaddressed. Peter also showed some support for (or perhaps just a lack of opposition to) doing more of what Andrey suggests. I may come back to this issue, depending on time available and further feedback. > > Attached is a patch set that includes the clog handling as discussed. The 0001 and 0002 are effectively unchanged since version 16 posted yesterday, but this now includes 0003 which creates a non-throwing interface to clog, and 0004 which uses the non-throwing interface from within amcheck's heap checking functions. > > I think this is a pretty good sketch for discussion, though I am unsatisfied with the lack of regression test coverage of verify_heapam in the presence of clog truncation. I was hoping to have that as part of v17, but since it is taking a bit longer than I anticipated, I'll have to come back with that in a later patch. > Many thanks, Mark! I really appreciate this functionality. It could save me many hours of recreating clogs. I'm not entirely sure this message is correct: psprintf(_("xmax %u commit status is lost") It seems to me to be not commit status, but rather transaction status. Thanks! Best regards, Andrey Borodin.
> On Oct 6, 2020, at 11:27 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote: > > > >> On Oct 7, 2020, at 4:20 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: >> >> >> >>> On Oct 5, 2020, at 5:24 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: >>> >>> - This version does not change clog handling, which leaves Andrey's concern unaddressed. Peter also showed some support for (or perhaps just a lack of opposition to) doing more of what Andrey suggests. I may come back to this issue, depending on time available and further feedback. >> >> Attached is a patch set that includes the clog handling as discussed. The 0001 and 0002 are effectively unchanged since version 16 posted yesterday, but this now includes 0003 which creates a non-throwing interface to clog, and 0004 which uses the non-throwing interface from within amcheck's heap checking functions. >> >> I think this is a pretty good sketch for discussion, though I am unsatisfied with the lack of regression test coverage of verify_heapam in the presence of clog truncation. I was hoping to have that as part of v17, but since it is taking a bit longer than I anticipated, I'll have to come back with that in a later patch. >> > > Many thanks, Mark! I really appreciate this functionality. It could save me many hours of recreating clogs. You are quite welcome, though the thanks may be premature. I posted 0003 and 0004 patches mostly as concrete implementation examples that can be criticized. > I'm not entirely sure this message is correct: psprintf(_("xmax %u commit status is lost") > It seems to me to be not commit status, but rather transaction status. I have changed several such messages to say "transaction status" rather than "commit status". I'll be posting it in a separate email, shortly. Thanks for reviewing! — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Oct 5, 2020, at 5:24 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > There remain a few open issues and/or things I did not implement: > > - This version follows Robert's suggestion of using pg_class_aclcheck() to check that the caller has permission to select from the table being checked. This is inconsistent with the btree checking logic, which does no such check. These two approaches should be reconciled, but there was apparently no agreement on this issue. This next version, attached, has the acl checking and associated documentation changes split out into patch 0005, making it easier to review in isolation from the rest of the patch series. Independently of acl considerations, this version also has some verbiage changes in 0004, in response to Andrey's review upthread. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Mon, Oct 5, 2020 at 5:24 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > I don't see how verify_heapam will avoid raising an error during basic > > validation from PageIsVerified(), which will violate the guarantee > > about not throwing errors. I don't see that as a problem myself, but > > presumably you will. > > My concern is not so much that verify_heapam will stop with an error, but rather that it might trigger a panic that stops all backends. Stopping with an error merely because it hits corruption is not ideal, as I would rather it completed the scan and reported all corruptions found, but that's minor compared to the damage done if verify_heapam creates downtime in a production environment offering high availability guarantees. That statement might seem nuts, given that the corrupt table itself would be causing downtime, but that analysis depends on assumptions about table access patterns, and there is no a priori reason to think that corrupt pages are necessarily ever being accessed, or accessed in a way that causes crashes (rather than merely wrong results) outside verify_heapam scanning the whole table. That seems reasonable to me. I think that it makes sense to never take down the server in a non-debug build with verify_heapam. That's not what I took away from your previous remarks on the issue, but perhaps it doesn't matter now. -- Peter Geoghegan
On Wed, Oct 7, 2020 at 9:01 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > This next version, attached, has the acl checking and associated documentation changes split out into patch 0005, makingit easier to review in isolation from the rest of the patch series. > > Independently of acl considerations, this version also has some verbiage changes in 0004, in response to Andrey's reviewupthread. I was about to commit 0001, after making some cosmetic changes, when I discovered that it won't link for me. I think there must be something wrong with the NLS stuff. My version of 0001 is attached. The error I got is: ccache clang -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -Wno-unused-command-line-argument -g -O2 -Wall -Werror -fno-omit-frame-pointer -bundle -multiply_defined suppress -o amcheck.so verify_heapam.o verify_nbtree.o -L../../src/port -L../../src/common -L/opt/local/lib -L/opt/local/lib -L/opt/local/lib -L/opt/local/lib -L/opt/local/lib -Wl,-dead_strip_dylibs -Wall -Werror -fno-omit-frame-pointer -bundle_loader ../../src/backend/postgres Undefined symbols for architecture x86_64: "_libintl_gettext", referenced from: _verify_heapam in verify_heapam.o _check_tuple in verify_heapam.o ld: symbol(s) not found for architecture x86_64 clang: error: linker command failed with exit code 1 (use -v to see invocation) make: *** [amcheck.so] Error 1 -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On 2020-Oct-21, Robert Haas wrote: > On Wed, Oct 7, 2020 at 9:01 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > This next version, attached, has the acl checking and associated documentation changes split out into patch 0005, makingit easier to review in isolation from the rest of the patch series. > > > > Independently of acl considerations, this version also has some verbiage changes in 0004, in response to Andrey's reviewupthread. > > I was about to commit 0001, after making some cosmetic changes, when I > discovered that it won't link for me. I think there must be something > wrong with the NLS stuff. My version of 0001 is attached. The error I > got is: Hmm ... I don't think we have translation support in contrib, do we? I think you could solve that by adding a "#undef _, #define _(...) (...)" or similar at the top of the offending C files, assuming you don't want to rip out all use of _() there. TBH the usage of "translation:" comments in this patch seems over-enthusiastic to me.
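A minimal form of the workaround Alvaro describes, placed at the top of the affected C file, would be something like the sketch below (the alternative taken downthread is to remove the _() calls entirely):

    /* Make _() a no-op in a file that has no translation support */
    #undef _
    #define _(x) (x)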
> On Oct 21, 2020, at 1:13 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > > On 2020-Oct-21, Robert Haas wrote: > >> On Wed, Oct 7, 2020 at 9:01 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: >>> This next version, attached, has the acl checking and associated documentation changes split out into patch 0005, making it easier to review in isolation from the rest of the patch series. >>> >>> Independently of acl considerations, this version also has some verbiage changes in 0004, in response to Andrey's review upthread. >> >> I was about to commit 0001, after making some cosmetic changes, when I >> discovered that it won't link for me. I think there must be something >> wrong with the NLS stuff. My version of 0001 is attached. The error I >> got is: > > Hmm ... I don't think we have translation support in contrib, do we? I > think you could solve that by adding a "#undef _, #define _(...) (...)" > or similar at the top of the offending C files, assuming you don't want > to rip out all use of _() there. There is still something screwy here, though, as this compiles, links and runs fine for me on mac and linux, but not for Robert. On mac, I'm using the toolchain from XCode, whereas Robert is using MacPorts. Mine reports: Apple clang version 11.0.0 (clang-1100.0.33.17) Target: x86_64-apple-darwin19.6.0 Thread model: posix InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin Robert's reports: clang version 5.0.2 (tags/RELEASE_502/final) Target: x86_64-apple-darwin19.4.0 Thread model: posix InstalledDir: /opt/local/libexec/llvm-5.0/bin On linux, I'm using gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36) Searching around on the web, there are various reports of MacPort's clang not linking libintl correctly, though I don't know if that is a real problem with MacPorts or just a few cases of user error. Has anybody else following this thread had issues with MacPort's version of clang vis-a-vis linking libintl's gettext? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > I was about to commit 0001, after making some cosmetic changes, when I > discovered that it won't link for me. I think there must be something > wrong with the NLS stuff. My version of 0001 is attached. The error I > got is: Well, the short answer would be "you need to add SHLIB_LINK += $(filter -lintl, $(LIBS)) to the Makefile". However, I would vote against that, because in point of fact amcheck has no translation support, just like all our other contrib modules. What should likely happen instead is to rip out whatever code is overoptimistically expecting it needs to support translation. regards, tom lane
Mark Dilger <mark.dilger@enterprisedb.com> writes: > There is still something screwy here, though, as this compiles, links and runs fine for me on mac and linux, but not for Robert. Are you using --enable-nls at all on your Mac build? Because for sure it should not work there, given the failure to include -lintl in amcheck's link step. Some platforms are forgiving of that, but not Mac. regards, tom lane
> On Oct 21, 2020, at 1:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Mark Dilger <mark.dilger@enterprisedb.com> writes: >> There is still something screwy here, though, as this compiles, links and runs fine for me on mac and linux, but not for Robert. > > Are you using --enable-nls at all on your Mac build? Because for sure it > should not work there, given the failure to include -lintl in amcheck's > link step. Some platforms are forgiving of that, but not Mac. Thanks, Tom! No, that's the answer. I had a typo/thinko in my configure options, --with-nls instead of --enable-nls, and the warning about it being an invalid flag went by so fast I didn't see it. I had it spelled correctly on linux, but I guess that's one of the platforms that is more forgiving. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Oct 21, 2020, at 1:47 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: >> I was about to commit 0001, after making some cosmetic changes, when I >> discovered that it won't link for me. I think there must be something >> wrong with the NLS stuff. My version of 0001 is attached. The error I >> got is: > > Well, the short answer would be "you need to add > > SHLIB_LINK += $(filter -lintl, $(LIBS)) > > to the Makefile". However, I would vote against that, because in point > of fact amcheck has no translation support, just like all our other > contrib modules. What should likely happen instead is to rip out > whatever code is overoptimistically expecting it needs to support > translation. Done that way in the attached, which also includes Robert's changes from v19 he posted earlier today. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Wed, Oct 21, 2020 at 11:45 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > Done that way in the attached, which also include Robert's changes from v19 he posted earlier today. Committed. Let's see what the buildfarm thinks. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Oct 22, 2020 at 8:51 AM Robert Haas <robertmhaas@gmail.com> wrote: > Committed. Let's see what the buildfarm thinks. It is mostly happy, but thorntail is not: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2020-10-22%2012%3A58%3A11 I thought that the problem might be related to the fact that thorntail is using force_parallel_mode, but I tried that here and it did not cause a failure. So my next guess is that it is related to the fact that this is a sparc64 machine, but it's hard to tell, since none of the other sparc64 critters have run yet. In any case I don't know why that would cause a failure. The messages in the log aren't very illuminating, unfortunately. :-( Mark, any ideas what might cause specifically that set of tests to fail? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > The messages in the log aren't very > illuminating, unfortunately. :-( Considering this is a TAP test, why in the world is it designed to hide all details of any unexpected amcheck messages? Surely being able to see what amcheck is saying would be helpful here. IOW, don't have the tests abbreviate the module output with count(*), but return the full thing, and then use a regex to see if you got what was expected. If you didn't, the output will show what you did get. regards, tom lane
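For illustration, a minimal sketch of the kind of TAP check Tom is describing; the helper name, table name, and expected message text here are hypothetical rather than taken from the committed test:

    use strict;
    use warnings;
    use Test::More;

    # Hypothetical helper: run verify_heapam() on $relname via the given
    # PostgresNode and match the full message text against $expected, so a
    # failure shows exactly what amcheck said instead of just a count.
    sub check_corruption_reported
    {
        my ($node, $relname, $expected) = @_;
        my $output = $node->safe_psql('postgres',
            qq(SELECT msg FROM verify_heapam('$relname')));
        like($output, $expected,
            "verify_heapam reports expected corruption for $relname");
    }

    # Usage, assuming the surrounding test has already corrupted the table:
    # check_corruption_reported($node, 'test',
    #     qr/line pointer redirection to item at offset \d+ exceeds maximum offset \d+/);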
On Thu, Oct 22, 2020 at 10:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Considering this is a TAP test, why in the world is it designed to hide > all details of any unexpected amcheck messages? Surely being able to > see what amcheck is saying would be helpful here. > > IOW, don't have the tests abbreviate the module output with count(*), > but return the full thing, and then use a regex to see if you got what > was expected. If you didn't, the output will show what you did get. Yeah, that thought crossed my mind, too. But I'm not sure it would help in the case of this particular failure, because I think the problem is that we're expecting to get complaints and instead getting none. It might be good to change it anyway, though. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
lapwing just spit up a possibly relevant issue: ccache gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 -Werror -fPIC -I. -I. -I../../src/include -DENFORCE_REGRESSION_TEST_NAME_RESTRICTIONS -D_GNU_SOURCE -I/usr/include/libxml2 -I/usr/include/et -c -o verify_heapam.o verify_heapam.c verify_heapam.c: In function 'get_xid_status': verify_heapam.c:1432:5: error: 'fxid.value' may be used uninitialized in this function [-Werror=maybe-uninitialized] cc1: all warnings being treated as errors
> On Oct 22, 2020, at 7:06 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Oct 22, 2020 at 8:51 AM Robert Haas <robertmhaas@gmail.com> wrote: >> Committed. Let's see what the buildfarm thinks. > > It is mostly happy, but thorntail is not: > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2020-10-22%2012%3A58%3A11 > > I thought that the problem might be related to the fact that thorntail > is using force_parallel_mode, but I tried that here and it did not > cause a failure. So my next guess is that it is related to the fact > that this is a sparc64 machine, but it's hard to tell, since none of > the other sparc64 critters have run yet. In any case I don't know why > that would cause a failure. The messages in the log aren't very > illuminating, unfortunately. :-( > > Mark, any ideas what might cause specifically that set of tests to fail? The code is correctly handling an uncorrupted table, but then more or less randomly failing some of the time when processing a corrupt table. Tom identified a problem with an uninitialized variable. I'm putting together a new patch set to address it. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Oct 22, 2020, at 9:01 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > > >> On Oct 22, 2020, at 7:06 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Thu, Oct 22, 2020 at 8:51 AM Robert Haas <robertmhaas@gmail.com> wrote: >>> Committed. Let's see what the buildfarm thinks. >> >> It is mostly happy, but thorntail is not: >> >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2020-10-22%2012%3A58%3A11 >> >> I thought that the problem might be related to the fact that thorntail >> is using force_parallel_mode, but I tried that here and it did not >> cause a failure. So my next guess is that it is related to the fact >> that this is a sparc64 machine, but it's hard to tell, since none of >> the other sparc64 critters have run yet. In any case I don't know why >> that would cause a failure. The messages in the log aren't very >> illuminating, unfortunately. :-( >> >> Mark, any ideas what might cause specifically that set of tests to fail? > > The code is correctly handling an uncorrupted table, but then more or less randomly failing some of the time when processing a corrupt table. > > Tom identified a problem with an uninitialized variable. I'm putting together a new patch set to address it. The 0001 attached patch addresses the -Werror=maybe-uninitialized problem. The 0002 attached patch addresses the test failures: The failing test is designed to stop the server, create blunt force trauma to the heap and toast files through overwriting garbage bytes, restart the server, and verify that corruption is detected by amcheck's verify_heapam(). The exact trauma is intended to be the same on all platforms, in terms of the number of bytes written and the location in the file that it gets written, but owing to differences between platforms, by design the test does not expect a particular corruption message. The test was overwriting far fewer bytes than I had intended, but since it was still sufficient to create corruption on the platforms where I tested, I failed to notice. It should do a more thorough job now. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
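For readers following along, here is a condensed sketch of the test flow described above, using the PostgresNode-era TAP API; the table name, byte offset, and amount of garbage written are illustrative assumptions, not the committed test's exact values:

    use strict;
    use warnings;
    use PostgresNode;
    use Test::More tests => 1;

    my $node = get_new_node('main');
    $node->init;
    $node->start;
    $node->safe_psql('postgres', q(
        CREATE EXTENSION amcheck;
        CREATE TABLE test (a integer);
        INSERT INTO test SELECT generate_series(1, 100);
    ));
    my $relpath = join('/', $node->data_dir,
        $node->safe_psql('postgres', q(SELECT pg_relation_filepath('test'))));

    # Stop the server, scribble garbage over the start of the line pointer
    # array, and restart.
    $node->stop;
    open(my $fh, '+<', $relpath) or die "open: $!";
    binmode($fh);
    sysseek($fh, 32, 0) or die "sysseek: $!";
    syswrite($fh, "\x77" x 2000) or die "syswrite: $!";
    close($fh);
    $node->start;

    # The damage should now show up as corruption reports.
    my $count = $node->safe_psql('postgres',
        q(SELECT count(*) FROM verify_heapam('test')));
    cmp_ok($count, '>', 0, 'verify_heapam reports corruption');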
On Thu, Oct 22, 2020 at 3:15 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > The 0001 attached patch addresses the -Werror=maybe-uninitialized problem. I am skeptical. Why so much code churn to fix a compiler warning? And even in the revised code, *status isn't set in all cases, so I don't see why this would satisfy the compiler. Even if it satisfies this particular compiler for some other reason, some other compiler is bound to be unhappy sometime. It's better to just arrange to set *status always, and use a dummy value in cases where it doesn't matter. Also, "return XID_BOUNDS_OK;;" has exceeded its recommended allowance of semicolons. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Oct 22, 2020, at 1:00 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Oct 22, 2020 at 3:15 PM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >> The 0001 attached patch addresses the -Werror=maybe-uninitialized problem. > > I am skeptical. Why so much code churn to fix a compiler warning? And > even in the revised code, *status isn't set in all cases, so I don't > see why this would satisfy the compiler. Even if it satisfies this > particular compiler for some other reason, some other compiler is > bound to be unhappy sometime. It's better to just arrange to set > *status always, and use a dummy value in cases where it doesn't > matter. Also, "return XID_BOUNDS_OK;;" has exceeded its recommended > allowance of semicolons. I think the compiler warning was about fxid not being set. The callers pass NULL for status if they don't want status checked, so writing *status unconditionally would be an error. Also, if the xid being checked is out of bounds, we can't check the status of the xid in clog. As for the code churn, I probably refactored it a bit more than I needed to fix the compiler warning about fxid, but that was because the old arrangement seemed to make it harder to reason about when and where fxid got set. I think that is more clear now. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
ooh, looks like prairiedog sees the problem too. That means I should be able to reproduce it under a debugger, if you're not certain yet where the problem lies. regards, tom lane
... btw, having now looked more closely at get_xid_status(), I wonder how come there aren't more compilers bitching about it, because it is very very obviously broken. In particular, the case of requesting status for an xid that is BootstrapTransactionId or FrozenTransactionId *will* fall through to perform FullTransactionIdPrecedesOrEquals with an uninitialized fxid. The fact that most compilers seem to fail to notice that is quite scary. I suppose it has something to do with FullTransactionId being a struct, which makes me wonder if that choice was quite as wise as we thought. Meanwhile, so far as this code goes, I wonder why you don't just change it to always set that value, ie XidBoundsViolation result; FullTransactionId fxid; FullTransactionId clog_horizon; + fxid = FullTransactionIdFromXidAndCtx(xid, ctx); + /* Quick check for special xids */ if (!TransactionIdIsValid(xid)) result = XID_INVALID; else if (xid == BootstrapTransactionId || xid == FrozenTransactionId) result = XID_BOUNDS_OK; else { /* Check if the xid is within bounds */ - fxid = FullTransactionIdFromXidAndCtx(xid, ctx); if (!fxid_in_cached_range(fxid, ctx)) { regards, tom lane
> On Oct 22, 2020, at 1:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > ooh, looks like prairiedog sees the problem too. That means I should be > able to reproduce it under a debugger, if you're not certain yet where > the problem lies. Thanks, Tom, but I question whether the regression test failures are from a problem in the verify_heapam.c code. I think they are a busted perl test. The test was supposed to corrupt the heap by overwriting a heap file with a large chunk of garbage, but in fact only wrote a small amount of garbage. The idea was to write about 2000 bytes starting at offset 32 in the page, in order to corrupt the line pointers, but owing to my incorrect use of syswrite in the perl test, that didn't happen. I think the uninitialized variable warning is warning about a real problem in the c-code, but I have no reason to think that particular problem is causing this particular regression test failure. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Oct 22, 2020, at 1:23 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > ... btw, having now looked more closely at get_xid_status(), I wonder > how come there aren't more compilers bitching about it, because it > is very very obviously broken. In particular, the case of > requesting status for an xid that is BootstrapTransactionId or > FrozenTransactionId *will* fall through to perform > FullTransactionIdPrecedesOrEquals with an uninitialized fxid. > > The fact that most compilers seem to fail to notice that is quite scary. > I suppose it has something to do with FullTransactionId being a struct, > which makes me wonder if that choice was quite as wise as we thought. > > Meanwhile, so far as this code goes, I wonder why you don't just change it > to always set that value, ie > > XidBoundsViolation result; > FullTransactionId fxid; > FullTransactionId clog_horizon; > > + fxid = FullTransactionIdFromXidAndCtx(xid, ctx); > + > /* Quick check for special xids */ > if (!TransactionIdIsValid(xid)) > result = XID_INVALID; > else if (xid == BootstrapTransactionId || xid == FrozenTransactionId) > result = XID_BOUNDS_OK; > else > { > /* Check if the xid is within bounds */ > - fxid = FullTransactionIdFromXidAndCtx(xid, ctx); > if (!fxid_in_cached_range(fxid, ctx)) > { Yeah, I reached the same conclusion before submitting the fix upthread. I structured it a bit differently, but I believe fxid will now always get set before being used, though sometimes the function returns before doing either. I had the same thought about compilers not catching that, too. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes: >> On Oct 22, 2020, at 1:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> ooh, looks like prairiedog sees the problem too. That means I should be >> able to reproduce it under a debugger, if you're not certain yet where >> the problem lies. > Thanks, Tom, but I question whether the regression test failures are from a problem in the verify_heapam.c code. I think they are a busted perl test. The test was supposed to corrupt the heap by overwriting a heap file with a large chunk of garbage, but in fact only wrote a small amount of garbage. The idea was to write about 2000 bytes starting at offset 32 in the page, in order to corrupt the line pointers, but owing to my incorrect use of syswrite in the perl test, that didn't happen. Hm, but why are we seeing the failure only on specific machine architectures? sparc64 and ppc32 is a weird pairing, too. regards, tom lane
> On Oct 22, 2020, at 1:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Mark Dilger <mark.dilger@enterprisedb.com> writes: >>> On Oct 22, 2020, at 1:09 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >>> ooh, looks like prairiedog sees the problem too. That means I should be >>> able to reproduce it under a debugger, if you're not certain yet where >>> the problem lies. > >> Thanks, Tom, but I question whether the regression test failures are from a problem in the verify_heapam.c code. I think they are a busted perl test. The test was supposed to corrupt the heap by overwriting a heap file with a large chunk of garbage, but in fact only wrote a small amount of garbage. The idea was to write about 2000 bytes starting at offset 32 in the page, in order to corrupt the line pointers, but owing to my incorrect use of syswrite in the perl test, that didn't happen. > > Hm, but why are we seeing the failure only on specific machine > architectures? sparc64 and ppc32 is a weird pairing, too. It is seeking to position 32 and writing '\x77\x77\x77\x77'. x86_64 is little-endian, and ppc32 and sparc64 are both big-endian, right? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes: >> On Oct 22, 2020, at 1:31 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Hm, but why are we seeing the failure only on specific machine >> architectures? sparc64 and ppc32 is a weird pairing, too. > It is seeking to position 32 and writing '\x77\x77\x77\x77'. x86_64 is > little-endian, and ppc32 and sparc64 are both big-endian, right? They are, but that should not meaningfully affect the results of that corruption step. You zapped only one line pointer not several, but it would look the same regardless of endianness. I find it more plausible that we might see the bad effects of the uninitialized variable only on those arches --- but that theory is still pretty shaky, since you'd think compiler choices about register or stack-location assignment would be the controlling factor, and those should be all over the map. regards, tom lane
I wrote: > Mark Dilger <mark.dilger@enterprisedb.com> writes: >> It is seeking to position 32 and writing '\x77\x77\x77\x77'. x86_64 is >> little-endian, and ppc32 and sparc64 are both big-endian, right? > They are, but that should not meaningfully affect the results of > that corruption step. You zapped only one line pointer not > several, but it would look the same regardless of endianness. Oh, wait a second. ItemIdData has the flag bits in the middle: typedef struct ItemIdData { unsigned lp_off:15, /* offset to tuple (from start of page) */ lp_flags:2, /* state of line pointer, see below */ lp_len:15; /* byte length of tuple */ } ItemIdData; meaning that for that particular bit pattern, one endianness is going to see the flags as 01 (LP_NORMAL) and the other as 10 (LP_REDIRECT). The offset/len are corrupt either way, but I'd certainly expect that amcheck would produce different complaints about those two cases. So it's unsurprising if this test case's output is endian-dependent. regards, tom lane
> On Oct 22, 2020, at 2:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I wrote: >> Mark Dilger <mark.dilger@enterprisedb.com> writes: >>> It is seeking to position 32 and writing '\x77\x77\x77\x77'. x86_64 is >>> little-endian, and ppc32 and sparc64 are both big-endian, right? > >> They are, but that should not meaningfully affect the results of >> that corruption step. You zapped only one line pointer not >> several, but it would look the same regardless of endianness. > > Oh, wait a second. ItemIdData has the flag bits in the middle: > > typedef struct ItemIdData > { > unsigned lp_off:15, /* offset to tuple (from start of page) */ > lp_flags:2, /* state of line pointer, see below */ > lp_len:15; /* byte length of tuple */ > } ItemIdData; > > meaning that for that particular bit pattern, one endianness > is going to see the flags as 01 (LP_NORMAL) and the other as 10 > (LP_REDIRECT). The offset/len are corrupt either way, but > I'd certainly expect that amcheck would produce different > complaints about those two cases. So it's unsurprising if > this test case's output is endian-dependent. Yeah, I'm already looking at that. The logic in verify_heapam skips over line pointers that are unused or dead, and the test is reporting zero corruption (and complaining about that), so it's probably not going to help to overwrite all the line pointers with this particular bit pattern any more than to just overwrite the first one, as it would just skip them all. I think the test should overwrite the line pointers with a variety of different bit patterns, or one calculated to work on all platforms. I'll have to write that up. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Oct 22, 2020, at 2:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I wrote: >> Mark Dilger <mark.dilger@enterprisedb.com> writes: >>> It is seeking to position 32 and writing '\x77\x77\x77\x77'. x86_64 is >>> little-endian, and ppc32 and sparc64 are both big-endian, right? > >> They are, but that should not meaningfully affect the results of >> that corruption step. You zapped only one line pointer not >> several, but it would look the same regardless of endianness. > > Oh, wait a second. ItemIdData has the flag bits in the middle: > > typedef struct ItemIdData > { > unsigned lp_off:15, /* offset to tuple (from start of page) */ > lp_flags:2, /* state of line pointer, see below */ > lp_len:15; /* byte length of tuple */ > } ItemIdData; > > meaning that for that particular bit pattern, one endianness > is going to see the flags as 01 (LP_NORMAL) and the other as 10 > (LP_REDIRECT). The offset/len are corrupt either way, but > I'd certainly expect that amcheck would produce different > complaints about those two cases. So it's unsurprising if > this test case's output is endian-dependent. Well, the issue is that on big-endian machines it is not reporting any corruption at all. Are you sure the difference will be LP_NORMAL vs LP_REDIRECT? I was thinking it was LP_DEAD vs LP_REDIRECT, as the little endian platforms are seeing corruption messages about bad redirect line pointers, and the big-endian are apparently skipping over the line pointer entirely, which makes sense if it is LP_DEAD but not if it is LP_NORMAL. It would also skip over LP_UNUSED, but I don't see how that could be stored in lp_flags, because 0x77 is going to either be 01110111 or 11101110, and in neither case do you get two zeros adjacent, but you could get two ones adjacent. (LP_UNUSED = binary 00 and LP_DEAD = binary 11) — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes: > Yeah, I'm already looking at that. The logic in verify_heapam skips over line pointers that are unused or dead, and the test is reporting zero corruption (and complaining about that), so it's probably not going to help to overwrite all the line pointers with this particular bit pattern any more than to just overwrite the first one, as it would just skip them all. > I think the test should overwrite the line pointers with a variety of different bit patterns, or one calculated to work on all platforms. I'll have to write that up. What we need here is to produce the same test results on either endianness. So probably the thing to do is apply the equivalent of ntohl() to produce a string that looks right for either host endianness. As a separate matter, you'd want to test corruption producing any of the four flag bit patterns, probably. It says here you can use Perl's pack/unpack functions to get the equivalent of ntohl(), but I've not troubled to work out how. regards, tom lane
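For what it's worth, a small sketch of the pack/unpack idea Tom mentions; this is illustrative only and not code from the patch:

    use strict;
    use warnings;

    # Equivalent of ntohl(): reinterpret a 32-bit value given in network
    # (big-endian) byte order as a host-order integer.
    sub ntohl { return unpack('L', pack('N', $_[0])); }

    # To write a chosen 32-bit pattern so that the backend reads back the
    # same integer value on either endianness, emit it in the host's native
    # byte order rather than as a fixed byte string, e.g.:
    #     syswrite($fh, pack('L', 0x55555555), 4);
    printf "0x%08x\n", ntohl(0x12345678);   # 0x12345678 on big-endian, 0x78563412 on little-endian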
On Thu, Oct 22, 2020 at 5:51 AM Robert Haas <robertmhaas@gmail.com> wrote: > Committed. Let's see what the buildfarm thinks. This is great work. Thanks Mark and Robert. -- Peter Geoghegan
> On Oct 22, 2020, at 2:26 PM, Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Oct 22, 2020 at 5:51 AM Robert Haas <robertmhaas@gmail.com> wrote: >> Committed. Let's see what the buildfarm thinks. > > This is great work. Thanks Mark and Robert. That's the first time I've laughed today. Having turned the build-farm red, this is quite ironic feedback! Thanks all the same for the sentiment. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Oct 22, 2020 at 2:39 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > This is great work. Thanks Mark and Robert. > > That's the first time I've laughed today. Having turned the build-farm red, this is quite ironic feedback! Thanks all the same for the sentiment. Breaking the buildfarm is not a capital offense. Especially when it happens with patches that are in some sense low level and/or novel, and therefore inherently more likely to cause trouble. -- Peter Geoghegan
On Thu, Oct 22, 2020 at 4:04 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > I think the compiler warning was about fxid not being set. The callers pass NULL for status if they don't want status checked, so writing *status unconditionally would be an error. Also, if the xid being checked is out of bounds, we can't check the status of the xid in clog. Sorry, you're (partly) right. The new logic is a lot more clear that we never used that uninitialized. I'll remove the extra semi-colon and commit this. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes: >> On Oct 22, 2020, at 2:06 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Oh, wait a second. ItemIdData has the flag bits in the middle: >> meaning that for that particular bit pattern, one endianness >> is going to see the flags as 01 (LP_NORMAL) and the other as 10 >> (LP_REDIRECT). > Well, the issue is that on big-endian machines it is not reporting any > corruption at all. Are you sure the difference will be LP_NORMAL vs > LP_REDIRECT? [ thinks a bit harder... ] Probably not. The byte/bit string looks the same either way, given that it's four repetitions of the same byte value. But which field is which will differ: we have either oooooooooooooooFFlllllllllllllll 01110111011101110111011101110111 or lllllllllllllllFFooooooooooooooo 01110111011101110111011101110111 So now I think this is a REDIRECT on either architecture, but the offset and length fields have different values, causing the redirect pointer to point to different places. Maybe it happens to point at a DEAD tuple in the big-endian case. regards, tom lane
I wrote: > So now I think this is a REDIRECT on either architecture, but the > offset and length fields have different values, causing the redirect > pointer to point to different places. Maybe it happens to point > at a DEAD tuple in the big-endian case. Just to make sure, I tried this test program: #include <stdio.h> #include <string.h> typedef struct ItemIdData { unsigned lp_off:15, /* offset to tuple (from start of page) */ lp_flags:2, /* state of line pointer, see below */ lp_len:15; /* byte length of tuple */ } ItemIdData; int main() { ItemIdData lp; memset(&lp, 0x77, sizeof(lp)); printf("off = %x, flags = %x, len = %x\n", lp.lp_off, lp.lp_flags, lp.lp_len); return 0; } I get off = 7777, flags = 2, len = 3bbb on a little-endian machine, and off = 3bbb, flags = 2, len = 7777 on big-endian. It'd be less symmetric if the bytes weren't all the same ... regards, tom lane
I wrote: > I get > off = 7777, flags = 2, len = 3bbb > on a little-endian machine, and > off = 3bbb, flags = 2, len = 7777 > on big-endian. It'd be less symmetric if the bytes weren't > all the same ... ... but given that this is the test value we are using, why don't both endiannesses whine about a non-maxalign'd offset? The code really shouldn't even be trying to follow these redirects, because we risk SIGBUS on picky architectures. regards, tom lane
> On Oct 22, 2020, at 6:41 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I wrote: >> So now I think this is a REDIRECT on either architecture, but the >> offset and length fields have different values, causing the redirect >> pointer to point to different places. Maybe it happens to point >> at a DEAD tuple in the big-endian case. > > Just to make sure, I tried this test program: > > #include <stdio.h> > #include <string.h> > > typedef struct ItemIdData > { > unsigned lp_off:15, /* offset to tuple (from start of page) */ > lp_flags:2, /* state of line pointer, see below */ > lp_len:15; /* byte length of tuple */ > } ItemIdData; > > int main() > { > ItemIdData lp; > > memset(&lp, 0x77, sizeof(lp)); > printf("off = %x, flags = %x, len = %x\n", > lp.lp_off, lp.lp_flags, lp.lp_len); > return 0; > } > > I get > > off = 7777, flags = 2, len = 3bbb > > on a little-endian machine, and > > off = 3bbb, flags = 2, len = 7777 > > on big-endian. It'd be less symmetric if the bytes weren't > all the same ... I think we're going in the wrong direction here. The idea behind this test was to have as little knowledge about the layout of pages as possible and still verify that damaging the pages would result in corruption reports. Of course, not all damage will result in corruption reports, because some damage looks legit. I think it was just luck (good or bad depending on your perspective) that the damage in the test as committed works on little-endian but not big-endian. I can embed this knowledge that you have researched into the test if you want me to, but my instinct is to go the other direction and have even less knowledge about pages in the test. That would work if, instead of expecting corruption every time the test writes the file, we just have it make sure that it gets corruption reports at least some of the times that it does so. That seems more maintainable long term. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Oct 22, 2020, at 6:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I wrote: >> I get >> off = 7777, flags = 2, len = 3bbb >> on a little-endian machine, and >> off = 3bbb, flags = 2, len = 7777 >> on big-endian. It'd be less symmetric if the bytes weren't >> all the same ... > > ... but given that this is the test value we are using, why > don't both endiannesses whine about a non-maxalign'd offset? > The code really shouldn't even be trying to follow these > redirects, because we risk SIGBUS on picky architectures. Ahh, crud. It's because syswrite($fh, '\x77\x77\x77\x77', 500) is wrong twice. The 500 was wrong, but the string there isn't the bit pattern we want -- it's just a string literal with backslashes and such. It should have been double-quoted. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Oct 22, 2020, at 6:50 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > > >> On Oct 22, 2020, at 6:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> I wrote: >>> I get >>> off = 7777, flags = 2, len = 3bbb >>> on a little-endian machine, and >>> off = 3bbb, flags = 2, len = 7777 >>> on big-endian. It'd be less symmetric if the bytes weren't >>> all the same ... >> >> ... but given that this is the test value we are using, why >> don't both endiannesses whine about a non-maxalign'd offset? >> The code really shouldn't even be trying to follow these >> redirects, because we risk SIGBUS on picky architectures. > > Ahh, crud. It's because > > syswrite($fh, '\x77\x77\x77\x77', 500) > > is wrong twice. The 500 was wrong, but the string there isn't the bit pattern we want -- it's just a string literal with backslashes and such. It should have been double-quoted. The reason this never came up in testing is what I was talking about elsewhere -- this test isn't designed to create *specific* corruptions. It's just supposed to corrupt the table in some random way. For whatever reasons I'm not too curious about, that string corrupts on little endian machines but not big endian machines. If we want to have a test that tailors very specific corruptions, I don't think the way to get there is by debugging this test. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Mark Dilger <mark.dilger@enterprisedb.com> writes: > Ahh, crud. It's because > syswrite($fh, '\x77\x77\x77\x77', 500) > is wrong twice. The 500 was wrong, but the string there isn't the bit pattern we want -- it's just a string literal with backslashes and such. It should have been double-quoted. Argh. So we really have, using same test except memcpy(&lp, "\\x77", sizeof(lp)); little endian: off = 785c, flags = 2, len = 1b9b big endian: off = 2e3c, flags = 0, len = 3737 which explains the apparent LP_DEAD result. I'm not particularly on board with your suggestion of "well, if it works sometimes then it's okay". Then we have no idea of what we really tested. regards, tom lane
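To spell out the quoting problem in isolation (a standalone illustration, not the test code itself):

    use strict;
    use warnings;

    my $wrong = '\x77\x77\x77\x77';   # single quotes: 16 literal characters, backslashes included
    my $right = "\x77\x77\x77\x77";   # double quotes: the intended four 0x77 bytes

    printf "wrong: %d bytes, first byte 0x%02x\n", length($wrong), ord($wrong);  # 16 bytes, 0x5c ('\')
    printf "right: %d bytes, first byte 0x%02x\n", length($right), ord($right);  # 4 bytes, 0x77

    # So syswrite($fh, '\x77\x77\x77\x77', 500) writes at most those 16
    # literal characters, not 500 bytes of the intended bit pattern;
    # something like syswrite($fh, "\x77" x 500) would write 500 bytes of 0x77.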
> On Oct 22, 2020, at 7:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Mark Dilger <mark.dilger@enterprisedb.com> writes: >> Ahh, crud. It's because >> syswrite($fh, '\x77\x77\x77\x77', 500) >> is wrong twice. The 500 was wrong, but the string there isn't the bit pattern we want -- it's just a string literal with backslashes and such. It should have been double-quoted. > > Argh. So we really have, using same test except > > memcpy(&lp, "\\x77", sizeof(lp)); > > little endian: off = 785c, flags = 2, len = 1b9b > big endian: off = 2e3c, flags = 0, len = 3737 > > which explains the apparent LP_DEAD result. > > I'm not particularly on board with your suggestion of "well, if it works > sometimes then it's okay". Then we have no idea of what we really tested. > > regards, tom lane Ok, I've pruned it down to something you may like better. Instead of just checking that *some* corruption occurs, it checks the returned corruption against an expected regex, and if it fails to match, you should see in the logs what you got vs. what you expected. It only corrupts the first two line pointers, the first one with 0x77777777 and the second one with 0xAAAAAAAA, which are consciously chosen to be bitwise reverses of each other and just strings of alternating bits rather than anything that could have a more complicated interpretation. On my little-endian mac, the 0x77777777 value creates a line pointer which redirects to an invalid offset 0x7777, which gets reported as decimal 30583 in the corruption report, "line pointer redirection to item at offset 30583 exceeds maximum offset 38". The test is indifferent to whether the corruption it is looking for is reported relative to the first line pointer or the second one, so if endian-ness matters, it may be the 0xAAAAAAAA that results in that corruption message. I don't have a machine handy to test that. It would be nice to determine the minimum amount of paranoia necessary to make this portable and not commit the rest. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
> On Oct 22, 2020, at 9:21 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > > >> On Oct 22, 2020, at 7:01 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> Mark Dilger <mark.dilger@enterprisedb.com> writes: >>> Ahh, crud. It's because >>> syswrite($fh, '\x77\x77\x77\x77', 500) >>> is wrong twice. The 500 was wrong, but the string there isn't the bit pattern we want -- it's just a string literal with backslashes and such. It should have been double-quoted. >> >> Argh. So we really have, using same test except >> >> memcpy(&lp, "\\x77", sizeof(lp)); >> >> little endian: off = 785c, flags = 2, len = 1b9b >> big endian: off = 2e3c, flags = 0, len = 3737 >> >> which explains the apparent LP_DEAD result. >> >> I'm not particularly on board with your suggestion of "well, if it works >> sometimes then it's okay". Then we have no idea of what we really tested. >> >> regards, tom lane > > Ok, I've pruned it down to something you may like better. Instead of just checking that *some* corruption occurs, it checks the returned corruption against an expected regex, and if it fails to match, you should see in the logs what you got vs. what you expected. > > It only corrupts the first two line pointers, the first one with 0x77777777 and the second one with 0xAAAAAAAA, which are consciously chosen to be bitwise reverses of each other and just strings of alternating bits rather than anything that could have a more complicated interpretation. > > On my little-endian mac, the 0x77777777 value creates a line pointer which redirects to an invalid offset 0x7777, which gets reported as decimal 30583 in the corruption report, "line pointer redirection to item at offset 30583 exceeds maximum offset 38". The test is indifferent to whether the corruption it is looking for is reported relative to the first line pointer or the second one, so if endian-ness matters, it may be the 0xAAAAAAAA that results in that corruption message. I don't have a machine handy to test that. It would be nice to determine the minimum amount of paranoia necessary to make this portable and not commit the rest. Obviously, that should have said 0x55555555 and 0xAAAAAAAA. After writing the patch that way, I checked that the old value 0x77777777 also works on my mac, which it does, and checked that writing the line pointers starting at offset 24 rather than 32 works on my mac, which it does, and then went on to write this rather confused email and attached the patch with those changes, which all work (at least on my mac) but are potentially less portable than what I had before testing those changes. I apologize for any confusion my email from last night may have caused. The patch I *should* have attached last night this time: — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
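As a concrete illustration of the corruption step described above; the relation file path is hypothetical, and the offsets simply assume the standard 24-byte page header followed by 4-byte line pointers:

    use strict;
    use warnings;

    my $relpath = 'base/12345/16384';   # hypothetical relation file path

    open(my $fh, '+<', $relpath) or die "open: $!";
    binmode($fh);

    # Overwrite the first two line pointers (4 bytes each, immediately after
    # the 24-byte page header) with complementary alternating-bit patterns,
    # written in native byte order.
    sysseek($fh, 24, 0) or die "sysseek: $!";
    syswrite($fh, pack('LL', 0x55555555, 0xAAAAAAAA), 8) or die "syswrite: $!";
    close($fh);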
Mark Dilger <mark.dilger@enterprisedb.com> writes: > The patch I *should* have attached last night this time: Thanks, I'll do some big-endian testing with this. regards, tom lane
I wrote: > Mark Dilger <mark.dilger@enterprisedb.com> writes: >> The patch I *should* have attached last night this time: > Thanks, I'll do some big-endian testing with this. Seems to work, so I pushed it (after some compulsive fooling about with whitespace and perltidy-ing). It appears to me that the code coverage for verify_heapam.c is not very good though, only circa 50%. Do we care to expend more effort on that? regards, tom lane
> On Oct 23, 2020, at 11:04 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > I wrote: >> Mark Dilger <mark.dilger@enterprisedb.com> writes: >>> The patch I *should* have attached last night this time: > >> Thanks, I'll do some big-endian testing with this. > > Seems to work, so I pushed it (after some compulsive fooling > about with whitespace and perltidy-ing). Thanks for all the help! > It appears to me that > the code coverage for verify_heapam.c is not very good though, > only circa 50%. Do we care to expend more effort on that? Part of the issue here is that I developed the heapcheck code as a sequence of patches, and there is much greater coverage in the tests in the 0002 patch, which hasn't been committed yet. (Nor do I know that it ever will be.) Over time, the patch set became: 0001 -- adds verify_heapam.c to contrib/amcheck, with basic test coverage 0002 -- adds pg_amcheck command line interface to contrib/pg_amcheck, with more extensive test coverage 0003 -- creates a non-throwing interface to clog 0004 -- uses the non-throwing clog interface from within verify_heapam 0005 -- adds some controversial acl checks to verify_heapam Your question doesn't have much to do with 3,4,5 above, but it definitely matters whether we're going to commit 0002. The test in that patch, in contrib/pg_amcheck/t/004_verify_heapam.pl, does quite a bit of bit twiddling of heap tuples and toast records and checks that the right corruption messages come back. Part of the reason I was trying to keep 0001's t/001_verify_heapam.pl test ignorant of the exact page layout information is that I already had this other test that covers that. So, should I port that test from (currently non-existent) contrib/pg_amcheck into contrib/amcheck, or should we wait to see if the 0002 patch is going to get committed? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hmm, we're not out of the woods yet: thorntail is even less happy than before. https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2020-10-23%2018%3A08%3A11 I do not have 64-bit big-endian hardware to play with unfortunately. But what I suspect is happening here is less about endianness and more about alignment pickiness; or maybe we were unlucky enough to index off the end of the shmem segment. I see that verify_heapam does this for non-redirect tuples: /* Set up context information about this next tuple */ ctx.lp_len = ItemIdGetLength(ctx.itemid); ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, ctx.itemid); ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr); with absolutely no thought for the possibility that lp_off is out of range or not maxaligned. The checks for a sane lp_len seem to have gone missing as well. regards, tom lane
On Fri, Oct 23, 2020 at 11:51 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > /* Set up context information about this next tuple */ > ctx.lp_len = ItemIdGetLength(ctx.itemid); > ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, ctx.itemid); > ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr); > > with absolutely no thought for the possibility that lp_off is out of > range or not maxaligned. The checks for a sane lp_len seem to have > gone missing as well. That is surprising. verify_nbtree.c has PageGetItemIdCareful() for this exact reason. -- Peter Geoghegan
> On Oct 23, 2020, at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Hmm, we're not out of the woods yet: thorntail is even less happy > than before. > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=thorntail&dt=2020-10-23%2018%3A08%3A11 > > I do not have 64-bit big-endian hardware to play with unfortunately. > But what I suspect is happening here is less about endianness and > more about alignment pickiness; or maybe we were unlucky enough to > index off the end of the shmem segment. I see that verify_heapam > does this for non-redirect tuples: > > /* Set up context information about this next tuple */ > ctx.lp_len = ItemIdGetLength(ctx.itemid); > ctx.tuphdr = (HeapTupleHeader) PageGetItem(ctx.page, ctx.itemid); > ctx.natts = HeapTupleHeaderGetNatts(ctx.tuphdr); > > with absolutely no thought for the possibility that lp_off is out of > range or not maxaligned. The checks for a sane lp_len seem to have > gone missing as well. You certainly appear to be right about that. I've added the extra checks, and extended the regression test to include them. Patch attached. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Mark Dilger <mark.dilger@enterprisedb.com> writes: >> On Oct 23, 2020, at 11:51 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I do not have 64-bit big-endian hardware to play with unfortunately. >> But what I suspect is happening here is less about endianness and >> more about alignment pickiness; or maybe we were unlucky enough to >> index off the end of the shmem segment. > You certainly appear to be right about that. I've added the extra checks, and extended the regression test to include them. Patch attached. Meanwhile, I've replicated the SIGBUS problem on gaur's host, so that's definitely what's happening. (Although PPC is also alignment-picky on the hardware level, I believe that both macOS and Linux try to mask that by having kernel trap handlers execute unaligned accesses, leaving only a nasty performance loss behind. That's why I failed to see this effect when checking your previous patch on an old Apple box. We likely won't see it in the buildfarm either, unless maybe on Noah's AIX menagerie.) I'll check this patch on gaur and push it if it's clean. regards, tom lane
Mark Dilger <mark.dilger@enterprisedb.com> writes: > You certainly appear to be right about that. I've added the extra checks, and extended the regression test to include them. Patch attached. Pushed with some more fooling about. The "bit reversal" idea is not a sufficient guide to picking values that will hit all the code checks. For instance, I was seeing out-of-range warnings on one endianness and not the other because on the other one the maxalign check rejected the value first. I ended up manually tweaking the corruption patterns until they hit all the tests on both endiannesses. Pretty much the opposite of black-box testing, but it's not like our notions of line pointer layout are going to change anytime soon. I made some logic rearrangements in the C code, too. regards, tom lane
> On Oct 23, 2020, at 4:12 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Mark Dilger <mark.dilger@enterprisedb.com> writes: >> You certainly appear to be right about that. I've added the extra checks, and extended the regression test to include them. Patch attached. > > Pushed with some more fooling about. The "bit reversal" idea is not > a sufficient guide to picking values that will hit all the code checks. > For instance, I was seeing out-of-range warnings on one endianness and > not the other because on the other one the maxalign check rejected the > value first. I ended up manually tweaking the corruption patterns > until they hit all the tests on both endiannesses. Pretty much the > opposite of black-box testing, but it's not like our notions of line > pointer layout are going to change anytime soon. > > I made some logic rearrangements in the C code, too. Thanks Tom! And Peter, your comment earlier saved me some time. Thanks to you, also! — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Fri, Oct 23, 2020 at 2:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Seems to work, so I pushed it (after some compulsive fooling > about with whitespace and perltidy-ing). It appears to me that > the code coverage for verify_heapam.c is not very good though, > only circa 50%. Do we care to expend more effort on that? There are two competing goods here. On the one hand, more test coverage is better than less. On the other hand, finicky tests that have platform-dependent results or fail for strange reasons not indicative of actual problems with the code are often judged not to be worth the trouble. An early version of this patch set had a very extensive chunk of Perl code in it that actually understood the page layout and, if we adopt something like that, it would probably be easier to test a whole bunch of scenarios. The downside is that it was a lot of code that basically duplicated a lot of backend logic in Perl, and I was (and am) afraid that people will complain about the amount of code and/or the difficulty of maintaining it. On the other hand, having all that code might allow better testing not only of this particular patch but also other scenarios involving corrupted pages, so maybe it's wrong to view all that code as a burden that we have to carry specifically to test this; or, alternatively, maybe it's worth carrying even if we only use it for this. On the third hand, as Mark points out, if we get 0002 committed, that will help somewhat with test coverage even if we do nothing else. Thanks for committing (and adjusting) the patches for the existing buildfarm failures. If I understand the buildfarm results correctly, hornet is still unhappy even after 321633e17b07968e68ca5341429e2c8bbf15c331? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Oct 26, 2020, at 6:37 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Oct 23, 2020 at 2:04 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Seems to work, so I pushed it (after some compulsive fooling >> about with whitespace and perltidy-ing). It appears to me that >> the code coverage for verify_heapam.c is not very good though, >> only circa 50%. Do we care to expend more effort on that? > > There are two competing goods here. On the one hand, more test > coverage is better than less. On the other hand, finicky tests that > have platform-dependent results or fail for strange reasons not > indicative of actual problems with the code are often judged not to be > worth the trouble. An early version of this patch set had a very > extensive chunk of Perl code in it that actually understood the page > layout and, if we adopt something like that, it would probably be > easier to test a whole bunch of scenarios. The downside is that it was > a lot of code that basically duplicated a lot of backend logic in > Perl, and I was (and am) afraid that people will complain about the > amount of code and/or the difficulty of maintaining it. On the other > hand, having all that code might allow better testing not only of this > particular patch but also other scenarios involving corrupted pages, > so maybe it's wrong to view all that code as a burden that we have to > carry specifically to test this; or, alternatively, maybe it's worth > carrying even if we only use it for this. On the third hand, as Mark > points out, if we get 0002 committed, that will help somewhat with > test coverage even if we do nothing else. Much of the test in 0002 could be ported to work without committing the rest of 0002, if the pg_amcheck command line utility is not wanted. > > Thanks for committing (and adjusting) the patches for the existing > buildfarm failures. If I understand the buildfarm results correctly, > hornet is still unhappy even after > 321633e17b07968e68ca5341429e2c8bbf15c331? That appears to be a failed test for pg_surgery rather than for amcheck. Or am I reading the log wrong? — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Oct 26, 2020 at 9:56 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > Much of the test in 0002 could be ported to work without committing the rest of 0002, if the pg_amcheck command line utility is not wanted. How much consensus do we think we have around 0002 at this point? I think I remember a vote in favor and no votes against, but I haven't been paying a whole lot of attention. > > Thanks for committing (and adjusting) the patches for the existing > buildfarm failures. If I understand the buildfarm results correctly, > hornet is still unhappy even after > 321633e17b07968e68ca5341429e2c8bbf15c331? > > That appears to be a failed test for pg_surgery rather than for amcheck. Or am I reading the log wrong? Oh, yeah, you're right. I don't know why it just failed now, though: there are a bunch of successful runs preceding it. But I guess it's unrelated to this thread. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Oct 26, 2020, at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Oct 26, 2020 at 9:56 AM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >> Much of the test in 0002 could be ported to work without committing the rest of 0002, if the pg_amcheck command line utility is not wanted. > > How much consensus do we think we have around 0002 at this point? I > think I remember a vote in favor and no votes against, but I haven't > been paying a whole lot of attention. My sense over the course of the thread is that people were very much in favor of having heap checking functionality, but quite vague on whether they wanted the command line interface. I think the interface is useful, but I'd rather hear from others on this list whether it is useful enough to justify maintaining it. As the author of it, I'm biased. Hopefully others with a more objective view of the matter will read this and vote? I don't recall patches 0003 through 0005 getting any votes. 0003 and 0004, which create and use a non-throwing interface to clog, were written in response to Andrey's request, so I'm guessing that's kind of a vote in favor. 0005 was factored out of 0001 in response to a lack of agreement about whether verify_heapam should have acl checks. You seemed in favor, and Peter against, but I don't think we heard other opinions. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > On Mon, Oct 26, 2020 at 9:56 AM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >>> hornet is still unhappy even after >>> 321633e17b07968e68ca5341429e2c8bbf15c331? >> That appears to be a failed test for pg_surgery rather than for amcheck. Or am I reading the log wrong? > Oh, yeah, you're right. I don't know why it just failed now, though: > there are a bunch of successful runs preceding it. But I guess it's > unrelated to this thread. pg_surgery's been unstable since it went in. I believe Andres is working on a fix. regards, tom lane
Hi, On October 26, 2020 7:13:15 AM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote: >Robert Haas <robertmhaas@gmail.com> writes: >> On Mon, Oct 26, 2020 at 9:56 AM Mark Dilger >> <mark.dilger@enterprisedb.com> wrote: >>>> hornet is still unhappy even after >>>> 321633e17b07968e68ca5341429e2c8bbf15c331? > >>> That appears to be a failed test for pg_surgery rather than for >amcheck. Or am I reading the log wrong? > >> Oh, yeah, you're right. I don't know why it just failed now, though: >> there are a bunch of successful runs preceding it. But I guess it's >> unrelated to this thread. > >pg_surgery's been unstable since it went in. I believe Andres is >working on a fix. I posted one a while ago - was planning to push a cleaned up version soon if nobody comments in the near future. Andres -- Sent from my Android device with K-9 Mail. Please excuse my brevity.
> On Oct 26, 2020, at 7:08 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > > >> On Oct 26, 2020, at 7:00 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> On Mon, Oct 26, 2020 at 9:56 AM Mark Dilger >> <mark.dilger@enterprisedb.com> wrote: >>> Much of the test in 0002 could be ported to work without committing the rest of 0002, if the pg_amcheck command line utility is not wanted. >> >> How much consensus do we think we have around 0002 at this point? I >> think I remember a vote in favor and no votes against, but I haven't >> been paying a whole lot of attention. > > My sense over the course of the thread is that people were very much in favor of having heap checking functionality, but quite vague on whether they wanted the command line interface. I think the interface is useful, but I'd rather hear from others on this list whether it is useful enough to justify maintaining it. As the author of it, I'm biased. Hopefully others with a more objective view of the matter will read this and vote? > > I don't recall patches 0003 through 0005 getting any votes. 0003 and 0004, which create and use a non-throwing interface to clog, were written in response to Andrey's request, so I'm guessing that's kind of a vote in favor. 0005 was factored out of 0001 in response to a lack of agreement about whether verify_heapam should have acl checks. You seemed in favor, and Peter against, but I don't think we heard other opinions. The v20 patches 0002, 0003, and 0005 still apply cleanly, but 0004 required a rebase. (0001 was already committed last week.) Here is a rebased set of 4 patches, numbered 0002..0005 to be consistent with the previous naming. There are no substantial changes. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Robert Haas <robertmhaas@gmail.com> writes: > On Wed, Oct 21, 2020 at 11:45 PM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >> Done that way in the attached, which also include Robert's changes from v19 he posted earlier today. > Committed. Let's see what the buildfarm thinks. Another thing that the buildfarm is pointing out is [WARN] FOUserAgent - The contents of fo:block line 2 exceed the available area in the inline-progression direction by more than 50 points. (See position 148863:380) This is coming from the sample output for verify_heapam(), which is too wide to fit in even a normal-size browser window, let alone A4 PDF. While we could perhaps hack it up to allow more line breaks, or see if \x formatting helps, my own suggestion would be to just nuke the sample output altogether. It doesn't look like it is any sort of representative real output, and it is not useful enough to be worth spending time to patch up. regards, tom lane
> On Oct 26, 2020, at 9:12 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Robert Haas <robertmhaas@gmail.com> writes: >> On Wed, Oct 21, 2020 at 11:45 PM Mark Dilger >> <mark.dilger@enterprisedb.com> wrote: >>> Done that way in the attached, which also include Robert's changes from v19 he posted earlier today. > >> Committed. Let's see what the buildfarm thinks. > > Another thing that the buildfarm is pointing out is > > [WARN] FOUserAgent - The contents of fo:block line 2 exceed the available area in the inline-progression direction by more than 50 points. (See position 148863:380) > > This is coming from the sample output for verify_heapam(), which is too > wide to fit in even a normal-size browser window, let alone A4 PDF. > > While we could perhaps hack it up to allow more line breaks, or see > if \x formatting helps, my own suggestion would be to just nuke the > sample output altogether. Ok. > It doesn't look like it is any sort of > representative real output, It is not. It came from artificially created corruption in the regression tests. I may even have manually edited that, though I don't recall. > and it is not useful enough to be worth > spending time to patch up. Ok. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Oct 26, 2020 at 12:12 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > The v20 patches 0002, 0003, and 0005 still apply cleanly, but 0004 required a rebase. (0001 was already committed lastweek.) > > Here is a rebased set of 4 patches, numbered 0002..0005 to be consistent with the previous naming. There are no substantialchanges. Here's a review of 0002. I basically like the direction this is going but I guess nobody will be surprised that there are some things in here that I think could be improved. +const char *usage_text[] = { + "pg_amcheck is the PostgreSQL command line frontend for the amcheck database corruption checker.", + "", This looks like a novel approach to the problem of printing out the usage() information, and I think that it's inferior to the technique used elsewhere of just having a bunch of printf() statements, because unless I misunderstand, it doesn't permit localization. + " -b, --startblock begin checking table(s) at the given starting block number", + " -e, --endblock check table(s) only up to the given ending block number", + " -B, --toast-startblock begin checking toast table(s) at the given starting block", + " -E, --toast-endblock check toast table(s) only up to the given ending block", I am not very convinced by this. What's the use case? If you're just checking a single table, you might want to specify a start and end block, but then you don't need separate options for the TOAST and non-TOAST cases, do you? If I want to check pg_statistic, I'll say pg_amcheck -t pg_catalog.pg_statistic. If I want to check the TOAST table for pg_statistic, I'll say pg_amcheck -t pg_toast.pg_toast_2619. In either case, if I want to check just the first three blocks, I can add -b 0 -e 2. + " -f, --skip-all-frozen do NOT check blocks marked as all frozen", + " -v, --skip-all-visible do NOT check blocks marked as all visible", I think this is using up too many one character option names for too little benefit on things that are too closely related. How about, -s, --skip=all-frozen|all-visible|none? And then -v could mean verbose, which could trigger things like printing all the queries sent to the server, setting PQERRORS_VERBOSE, etc. + " -x, --check-indexes check btree indexes associated with tables being checked", + " -X, --skip-indexes do NOT check any btree indexes", + " -i, --index=PATTERN check the specified index(es) only", + " -I, --exclude-index=PATTERN do NOT check the specified index(es)", This is a lotta controls for something that has gotta have some default. Either the default is everything, in which case I don't see why I need -x, or it's nothing, in which case I don't see why I need -X. + " -c, --check-corrupt check indexes even if their associated table is corrupt", + " -C, --skip-corrupt do NOT check indexes if their associated table is corrupt", Ditto. (I think the default be to check corrupt, and there can be an option to skip it.) + " -a, --heapallindexed check index tuples against the table tuples", + " -A, --no-heapallindexed do NOT check index tuples against the table tuples", Ditto. (Not sure what the default should be, though.) + " -r, --rootdescend search from the root page for each index tuple", + " -R, --no-rootdescend do NOT search from the root page for each index tuple", Ditto. (Again, not sure about the default.) I'm also not sure if these descriptions are clear enough, but it may also be hard to do a good job in a brief space. Still, comparing this to the documentation of heapallindexed makes me rather nervous. 
This is only trying to verify that the index contains all the tuples in the heap, not that the values in the heap and index tuples actually match. +typedef struct +AmCheckSettings +{ + char *dbname; + char *host; + char *port; + char *username; +} ConnectOptions; Making the struct name different from the type name seems not good, and the struct name also shouldn't be on a separate line. +typedef enum trivalue +{ + TRI_DEFAULT, + TRI_NO, + TRI_YES +} trivalue; Ugh. It's not this patch's fault, but we really oughta move this to someplace more centralized. +typedef struct ... +} AmCheckSettings; I'm not sure I consider all of these things settings, "db" in particular. But maybe that's nitpicking. +static void expand_schema_name_patterns(const SimpleStringList *patterns, + const SimpleOidList *exclude_oids, + SimpleOidList *oids + bool strict_names); This is copied from pg_dump, along with I think at least one other function from nearby. Unlike the trivalue case above, this would be the first duplication of this logic. Can we push this stuff into pgcommon, perhaps? + /* + * Default behaviors for user settable options. Note that these default + * to doing all the safe checks and none of the unsafe ones, on the theory + * that if a user says "pg_amcheck mydb" without specifying any additional + * options, we should check everything we know how to check without + * risking any backend aborts. + */ This to me seems too conservative. The result is that by default we check only tables, not indexes. I don't think that's going to be what users want. I don't know whether they want the heapallindexed or rootdescend behaviors for index checks, but I think they want their indexes checked. Happy to hear opinions from actual users on what they want; this is just me guessing that you've guessed wrong. :-) + if (settings.db == NULL) + { + pg_log_error("no connection to server after initial attempt"); + exit(EXIT_BADCONN); + } I think this is documented as meaning out of memory, and reported that way elsewhere. Anyway I am going to keep complaining until there are no cases where we tell the user it broke without telling them what broke. Which means this bit is a problem too: + if (!settings.db) + { + pg_log_error("no connection to server"); + exit(EXIT_BADCONN); + } Something went wrong, good luck figuring out what it was! + /* + * All information about corrupt indexes are returned via ereport, not as + * tuples. We want all the details to report if corruption exists. + */ + PQsetErrorVerbosity(settings.db, PQERRORS_VERBOSE); Really? Why? If I need the source code file name, function name, and line number to figure out what went wrong, that is not a great sign for the quality of the error reports it produces. + /* + * The btree checking logic which optionally checks the contents + * of an index against the corresponding table has not yet been + * sufficiently hardened against corrupt tables. In particular, + * when called with heapallindexed true, it segfaults if the file + * backing the table relation has been erroneously unlinked. In + * any event, it seems unwise to reconcile an index against its + * table when we already know the table is corrupt. + */ + old_heapallindexed = settings.heapallindexed; + if (corruptions) + settings.heapallindexed = false; This seems pretty lame to me. Even if the btree checker can't tolerate corruption to the extent that the heap checker does, seg faulting because of a missing file seems like a bug that we should just fix (and probably back-patch). 
I'm not very convinced by the decision to override the user's decision about heapallindexed either. Maybe I lack imagination, but that seems pretty arbitrary. Suppose there's a giant index which is missing entries for 5 million heap tuples and also there's 1 entry in the table which has an xmin that is less than the pg_clas.relfrozenxid value by 1. You are proposing that because I have the latter problem I don't want you to check for the former one. But I, John Q. Smartuser, do not want you to second-guess what I told you on the command line that I wanted. :-) I think in general you're worrying too much about the possibility of this tool causing backend crashes. I think it's good that you wrote the heapcheck code in a way that's hardened against that, and I think we should try to harden other things as time permits. But I don't think that the remote possibility of a crash due to the lack of such hardening should dictate the design behavior of this tool. If the crash possibilities are not remote, then I think the solution is to fix them, rather than cutting out important checks. It doesn't seem like great design to me that get_table_check_list() gets just the OID of the table itself, and then later if we decide to check the TOAST table we've got to run a separate query for each table we want to check to fetch the TOAST OID, when we could've just fetched both in get_table_check_list() by including two columns in the query rather than one and it would've been basically free. Imagine if some user wrote a query that fetched the primary key value for all their rows and then had their application run a separate query to fetch the entire contents of each of those rows, said contents consisting of one more integer. And then suppose they complained about performance. We'd tell them they were doing it wrong, and so here. + if (settings.db == NULL) + fatal("no connection on entry to check_table"); Uninformative. Is this basically an Assert? If so maybe just make it one. If not maybe fail somewhere else with a better message? + if (startblock == NULL) + startblock = "NULL"; + if (endblock == NULL) + endblock = "NULL"; It seems like it would be more elegant to initialize settings.startblock and settings.endblock to "NULL." However, there's also a related problem, which is that the startblock and endblock values can be anything, and are interpolated with quoting. I don't think that it's good to ship a tool with SQL injection hazards built into it. I think that you should (a) check that these values are integers during argument parsing and error out if they are not and then (b) use either a prepared query or PQescapeLiteral() anyway. + stop = (on_error_stop) ? "true" : "false"; + toast = (check_toast) ? "true" : "false"; The parens aren't really needed here. + printf("(relname=%s,blkno=%s,offnum=%s,attnum=%s)\n%s\n", + PQgetvalue(res, i, 0), /* relname */ + PQgetvalue(res, i, 1), /* blkno */ + PQgetvalue(res, i, 2), /* offnum */ + PQgetvalue(res, i, 3), /* attnum */ + PQgetvalue(res, i, 4)); /* msg */ I am not quite sure how to format the output, but this looks like something designed by an engineer who knows too much about the topic. I suspect users won't find the use of things like "relname" and "blkno" too easy to understand. At least I think we should say "relation, block, offset, attribute" instead of "relname, blkno, offnum, attnum". 
I would probably drop the parenthesis and add spaces, so that you end up with something like: relation "%s", block "%s", offset "%s", attribute "%s": I would also define variant strings so that we entirely omit things that are NULL. e.g. have four strings: relation "%s": relation "%s", block "%s":( relation "%s", block "%s", offset "%s": relation "%s", block "%s", offset "%s", attribute "%s": Would it make it more readable if we indented the continuation line by four spaces or something? + corruption_cnt++; + printf("%s\n", error); + pfree(error); Seems like we could still print the relation name in this case, and that it would be a good idea to do so, in case it's not in the message that the server returns. The general logic in this part of the code looks a bit strange to me. If ExecuteSqlQuery() returns PGRES_TUPLES_OK, we print out the details for each returned row. Otherwise, if error = true, we print the error. But, what if neither of those things are the case? Then we'd just print nothing despite having gotten back some weird response from the server. That actually can't happen, because ExecuteSqlQuery() always sets *error when the return code is not PGRES_TUPLES_OK, but you wouldn't know that from looking at this code. Honestly, as written, ExecSqlQuery() seems like kind of a waste. The OrDie() version is useful as a notational shorthand, but this version seems to add more confusion than clarity. It has only three callers: the ones in check_table() and check_indexes() have the problem described above, and the one in get_toast_oid() could just as well be using the OrDie() version. And also we should probably get rid of it entirely by fetching the toast OIDs the first time around, as mentioned above. check_indexes() lacks a function comment. It seems to have more or less the same problem as get_toast_oid() -- an extra query per table to get the list of indexes. I guess it has a better excuse: there could be lots of indexes per table, and we're fetching multiple columns of data for each one, whereas in the TOAST case we are issuing an extra query per table to fetch a single integer. But, couldn't we fetch information about all the indexes we want to check in one go, rather than fetching them separately for each table being checked? I'm not sure if that would create too much other complexity, but it seems like it would be quicker. + if (settings.db == NULL) + fatal("no connection on entry to check_index"); + if (idxname == NULL) + fatal("no index name on entry to check_index"); + if (tblname == NULL) + fatal("no table name on entry to check_index"); Again, probably these should be asserts, or if they're not, the error should be reported better and maybe elsewhere. Similarly in some other places, like expand_schema_name_patterns(). + * The loop below runs multiple SELECTs might sometimes result in + * duplicate entries in the Oid list, but we don't care. This is missing a which, like the place you copied it from, but the version in pg_dumpall.c is better. expand_table_name_patterns() should be reformatted to not gratuitously exceed 80 columns. Ditto for expand_index_name_patterns(). I sort of expected that this patch might use threads to allow parallel checking - seems like it would be a useful feature. I originally intended to review the docs and regression tests in the same email as the patch itself, but this email has gotten rather long and taken rather longer to get together than I had hoped, so I'm going to stop here for now and come back to that stuff. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
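A minimal sketch of the suggestion above to validate --startblock/--endblock as integers during argument parsing, instead of interpolating the raw strings into SQL; the function name and error wording here are hypothetical, not taken from the patch:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Parse a block-number option strictly; reject anything non-numeric. */
    static long
    parse_block_number(const char *optname, const char *value)
    {
        char       *endptr;
        long        blkno;

        errno = 0;
        blkno = strtol(value, &endptr, 10);
        if (errno != 0 || endptr == value || *endptr != '\0' || blkno < 0)
        {
            fprintf(stderr, "invalid block number for option %s: \"%s\"\n",
                    optname, value);
            exit(1);
        }
        return blkno;
    }

A value parsed this way can be written into the query with a plain %ld, so no quoting or escaping is needed at all.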
On Thu, Nov 19, 2020 at 9:06 AM Robert Haas <robertmhaas@gmail.com> wrote: > I'm also not sure if these descriptions are clear enough, but it may > also be hard to do a good job in a brief space. Still, comparing this > to the documentation of heapallindexed makes me rather nervous. This > is only trying to verify that the index contains all the tuples in the > heap, not that the values in the heap and index tuples actually match. That's a good point. As things stand, heapallindexed verification does not notice when there are extra index tuples in the index that are in some way inconsistent with the heap. Hopefully this isn't too much of a problem in practice because the presence of extra spurious tuples gets detected by the index structure verification process. But in general that might not happen. Ideally heapallindex verification would verify 1:1 correspondence. It doesn't do that right now, but it could. This could work by having two bloom filters -- one for the heap, another for the index. The implementation would look for the absence of index tuples that should be in the index initially, just like today. But at the end it would modify the index bloom filter by &= it with the complement of the heap bloom filter. If any bits are left set in the index bloom filter, we go back through the index once more and locate index tuples that have at least some matching bits in the index bloom filter (we cannot expect all of the bits from each of the hash functions used by the bloom filter to still be matches). From here we can do some kind of lookup for maybe-not-matching index tuples that we locate. Make sure that they point to an LP_DEAD line item in the heap or something. Make sure that they have the same values as the heap tuple if they're still retrievable (i.e. if we haven't pruned the heap tuple away already). > This to me seems too conservative. The result is that by default we > check only tables, not indexes. I don't think that's going to be what > users want. I don't know whether they want the heapallindexed or > rootdescend behaviors for index checks, but I think they want their > indexes checked. Happy to hear opinions from actual users on what they > want; this is just me guessing that you've guessed wrong. :-) My thoughts on these two options: * I don't think that users will ever want rootdescend verification. That option exists now because I wanted to have something that relied on the uniqueness property of B-Tree indexes following the Postgres 12 work. I didn't add retail index tuple deletion, so it seemed like a good idea to have something that makes the same assumptions that it would have to make. To validate the design. Another factor is that Alexander Korotkov made the basic bt_index_parent_check() tests a lot better for Postgres 13. This undermined the practical argument for using rootdescend verification. Finally, note that bt_index_parent_check() was always supposed to be something that was to be used only when you already knew that you had big problems, and wanted absolutely thorough verification without regard for the costs. This isn't the common case at all. It would be reasonable to not expose anything from bt_index_parent_check() at all, or to give it much less prominence. Not really sure of what the right balance is here myself, so I'm not insisting on anything. Just telling you what I know about it. * heapallindexed is kind of expensive, but valuable. But the extra check is probably less likely to help on the second or subsequent index on a table. 
It might be worth considering an option that only uses it with only one index: Preferably the primary key index, failing that some unique index, and failing that some other index. > This seems pretty lame to me. Even if the btree checker can't tolerate > corruption to the extent that the heap checker does, seg faulting > because of a missing file seems like a bug that we should just fix > (and probably back-patch). I'm not very convinced by the decision to > override the user's decision about heapallindexed either. I strongly agree. > Maybe I lack > imagination, but that seems pretty arbitrary. Suppose there's a giant > index which is missing entries for 5 million heap tuples and also > there's 1 entry in the table which has an xmin that is less than the > pg_clas.relfrozenxid value by 1. You are proposing that because I have > the latter problem I don't want you to check for the former one. But > I, John Q. Smartuser, do not want you to second-guess what I told you > on the command line that I wanted. :-) Even if your user is just average, they still have one major advantage over the architects of pg_amcheck: actual knowledge of the problem in front of them. > I think in general you're worrying too much about the possibility of > this tool causing backend crashes. I think it's good that you wrote > the heapcheck code in a way that's hardened against that, and I think > we should try to harden other things as time permits. But I don't > think that the remote possibility of a crash due to the lack of such > hardening should dictate the design behavior of this tool. If the > crash possibilities are not remote, then I think the solution is to > fix them, rather than cutting out important checks. I couldn't agree more. I think that you need to have a kind of epistemic modesty with this stuff. Okay, we guarantee that the backend won't crash when certain amcheck functions are run, based on these caveats. But don't we always guarantee something like that? And are the specific caveats actually that different in each case, when you get right down to it? A guarantee does not exist in a vacuum. It always has implicit limitations. For example, any guarantee implicitly comes with the caveat "unless I, the guarantor, am wrong". Normally this doesn't really matter because normally we're not concerned about extreme events that will probably never happen even once. But amcheck is very much not like that. The chances of the guarantor being the weakest link are actually rather high. Everyone is better off with a design that accepts this view of things. I'm also suspicious of guarantees like this for less philosophical reasons. It seems to me like it solves our problem rather than the user's problem. Having data that is so badly corrupt that it's difficult to avoid segfaults when we perform some kind of standard transformations on it is an appalling state of affairs for the user. The segfault itself is very much not the point at all. We should focus on making the tool as thorough and low overhead as possible. If we have to make the tool significantly more complicated to avoid extremely unlikely segfaults then we're actually doing the user a disservice, because we're increasing the chances that we the guarantors will be the weakest link (which was already high enough). This smacks of hubris. I also agree that hardening is a worthwhile exercise here, of course. We should be holding amcheck to a higher standard when it comes to not segfaulting with corrupt data. -- Peter Geoghegan
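To make the two-filter reconciliation step above concrete, here is a toy sketch of the bitwise part; it uses plain bitmap words rather than amcheck's actual bloom filter type, and every name is made up for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    #define FILTER_WORDS 1024        /* arbitrary size for illustration */

    typedef struct
    {
        uint64_t    words[FILTER_WORDS];
    } toy_filter;

    /*
     * index_filter &= ~heap_filter: clear every bit the heap scan set, leaving
     * only bits contributed by index tuples with no heap counterpart.
     */
    static void
    subtract_heap_bits(toy_filter *index_filter, const toy_filter *heap_filter)
    {
        for (int i = 0; i < FILTER_WORDS; i++)
            index_filter->words[i] &= ~heap_filter->words[i];
    }

    /* If any bit survives, a second pass over the index is warranted. */
    static bool
    needs_second_index_pass(const toy_filter *index_filter)
    {
        for (int i = 0; i < FILTER_WORDS; i++)
            if (index_filter->words[i] != 0)
                return true;
        return false;
    }

During that second pass an index tuple is suspect if some of its hash bits remain set; as noted above, not all of its bits can be expected to survive, since other heap tuples may have happened to set a few of them.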
On Thu, Nov 19, 2020 at 2:48 PM Peter Geoghegan <pg@bowt.ie> wrote: > Ideally heapallindex verification would verify 1:1 correspondence. It > doesn't do that right now, but it could. Well, that might be a cool new mode, but it doesn't necessarily have to supplant the thing we have now. The problem immediately before us is just making sure that the user can understand what we will and won't be checking. > My thoughts on these two options: > > * I don't think that users will ever want rootdescend verification. That seems too absolute. I think it's fine to say, we don't think that users will want this, so let's not do it by default. But if it's so useless as to not be worth a command-line option, then it was a mistake to put it into contrib at all. Let's expose all the things we have, and try to set the defaults according to what we expect to be most useful. > * heapallindexed is kind of expensive, but valuable. But the extra > check is probably less likely to help on the second or subsequent > index on a table. > > It might be worth considering an option that only uses it with only > one index: Preferably the primary key index, failing that some unique > index, and failing that some other index. This seems a bit too clever for me. I would prefer a simpler schema, where we choose the default we think most people will want and use it for everything -- and allow the user to override. > Even if your user is just average, they still have one major advantage > over the architects of pg_amcheck: actual knowledge of the problem in > front of them. Quite so. > I think that you need to have a kind of epistemic modesty with this > stuff. Okay, we guarantee that the backend won't crash when certain > amcheck functions are run, based on these caveats. But don't we always > guarantee something like that? And are the specific caveats actually > that different in each case, when you get right down to it? A > guarantee does not exist in a vacuum. It always has implicit > limitations. For example, any guarantee implicitly comes with the > caveat "unless I, the guarantor, am wrong". Yep. > I'm also suspicious of guarantees like this for less philosophical > reasons. It seems to me like it solves our problem rather than the > user's problem. Having data that is so badly corrupt that it's > difficult to avoid segfaults when we perform some kind of standard > transformations on it is an appalling state of affairs for the user. > The segfault itself is very much not the point at all. I mostly agree with everything you say here, but I think we need to be careful not to accept the position that seg faults are no big deal. Consider the following users, all of whom start with a database that they believe to be non-corrupt: Alice runs pg_amcheck. It says that nothing is wrong, and that happens to be true. Bob runs pg_amcheck. It says that there are problems, and there are. Carol runs pg_amcheck. It says that nothing is wrong, but in fact something is wrong. Dan runs pg_amcheck. It says that there are problems, but in fact there are none. Erin runs pg_amcheck. The server crashes. Alice and Bob are clearly in the best shape here, but Carol and Dan arguably haven't been harmed very much. Sure, Carol enjoys a false sense of security, but since she otherwise believed things were OK, the impact of whatever problems exist is evidently not that bad. Dan is worrying over nothing, but the damage is only to his psyche, not his database; we can hope he'll eventually sort out what has happened without grave consequences. 
Erin, on the other hand, is very possibly in a lot of trouble with her boss and her coworkers. She had what seemed to be a healthy database, and from their perspective, she shot it in the head without any real cause. It will be faint consolation to her and her coworkers that the database was corrupt all along: until she ran the %$! tool, they did not have a problem that affected the ability of their business to generate revenue. Now they had an outage, and that does. While I obviously haven't seen this exact scenario play out for a customer, because pg_amcheck is not committed, I have seen similar scenarios over and over. It's REALLY bad when the database goes down. Then the application goes down, and then it gets really ugly. As long as the database was just returning wrong answers or eating data, nobody's boss really cared that much, but now that it's down, they care A LOT. This is of course not to say that nobody cares about the accuracy of results from the database: many people care a lot, and that's why it's good to have tools like this. But we should not underestimate the horror caused by a crash. A working database, even with some wrong data in it, is a problem people would probably like to get fixed. A down database is an emergency. So I think we should actually get a lot more serious about ensuring that corrupt data on disk doesn't cause crashes, even for regular SELECT statements. I don't think we can take an arbitrary performance hit to get there, which is a challenge, but I do think that even a brief outage is nothing to take lightly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Nov 19, 2020 at 12:06 PM Robert Haas <robertmhaas@gmail.com> wrote: > I originally intended to review the docs and regression tests in the > same email as the patch itself, but this email has gotten rather long > and taken rather longer to get together than I had hoped, so I'm going > to stop here for now and come back to that stuff. Broad question: Does pg_amcheck belong in src/bin, or in contrib? You have it in the latter place, but I'm not sure if that's the right idea. I'm not saying it *isn't* the right idea, but I'm just wondering what other people think. Now, on to the docs: + Currently, this requires execute privileges on <xref linkend="amcheck"/>'s + <function>bt_index_parent_check</function> and <function>verify_heapam</function> This makes me wonder why there isn't an option to call bt_index_check() rather than bt_index_parent_check(). It doesn't seem to be standard practice to include the entire output of the command's --help option in the documentation. That means as soon as anybody changes anything they've got to change the documentation too. I don't see anything like that in the pages for psql or vacuumlo or pg_verifybackup. It also doesn't seem like a useful thing to do. Anyone who is reading the documentation probably is in a position to try --help if they wish; they don't need that duplicated here. Looking at those other pages, what seems to be typical for an SGML page is to list all the options and give a short paragraph on what each one does. What you have instead is a narrative description. I recommend looking over the reference page for one of those other command-line utilities and adapting it to this case. Back to the code: +static const char * +get_index_relkind_quals(void) +{ + if (!index_relkind_quals) + index_relkind_quals = psprintf("'%c'", RELKIND_INDEX); + return index_relkind_quals; +} I feel like there ought to be a way to work this out at compile time rather than leaving it to runtime. I think that replacing the function body with "return CppAsString2(RELKIND_INDEX);" would have the same result, and once you do that you don't really need the function any more. This is arguably cheating a bit: RELKIND_INDEX is defined as 'i' and CppAsString2() turns that into a string containing those three characters. That happens to work because what we want to do is quote this for use in SQL, and SQL happens to use single quotes for literals just like C does for individual characters. It would be more elegant to figure out a way to interpolate just the character into a C string, but I don't know of a macro trick that will do that. I think one could write char something[] = { '\'', RELKIND_INDEX, '\'', '\0' } but that would be pretty darn awkward for the table case where you want an ANY with three relkinds in there. But maybe you could get around that by changing the query slightly. Suppose instead of relkind = BLAH, you write POSITION(relkind IN '%s') > 0. Then you could just have the caller pass either: char index_relkinds[] = { RELKIND_INDEX, '\0' }; -or- char table_relkinds[] = { RELKIND_RELATION, RELKIND_MATVIEW, RELKIND_TOASTVALUE, '\0' }; The patch actually has RELKIND_PARTITIONED_TABLE there rather than RELKIND_RELATION, but that seems wrong to me, because partitioned tables don't have storage, and toast tables do. And if we're going to include RELKIND_PARTITIONED_TABLE for some reason, then why not RELKIND_PARTITIONED_INDEX for the index case? On the tests: I think 003_check.pl needs to stop and restart the database between populating the tables and corrupting them. 
Otherwise, how do we know that the subsequent checks are going to actually see the corruption rather than something already cached in memory? There are some philosophical questions to consider too, about how these tests are written and what our philosophy ought to be here, but I am again going to push that off to a future email. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
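A small sketch of the POSITION()-based variant suggested above; the relkind letters are hard-coded here to keep the example self-contained, whereas real code would presumably spell them with the RELKIND_* macros from the catalog headers:

    #include <stdio.h>

    static const char index_relkinds[] = "i";    /* RELKIND_INDEX */
    static const char table_relkinds[] = "rmt";  /* plain table, matview, TOAST table */

    /* Build a relkind restriction that works for any number of relkinds. */
    static void
    build_relkind_qual(char *buf, size_t buflen, const char *relkinds)
    {
        snprintf(buf, buflen,
                 " AND POSITION(c.relkind IN '%s') > 0", relkinds);
    }

Because the interpolated string is a constant chosen by the caller, the quoting concerns raised earlier for --startblock and --endblock do not apply here.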
> On Nov 19, 2020, at 11:47 AM, Peter Geoghegan <pg@bowt.ie> wrote: > >> I think in general you're worrying too much about the possibility of >> this tool causing backend crashes. I think it's good that you wrote >> the heapcheck code in a way that's hardened against that, and I think >> we should try to harden other things as time permits. But I don't >> think that the remote possibility of a crash due to the lack of such >> hardening should dictate the design behavior of this tool. If the >> crash possibilities are not remote, then I think the solution is to >> fix them, rather than cutting out important checks. > > I couldn't agree more. Owing to how much run-time overhead it would entail, much of the backend code has not been, and probably will not be, hardened against corruption. The amcheck code uses backend code for accessing heaps and indexes. Only some of those uses can be preceded with sufficient safety checks to avoid stepping on landmines. It makes sense to me to have a "don't run through minefields" option, and a "go ahead, run through minefields" option for pg_amcheck, given that users in differing situations will have differing business consequences to bringing down the server in question. As an example that we've already looked at, checking the status of an xid against clog is a dangerous thing to do. I wrote a patch to make it safer to query clog (0003) and a patch for pg_amcheck to use the safer interface (0004) and it looks unlikely either of those will ever be committed. I doubt other backend hardening is any more likely to get committed. It doesn't follow that if crash possibilities are not remote that we should therefore harden the backend. The performance considerations of the backend are not well aligned with the safety considerations of this tool. The backend code is written with the assumption of non-corrupt data, and this tool with the assumption of corrupt data, or at least a fair probability of corrupt data. I don't see how any one-hardening-fits-all will ever work. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Nov 19, 2020 at 1:50 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > It makes sense to me to have a "don't run through minefields" option, and a "go ahead, run through minefields" option forpg_amcheck, given that users in differing situations will have differing business consequences to bringing down the serverin question. This kind of framing suggests zero-risk bias to me: https://en.wikipedia.org/wiki/Zero-risk_bias It's simply not helpful to think of the risks as "running through a minefield" versus "not running through a minefield". I also dislike this framing because in reality nobody runs through a minefield, unless maybe it's a battlefield and the alternative is probably even worse. Risks are not discrete -- they're continuous. And they're situational. I accept that there are certain reasonable gradations in the degree to which a segfault is bad, even in contexts in which pg_amcheck runs into actual serious problems. And as Robert points out, experience suggests that on average people care about availability the most when push comes to shove (though I hasten to add that that's not the same thing as considering a once-off segfault to be the greater evil here). Even still, I firmly believe that it's a mistake to assign *infinite* weight to not having a segfault. That is likely to have certain unintended consequences that could be even worse than a segfault, such as not detecting pernicious corruption over many months because our can't-segfault version of core functionality fails to have the same bugs as the actual core functionality (and thus fails to detect a problem in the core functionality). The problem with giving infinite weight to any one bad outcome is that it makes it impossible to draw reasonable distinctions between it and some other extreme bad outcome. For example, I would really not like to get infected with Covid-19. But I also think that it would be much worse to get infected with Ebola. It follows that Covid-19 must not be infinitely bad, because if it is then I can't make this useful distinction -- which might actually matter. If somebody hears me say this, and takes it as evidence of my lackadaisical attitude towards Covid-19, I can live with that. I care about avoiding criticism as much as the next person, but I refuse to prioritize it over all other things. > I doubt other backend hardening is any more likely to get committed. I suspect you're right about that. Because of the risks of causing real harm to users. The backend code is obviously *not* written with the assumption that data cannot be corrupt. There are lots of specific ways in which it is hardened (e.g., there are many defensive "can't happen" elog() statements). I really don't know why you insist on this black and white framing. -- Peter Geoghegan
On Tue, Oct 27, 2020 at 5:12 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > The v20 patches 0002, 0003, and 0005 still apply cleanly, but 0004 required a rebase. (0001 was already committed lastweek.) > > Here is a rebased set of 4 patches, numbered 0002..0005 to be consistent with the previous naming. There are no substantialchanges. Hi Mark, The command line stuff fails to build on Windows[1]. I think it's just missing #include "getopt_long.h" (see contrib/vacuumlo/vacuumlo.c). [1] https://ci.appveyor.com/project/postgresql-cfbot/postgresql/build/1.0.123328
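For reference, the pattern used by the existing frontend programs is simply to include the portability wrapper and use getopt_long() as usual; a minimal sketch follows (the option table below is abbreviated and illustrative, not the patch's actual table):

    #include "postgres_fe.h"
    #include "getopt_long.h"        /* portable getopt_long(), Windows included */

    int
    main(int argc, char *argv[])
    {
        static const struct option long_options[] = {
            {"startblock", required_argument, NULL, 'b'},
            {"endblock", required_argument, NULL, 'e'},
            {NULL, 0, NULL, 0}
        };
        int         c;
        int         optindex;

        while ((c = getopt_long(argc, argv, "b:e:", long_options, &optindex)) != -1)
        {
            /* handle each option here */
        }
        return 0;
    }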
> On Nov 19, 2020, at 9:06 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Oct 26, 2020 at 12:12 PM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >> The v20 patches 0002, 0003, and 0005 still apply cleanly, but 0004 required a rebase. (0001 was already committed last week.) >> >> Here is a rebased set of 4 patches, numbered 0002..0005 to be consistent with the previous naming. There are no substantial changes. > > Here's a review of 0002. I basically like the direction this is going > but I guess nobody will be surprised that there are some things in > here that I think could be improved. Thanks for the review! The tools pg_dump and pg_amcheck both need to allow the user to specify which schemas, tables, and indexes either to dump or to check. There are command line options in pg_dump for this purpose, and functions for compiling lists of corresponding database objects. In prior versions of the pg_amcheck patch, I did some copy-and-pasting of this logic, and then had to fix up the copied functions a bit, given that pg_dump has its own ecosystem with things like fatal() and exit_nicely() and such. In hindsight, it would have been better to factor these functions out into a shared location. I have done that, factoring them into fe_utils, and am attaching a series of patches that accomplishes that refactoring. Here are some brief explanations of what these are for. See also the commit comments in each patch: v3-0001-Moving-exit_nicely-and-fatal-into-fe_utils.patch pg_dump allows on-exit callbacks to be registered, which it expects to get called when exit_nicely() is invoked. It doesn't work to factor functions out of pg_dump without having this infrastructure, as the functions being factored out include facilities for logging and exiting on error. Therefore, moving these functions into fe_utils. v3-0002-Refactoring-ExecuteSqlQuery-and-related-functions.patch pg_dump has functions for running queries, but those functions take a pg_dump specific argument of type Archive rather than PGconn, with the expectation that the Archive's connection will be used. This has to be cleaned up a bit before these functions can be moved out of pg_dump to a shared location. Also, pg_dump has a fixed expectation that when a query fails, specific steps will be taken to print out the error information and exit. That's reasonable behavior, but not all callers will want that. Since the ultimate goal of this refactoring is to have higher level functions that translate shell patterns into oid lists, it's reasonable to imagine that not all callers will want to exit if the query fails. In particular, pg_amcheck won't want errors to automatically trigger exit() calls, given that pg_amcheck tries to continue in the face of errors. Therefore, adding a default error handler that does what pg_dump expects, but with an eye towards other callers being able to define handlers that behave differently. v3-0003-Creating-query_utils-frontend-utility.patch Moving the refactored functions to the shared location in fe_utils. This is kept separate from 0002 for ease of review. v3-0004-Adding-CurrentQueryHandler-logic.patch Extending the query error handling logic begun in the 0002 patch. It wasn't appropriate in the pg_dump project, but now the logic is in fe_utils. v3-0005-Refactoring-pg_dumpall-functions.patch Refactoring some remaining functions in the pg_dump project to use the new fe_utils facilities. 
v3-0006-Refactoring-expand_schema_name_patterns-and-frien.patch Refactoring functions in pg_dump that expand a list of patterns into a list of matching database objects. Specifically, changing them to not take pg_dump specific argument types, just as was done in 0002. v3-0007-Moving-pg_dump-functions-to-new-file-option_utils.patch Moving the functions refactored in 0006 into a new location, fe_utils/option_utils. v3-0008-Normalizing-option_utils-interface.patch Reworking the functions moved in 0007 to have a more general purpose interface. The refactoring in 0006 only went so far as to make the functions moveable out of pg_dump. This refactoring is intentionally kept separate for ease of review. v3-0009-Adding-contrib-module-pg_amcheck.patch Adding contrib/pg_amcheck project, about which your review comments below apply. Not included in this patch set, but generated during the development of this patch set, I refactored processSQLNamePattern. string_utils mixes the logic for converting a shell-style pattern into a SQL style regex with the logic of performing the sql query to look up matching database objects. That makes it hard to look up multiple patterns in a single query, something that an intermediate version of this patch set was doing. I ultimately stopped doing that, as the code was overly complex, but the refactoring of processSQLNamePattern is not over-complicated and probably has some merit in its own right. Since it is not related to the pg_amcheck code, I expect that I will be posting that separately. Also not included in this patch set, but likely to be in the next rev, is a patch that adds more interesting table and index corruption via PostgresNode, creating torn pages and such. That work is complete so far as I know, but I don't have all the regression tests that use it written yet, so I'll hold off posting it for now. Not yet written but still needed is the parallelization of the checking. I'll be working on that for the next patch set. There is enough work here in need of review that I'm posting this now, hoping to get feedback on the general direction I'm going with this. To your review.... > > +const char *usage_text[] = { > + "pg_amcheck is the PostgreSQL command line frontend for the > amcheck database corruption checker.", > + "", > > This looks like a novel approach to the problem of printing out the > usage() information, and I think that it's inferior to the technique > used elsewhere of just having a bunch of printf() statements, because > unless I misunderstand, it doesn't permit localization. Since contrib modules are not localized, it seemed not to be a problem, but you've raised the question of whether pg_amcheck might be moved into core. I've changed it as suggested so that such a move would incur less code churn. The advantage to how I had it before was that each line was a bit shorter, making it fit better into the 80 column limit. > + " -b, --startblock begin checking table(s) at the > given starting block number", > + " -e, --endblock check table(s) only up to the > given ending block number", > + " -B, --toast-startblock begin checking toast table(s) > at the given starting block", > + " -E, --toast-endblock check toast table(s) only up > to the given ending block", > > I am not very convinced by this. What's the use case? If you're just > checking a single table, you might want to specify a start and end > block, but then you don't need separate options for the TOAST and > non-TOAST cases, do you? 
If I want to check pg_statistic, I'll say > pg_amcheck -t pg_catalog.pg_statistic. If I want to check the TOAST > table for pg_statistic, I'll say pg_amcheck -t pg_toast.pg_toast_2619. > In either case, if I want to check just the first three blocks, I can > add -b 0 -e 2. Removed -B, --toast-startblock and -E, --toast-endblock. > > + " -f, --skip-all-frozen do NOT check blocks marked as > all frozen", > + " -v, --skip-all-visible do NOT check blocks marked as > all visible", > > I think this is using up too many one character option names for too > little benefit on things that are too closely related. How about, -s, > --skip=all-frozen|all-visible|none? I'm already using -s for "strict-names", but I implemented your suggestion with -S, --skip. > And then -v could mean verbose, > which could trigger things like printing all the queries sent to the > server, setting PQERRORS_VERBOSE, etc. I added -v, --verbose as you suggest. > + " -x, --check-indexes check btree indexes associated > with tables being checked", > + " -X, --skip-indexes do NOT check any btree indexes", > + " -i, --index=PATTERN check the specified index(es) only", > + " -I, --exclude-index=PATTERN do NOT check the specified index(es)", > > This is a lotta controls for something that has gotta have some > default. Either the default is everything, in which case I don't see > why I need -x, or it's nothing, in which case I don't see why I need > -X. I removed -x, --check-indexes and instead made that the default. > > + " -c, --check-corrupt check indexes even if their > associated table is corrupt", > + " -C, --skip-corrupt do NOT check indexes if their > associated table is corrupt", > > Ditto. (I think the default be to check corrupt, and there can be an > option to skip it.) Likewise, I removed -c, --check-corrupt and made that the default. > + " -a, --heapallindexed check index tuples against the > table tuples", > + " -A, --no-heapallindexed do NOT check index tuples > against the table tuples", > > Ditto. (Not sure what the default should be, though.) I removed -A, --no-heapallindexed and made that the default. > > + " -r, --rootdescend search from the root page for > each index tuple", > + " -R, --no-rootdescend do NOT search from the root > page for each index tuple", > > Ditto. (Again, not sure about the default.) I removed -R, --no-rootdescend and made that the default. Peter argued elsewhere for removing this altogether, but as I recall you argued against that, so for now I'm keeping the --rootdescend option. > I'm also not sure if these descriptions are clear enough, but it may > also be hard to do a good job in a brief space. Yes. Better verbiage welcome. > Still, comparing this > to the documentation of heapallindexed makes me rather nervous. This > is only trying to verify that the index contains all the tuples in the > heap, not that the values in the heap and index tuples actually match. This is complicated. The most reasonable approach from the point of view of somebody running pg_amcheck is to have the scan of the table and the scan of the index cooperate so that work is not duplicated. But from the point of view of amcheck (not pg_amcheck), there is no assumption that the table is being scanned just because the index is being checked. I'm not sure how best to resolve this, except that I'd rather punt this to a future version rather than require the first version of pg_amcheck to deal with it. 
> +typedef struct > +AmCheckSettings > +{ > + char *dbname; > + char *host; > + char *port; > + char *username; > +} ConnectOptions; > > Making the struct name different from the type name seems not good, > and the struct name also shouldn't be on a separate line. Fixed. > +typedef enum trivalue > +{ > + TRI_DEFAULT, > + TRI_NO, > + TRI_YES > +} trivalue; > > Ugh. It's not this patch's fault, but we really oughta move this to > someplace more centralized. Not changed in this patch. > +typedef struct > ... > +} AmCheckSettings; > > I'm not sure I consider all of these things settings, "db" in > particular. But maybe that's nitpicking. It is definitely nitpicking, but I agree with it. This next patch uses a static variable named "conn" rather than "settings.db". > +static void expand_schema_name_patterns(const SimpleStringList *patterns, > + > const SimpleOidList *exclude_oids, > + > SimpleOidList *oids > + > bool strict_names); > > This is copied from pg_dump, along with I think at least one other > function from nearby. Unlike the trivalue case above, this would be > the first duplication of this logic. Can we push this stuff into > pgcommon, perhaps? Yes, these functions were largely copied from pg_dump. I have moved them out of pg_dump and into fe_utils, but that was a large enough effort that it deserves its own thread, so I'm creating a thread for that work independent of this thread. > + /* > + * Default behaviors for user settable options. Note that these default > + * to doing all the safe checks and none of the unsafe ones, > on the theory > + * that if a user says "pg_amcheck mydb" without specifying > any additional > + * options, we should check everything we know how to check without > + * risking any backend aborts. > + */ > > This to me seems too conservative. The result is that by default we > check only tables, not indexes. I don't think that's going to be what > users want. Checking indexes has been made the default, as discussed above. > I don't know whether they want the heapallindexed or > rootdescend behaviors for index checks, but I think they want their > indexes checked. Happy to hear opinions from actual users on what they > want; this is just me guessing that you've guessed wrong. :-) The heapallindexed and rootdescend options still exist but are false by default. > + if (settings.db == NULL) > + { > + pg_log_error("no connection to server after > initial attempt"); > + exit(EXIT_BADCONN); > + } > > I think this is documented as meaning out of memory, and reported that > way elsewhere. Anyway I am going to keep complaining until there are > no cases where we tell the user it broke without telling them what > broke. Which means this bit is a problem too: > > + if (!settings.db) > + { > + pg_log_error("no connection to server"); > + exit(EXIT_BADCONN); > + } > > Something went wrong, good luck figuring out what it was! I have changed this to more closely follow the behavior in scripts/common.c:connectDatabase. If pg_amcheck were moved into src/bin/scripts, I could just use that function outright. > + /* > + * All information about corrupt indexes are returned via > ereport, not as > + * tuples. We want all the details to report if corruption exists. > + */ > + PQsetErrorVerbosity(settings.db, PQERRORS_VERBOSE); > > Really? Why? If I need the source code file name, function name, and > line number to figure out what went wrong, that is not a great sign > for the quality of the error reports it produces. Yeah, you are right about that. 
In any event, the user can now specify --verbose if they like and get that extra information (not that they need it). I have removed this offending bit of code. > + /* > + * The btree checking logic which optionally > checks the contents > + * of an index against the corresponding table > has not yet been > + * sufficiently hardened against corrupt > tables. In particular, > + * when called with heapallindexed true, it > segfaults if the file > + * backing the table relation has been > erroneously unlinked. In > + * any event, it seems unwise to reconcile an > index against its > + * table when we already know the table is corrupt. > + */ > + old_heapallindexed = settings.heapallindexed; > + if (corruptions) > + settings.heapallindexed = false; > > This seems pretty lame to me. Even if the btree checker can't tolerate > corruption to the extent that the heap checker does, seg faulting > because of a missing file seems like a bug that we should just fix > (and probably back-patch). I'm not very convinced by the decision to > override the user's decision about heapallindexed either. Maybe I lack > imagination, but that seems pretty arbitrary. Suppose there's a giant > index which is missing entries for 5 million heap tuples and also > there's 1 entry in the table which has an xmin that is less than the > pg_clas.relfrozenxid value by 1. You are proposing that because I have > the latter problem I don't want you to check for the former one. But > I, John Q. Smartuser, do not want you to second-guess what I told you > on the command line that I wanted. :-) I've removed this bit. I'm not sure what I was seeing back when I first wrote this code, but I no longer see any segfaults for missing relation files. > I think in general you're worrying too much about the possibility of > this tool causing backend crashes. I think it's good that you wrote > the heapcheck code in a way that's hardened against that, and I think > we should try to harden other things as time permits. But I don't > think that the remote possibility of a crash due to the lack of such > hardening should dictate the design behavior of this tool. If the > crash possibilities are not remote, then I think the solution is to > fix them, rather than cutting out important checks. Right. I've been worrying a bit less about this lately, in part because you and Peter are less concerned about it than I was, and in part because I've been banging away with various test cases and don't see all that much worth worrying about. > It doesn't seem like great design to me that get_table_check_list() > gets just the OID of the table itself, and then later if we decide to > check the TOAST table we've got to run a separate query for each table > we want to check to fetch the TOAST OID, when we could've just fetched > both in get_table_check_list() by including two columns in the query > rather than one and it would've been basically free. Imagine if some > user wrote a query that fetched the primary key value for all their > rows and then had their application run a separate query to fetch the > entire contents of each of those rows, said contents consisting of one > more integer. And then suppose they complained about performance. We'd > tell them they were doing it wrong, and so here. Good points. I've changed get_table_check_list to query both the main table and toast table oids as you suggest. > + if (settings.db == NULL) > + fatal("no connection on entry to check_table"); > > Uninformative. Is this basically an Assert? 
If so maybe just make it > one. If not maybe fail somewhere else with a better message? Looking at this again, I don't think it is even worth making it into an Assert, so I just removed it, along with similar useless checks of the same type elsewhere. > > + if (startblock == NULL) > + startblock = "NULL"; > + if (endblock == NULL) > + endblock = "NULL"; > > It seems like it would be more elegant to initialize > settings.startblock and settings.endblock to "NULL." However, there's > also a related problem, which is that the startblock and endblock > values can be anything, and are interpolated with quoting. I don't > think that it's good to ship a tool with SQL injection hazards built > into it. I think that you should (a) check that these values are > integers during argument parsing and error out if they are not and > then (b) use either a prepared query or PQescapeLiteral() anyway. I've changed the logic to use strtol to parse these, and I'm storing them as long rather than as strings. > + stop = (on_error_stop) ? "true" : "false"; > + toast = (check_toast) ? "true" : "false"; > > The parens aren't really needed here. True. Removed. > + > printf("(relname=%s,blkno=%s,offnum=%s,attnum=%s)\n%s\n", > + PQgetvalue(res, i, 0), /* relname */ > + PQgetvalue(res, i, 1), /* blkno */ > + PQgetvalue(res, i, 2), /* offnum */ > + PQgetvalue(res, i, 3), /* attnum */ > + PQgetvalue(res, i, 4)); /* msg */ > > I am not quite sure how to format the output, but this looks like > something designed by an engineer who knows too much about the topic. > I suspect users won't find the use of things like "relname" and > "blkno" too easy to understand. At least I think we should say > "relation, block, offset, attribute" instead of "relname, blkno, > offnum, attnum". I would probably drop the parenthesis and add spaces, > so that you end up with something like: > > relation "%s", block "%s", offset "%s", attribute "%s": > > I would also define variant strings so that we entirely omit things > that are NULL. e.g. have four strings: > > relation "%s": > relation "%s", block "%s":( > relation "%s", block "%s", offset "%s": > relation "%s", block "%s", offset "%s", attribute "%s": > > Would it make it more readable if we indented the continuation line by > four spaces or something? I tried it that way and agree it looks better, including having the msg line indented four spaces. Changed. > + corruption_cnt++; > + printf("%s\n", error); > + pfree(error); > > Seems like we could still print the relation name in this case, and > that it would be a good idea to do so, in case it's not in the message > that the server returns. We don't know the relation name in this case, only the oid, but I agree that would be useful to have, so I added that. > The general logic in this part of the code looks a bit strange to me. > If ExecuteSqlQuery() returns PGRES_TUPLES_OK, we print out the details > for each returned row. Otherwise, if error = true, we print the error. > But, what if neither of those things are the case? Then we'd just > print nothing despite having gotten back some weird response from the > server. That actually can't happen, because ExecuteSqlQuery() always > sets *error when the return code is not PGRES_TUPLES_OK, but you > wouldn't know that from looking at this code. > > Honestly, as written, ExecSqlQuery() seems like kind of a waste. The > OrDie() version is useful as a notational shorthand, but this version > seems to add more confusion than clarity. 
It has only three callers: > the ones in check_table() and check_indexes() have the problem > described above, and the one in get_toast_oid() could just as well be > using the OrDie() version. And also we should probably get rid of it > entirely by fetching the toast OIDs the first time around, as > mentioned above. These functions have been factored out of pg_dump into fe_utils, so this bit of code review doesn't refer to anything now. > check_indexes() lacks a function comment. It seems to have more or > less the same problem as get_toast_oid() -- an extra query per table > to get the list of indexes. I guess it has a better excuse: there > could be lots of indexes per table, and we're fetching multiple > columns of data for each one, whereas in the TOAST case we are issuing > an extra query per table to fetch a single integer. But, couldn't we > fetch information about all the indexes we want to check in one go, > rather than fetching them separately for each table being checked? I'm > not sure if that would create too much other complexity, but it seems > like it would be quicker. If the --skip-corrupt option is given, we need to only check the indexes associated with a table once the table has been found to be non-corrupt. Querying for all the indexes upfront, we'd need to keep information about which table the index came from, and check that against lists of tables that have been checked, etc. It seems pretty messy, even more so when considering the limited list facilities available to frontend code. I have made no changes in this version, though I'm not rejecting your idea here. Maybe I'll think of a clean way to do this for a later patch? > + if (settings.db == NULL) > + fatal("no connection on entry to check_index"); > + if (idxname == NULL) > + fatal("no index name on entry to check_index"); > + if (tblname == NULL) > + fatal("no table name on entry to check_index"); > > Again, probably these should be asserts, or if they're not, the error > should be reported better and maybe elsewhere. > > Similarly in some other places, like expand_schema_name_patterns(). I removed these checks entirely. > + * The loop below runs multiple SELECTs might sometimes result in > + * duplicate entries in the Oid list, but we don't care. > > This is missing a which, like the place you copied it from, but the > version in pg_dumpall.c is better. > > expand_table_name_patterns() should be reformatted to not gratuitously > exceed 80 columns. Ditto for expand_index_name_patterns(). Refactoring into fe_utils, as mentioned above. > I sort of expected that this patch might use threads to allow parallel > checking - seems like it would be a useful feature. Yes, I think that makes sense, but I'm going to work on that in the next patch. > I originally intended to review the docs and regression tests in the > same email as the patch itself, but this email has gotten rather long > and taken rather longer to get together than I had hoped, so I'm going > to stop here for now and come back to that stuff. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
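To make the single-query suggestion concrete, a minimal libpq sketch might look like the following (hypothetical code, not the actual patch; it assumes an already-open connection and that every ordinary table is of interest):

/* Hypothetical sketch: fetch each table's OID and its TOAST table's OID in
 * one catalog query, so no per-table follow-up query is needed. */
#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

static void
get_table_check_list(PGconn *conn)
{
    PGresult   *res;
    int         i;

    res = PQexec(conn,
                 "SELECT c.oid, c.reltoastrelid"
                 " FROM pg_catalog.pg_class c"
                 " WHERE c.relkind = 'r'");
    if (PQresultStatus(res) != PGRES_TUPLES_OK)
    {
        fprintf(stderr, "query failed: %s", PQerrorMessage(conn));
        PQclear(res);
        exit(1);
    }
    for (i = 0; i < PQntuples(res); i++)
    {
        const char *reloid = PQgetvalue(res, i, 0);
        const char *toastoid = PQgetvalue(res, i, 1);   /* "0" if none */

        /* remember both OIDs; checking TOAST later costs no extra query */
        printf("table %s (toast %s)\n", reloid, toastoid);
    }
    PQclear(res);
}

Carrying reltoastrelid along in the same result set is what makes the later TOAST check free of additional round trips.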
Attachment
- v3-0001-Moving-exit_nicely-and-fatal-into-fe_utils.patch
- v3-0002-Refactoring-ExecuteSqlQuery-and-related-functions.patch
- v3-0003-Creating-query_utils-frontend-utility.patch
- v3-0004-Adding-CurrentQueryHandler-logic.patch
- v3-0005-Refactoring-pg_dumpall-functions.patch
- v3-0006-Refactoring-expand_schema_name_patterns-and-frien.patch
- v3-0007-Moving-pg_dump-functions-to-new-file-option_utils.patch
- v3-0008-Normalizing-option_utils-interface.patch
- v3-0009-Adding-contrib-module-pg_amcheck.patch
> On Jan 6, 2021, at 11:05 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > I have done that, factoring them into fe_utils, and am attaching a series of patches that accomplishes that refactoring. The previous set should have been named v30, not v3. My apologies for any confusion. The attached patches, v31, are mostly the same, but with "getopt_long.h" included from pg_amcheck.c per Thomas's review, and a .gitignore file added in contrib/pg_amcheck/ — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- v31-0001-Moving-exit_nicely-and-fatal-into-fe_utils.patch
- v31-0002-Refactoring-ExecuteSqlQuery-and-related-function.patch
- v31-0003-Creating-query_utils-frontend-utility.patch
- v31-0004-Adding-CurrentQueryHandler-logic.patch
- v31-0005-Refactoring-pg_dumpall-functions.patch
- v31-0006-Refactoring-expand_schema_name_patterns-and-frie.patch
- v31-0007-Moving-pg_dump-functions-to-new-file-option_util.patch
- v31-0008-Normalizing-option_utils-interface.patch
- v31-0009-Adding-contrib-module-pg_amcheck.patch
> On Nov 19, 2020, at 11:47 AM, Peter Geoghegan <pg@bowt.ie> wrote: > > On Thu, Nov 19, 2020 at 9:06 AM Robert Haas <robertmhaas@gmail.com> wrote: >> I'm also not sure if these descriptions are clear enough, but it may >> also be hard to do a good job in a brief space. Still, comparing this >> to the documentation of heapallindexed makes me rather nervous. This >> is only trying to verify that the index contains all the tuples in the >> heap, not that the values in the heap and index tuples actually match. > > That's a good point. As things stand, heapallindexed verification does > not notice when there are extra index tuples in the index that are in > some way inconsistent with the heap. Hopefully this isn't too much of > a problem in practice because the presence of extra spurious tuples > gets detected by the index structure verification process. But in > general that might not happen. > > Ideally heapallindex verification would verify 1:1 correspondence. It > doesn't do that right now, but it could. > > This could work by having two bloom filters -- one for the heap, > another for the index. The implementation would look for the absence > of index tuples that should be in the index initially, just like > today. But at the end it would modify the index bloom filter by &= it > with the complement of the heap bloom filter. If any bits are left set > in the index bloom filter, we go back through the index once more and > locate index tuples that have at least some matching bits in the index > bloom filter (we cannot expect all of the bits from each of the hash > functions used by the bloom filter to still be matches). > > From here we can do some kind of lookup for maybe-not-matching index > tuples that we locate. Make sure that they point to an LP_DEAD line > item in the heap or something. Make sure that they have the same > values as the heap tuple if they're still retrievable (i.e. if we > haven't pruned the heap tuple away already). This approach sounds very good to me, but beyond the scope of what I'm planning for this release cycle. >> This to me seems too conservative. The result is that by default we >> check only tables, not indexes. I don't think that's going to be what >> users want. I don't know whether they want the heapallindexed or >> rootdescend behaviors for index checks, but I think they want their >> indexes checked. Happy to hear opinions from actual users on what they >> want; this is just me guessing that you've guessed wrong. :-) > > My thoughts on these two options: > > * I don't think that users will ever want rootdescend verification. > > That option exists now because I wanted to have something that relied > on the uniqueness property of B-Tree indexes following the Postgres 12 > work. I didn't add retail index tuple deletion, so it seemed like a > good idea to have something that makes the same assumptions that it > would have to make. To validate the design. > > Another factor is that Alexander Korotkov made the basic > bt_index_parent_check() tests a lot better for Postgres 13. This > undermined the practical argument for using rootdescend verification. The latest version of the patch has rootdescend off by default, but a switch to turn it on. The documentation for that switchin doc/src/sgml/pgamcheck.sgml summarizes your comments: + This form of verification was originally written to help in the + development of btree index features. It may be of limited or even of no + use in helping detect the kinds of corruption that occur in practice. 
+ In any event, it is known to be a rather expensive check to perform. For my own self, I don't care if rootdescend is an option in pg_amcheck. You and Robert expressed somewhat different opinions, and I tried to split the difference. I'm happy to go a different direction if that's what the consensus is. > Finally, note that bt_index_parent_check() was always supposed to be > something that was to be used only when you already knew that you had > big problems, and wanted absolutely thorough verification without > regard for the costs. This isn't the common case at all. It would be > reasonable to not expose anything from bt_index_parent_check() at all, > or to give it much less prominence. Not really sure of what the right > balance is here myself, so I'm not insisting on anything. Just telling > you what I know about it. This still needs work. Currently, there is a switch to turn off index checking, with the checks on by default. But there is no switch controlling which kind of check is performed (bt_index_check vs. bt_index_parent_check). Making matters more complicated, selecting both rootdescend and bt_index_check wouldn't make sense, as there is no rootdescend option on that function. So users would need multiple flags to turn on various options, with some flag combinations drawing an error about the flags not being mutually compatible. That's doable, but people may not like that interface. > * heapallindexed is kind of expensive, but valuable. But the extra > check is probably less likely to help on the second or subsequent > index on a table. There is a switch for enabling this. It is off by default. > It might be worth considering an option that only uses it with only > one index: Preferably the primary key index, failing that some unique > index, and failing that some other index. It might make sense for somebody to submit this for a later release. I don't have any plans to work on this during this release cycle. >> I'm not very convinced by the decision to >> override the user's decision about heapallindexed either. > > I strongly agree. I have removed the override. > >> Maybe I lack >> imagination, but that seems pretty arbitrary. Suppose there's a giant >> index which is missing entries for 5 million heap tuples and also >> there's 1 entry in the table which has an xmin that is less than the >> pg_clas.relfrozenxid value by 1. You are proposing that because I have >> the latter problem I don't want you to check for the former one. But >> I, John Q. Smartuser, do not want you to second-guess what I told you >> on the command line that I wanted. :-) > > Even if your user is just average, they still have one major advantage > over the architects of pg_amcheck: actual knowledge of the problem in > front of them. There is a switch for skipping index checks on corrupt tables. By default, the indexes will be checked. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
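For readers trying to picture the two-bloom-filter idea quoted above, here is a rough, self-contained sketch (hypothetical code, not part of amcheck; the hash mixing and filter size are arbitrary). Heap tuples and index tuples each populate a filter, the heap filter's bits are then cleared out of the index filter, and any index tuple that still hits a remaining bit becomes a candidate for rechecking against the heap:

#include <stdbool.h>
#include <stdint.h>

#define BLOOM_NBITS (1024 * 64)     /* 64 Kbit filter, for illustration */

typedef struct bloom
{
    uint64_t    words[BLOOM_NBITS / 64];
} bloom;

/* derive two bit positions from one 64-bit key (splitmix64-style mixing) */
static void
bloom_positions(uint64_t key, uint32_t *p1, uint32_t *p2)
{
    key += UINT64_C(0x9e3779b97f4a7c15);
    key = (key ^ (key >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
    key = (key ^ (key >> 27)) * UINT64_C(0x94d049bb133111eb);
    key ^= key >> 31;
    *p1 = (uint32_t) (key % BLOOM_NBITS);
    *p2 = (uint32_t) ((key >> 32) % BLOOM_NBITS);
}

/* add a (hashed) tuple to a filter; used for both the heap and index passes */
static void
bloom_add(bloom *b, uint64_t key)
{
    uint32_t    p1, p2;

    bloom_positions(key, &p1, &p2);
    b->words[p1 / 64] |= UINT64_C(1) << (p1 % 64);
    b->words[p2 / 64] |= UINT64_C(1) << (p2 % 64);
}

/* clear from the index filter every bit that is also set in the heap filter */
static void
bloom_subtract(bloom *indexbf, const bloom *heapbf)
{
    int         i;

    for (i = 0; i < BLOOM_NBITS / 64; i++)
        indexbf->words[i] &= ~heapbf->words[i];
}

/* after subtraction: does this index tuple still hit any residual bit? */
static bool
bloom_partially_matches(const bloom *residual, uint64_t key)
{
    uint32_t    p1, p2;

    bloom_positions(key, &p1, &p2);
    return ((residual->words[p1 / 64] >> (p1 % 64)) & 1) != 0 ||
           ((residual->words[p2 / 64] >> (p2 % 64)) & 1) != 0;
}

A second pass over the index would then inspect only the tuples for which bloom_partially_matches() returns true, for example confirming that they point at a pruned line pointer or still agree with the retrievable heap values.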
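Separately, the flag interaction described above comes down to the fact that only bt_index_parent_check() takes a rootdescend argument, so the per-index query has to be built in one of two shapes. A hypothetical sketch (not the actual pg_amcheck code, and ignoring schema qualification of the amcheck functions):

#include <stdbool.h>
#include <stdio.h>

static void
append_btree_check(char *buf, size_t bufsize, unsigned int indexoid,
                   bool parent_check, bool heapallindexed, bool rootdescend)
{
    if (parent_check)
        snprintf(buf, bufsize,
                 "SELECT bt_index_parent_check('%u'::pg_catalog.regclass, %s, %s)",
                 indexoid,
                 heapallindexed ? "true" : "false",
                 rootdescend ? "true" : "false");
    else
        /* bt_index_check() has no rootdescend argument */
        snprintf(buf, bufsize,
                 "SELECT bt_index_check('%u'::pg_catalog.regclass, %s)",
                 indexoid,
                 heapallindexed ? "true" : "false");
}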
On Fri, Jan 8, 2021 at 6:33 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > The attached patches, v31, are mostly the same, but with "getopt_long.h" included from pg_amcheck.c per Thomas's review, and a .gitignore file added in contrib/pg_amcheck/ A couple more little things from Windows CI: C:\projects\postgresql\src\include\fe_utils/option_utils.h(19): fatal error C1083: Cannot open include file: 'libpq-fe.h': No such file or directory [C:\projects\postgresql\pg_amcheck.vcxproj] Does contrib/amcheck/Makefile need to say "SHLIB_PREREQS = submake-libpq" like other contrib modules that use libpq? pg_backup_utils.obj : error LNK2001: unresolved external symbol exit_nicely [C:\projects\postgresql\pg_dump.vcxproj] I think this is probably because additions to src/fe_utils/Makefile's OBJS list need to be manually replicated in src/tools/msvc/Mkvcbuild.pm's @pgfeutilsfiles list. (If I'm right about that, perhaps it needs a comment to remind us Unix hackers of that, or perhaps it should be automated...)
> On Jan 10, 2021, at 12:41 PM, Thomas Munro <thomas.munro@gmail.com> wrote: > > On Fri, Jan 8, 2021 at 6:33 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: >> The attached patches, v31, are mostly the same, but with "getopt_long.h" included from pg_amcheck.c per Thomas's review,and a .gitignore file added in contrib/pg_amcheck/ > > I couple more little things from Windows CI: > > C:\projects\postgresql\src\include\fe_utils/option_utils.h(19): > fatal error C1083: Cannot open include file: 'libpq-fe.h': No such > file or directory [C:\projects\postgresql\pg_amcheck.vcxproj] > > Does contrib/amcheck/Makefile need to say "SHLIB_PREREQS = > submake-libpq" like other contrib modules that use libpq? Added in v32. > pg_backup_utils.obj : error LNK2001: unresolved external symbol > exit_nicely [C:\projects\postgresql\pg_dump.vcxproj] > > I think this is probably because additions to src/fe_utils/Makefile's > OBJS list need to be manually replicated in > src/tools/msvc/Mkvcbuild.pm's @pgfeutilsfiles list. (If I'm right > about that, perhaps it needs a comment to remind us Unix hackers of > that, or perhaps it should be automated...) Added in v32, along with adding pg_amcheck to @contrib_uselibpq, @contrib_uselibpgport, and @contrib_uselibpgcommon There are also a few additions in v32 to typedefs.list, and some whitespace changes due to running pgindent. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- v32-0001-Moving-exit_nicely-and-fatal-into-fe_utils.patch
- v32-0002-Refactoring-ExecuteSqlQuery-and-related-function.patch
- v32-0003-Creating-query_utils-frontend-utility.patch
- v32-0004-Adding-CurrentQueryHandler-logic.patch
- v32-0005-Refactoring-pg_dumpall-functions.patch
- v32-0006-Refactoring-expand_schema_name_patterns-and-frie.patch
- v32-0007-Moving-pg_dump-functions-to-new-file-option_util.patch
- v32-0008-Normalizing-option_utils-interface.patch
- v32-0009-Adding-contrib-module-pg_amcheck.patch
On Mon, Jan 11, 2021 at 1:16 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > Added in v32, along with adding pg_amcheck to @contrib_uselibpq, @contrib_uselibpgport, and @contrib_uselibpgcommon exit_utils.c fails to achieve the goal of making this code independent of pg_dump, because of: #ifdef WIN32 if (parallel_init_done && GetCurrentThreadId() != mainThreadId) _endthreadex(code); #endif parallel_init_done is a pg_dump-ism. Perhaps this chunk of code could be a handler that gets registered using exit_nicely() rather than hard-coded like this. Note that the function comments for exit_nicely() are heavily implicated in this problem, since they also apply to stuff that only happens in pg_dump and not other utilities. I'm skeptical about the idea of putting functions into string_utils.c with names as generic as include_filter() and exclude_filter(). Existing cases like fmtId() and fmtQualifiedId() are not great either, but I think this is worse and that we should do some renaming. On a related note, it's not clear to me why these should be classified as string_utils while stuff like expand_schema_name_patterns() gets classified as option_utils. These are neither generic string-processing functions nor are they generic options-parsing functions. They are functions for expanding shell-glob style patterns for database object names. And they seem like they ought to be together, because they seem to do closely-related things. I'm open to an argument that this is wrongheaded on my part, but it looks weird to me the way it is. I'm pretty unimpressed by query_utils.c. The CurrentResultHandler stuff looks grotty, and you don't seem to really use it anywhere. And it seems woefully overambitious to me anyway: this doesn't apply to every kind of "result" we've got hanging around, absolutely nothing even close to that, even though a name like CurrentResultHandler sounds very broad. It also means more global variables, which is a thing of which the PostgreSQL codebase already has a deplorable oversupply. quiet_handler() and noop_handler() aren't used anywhere either, AFAICS. I wonder if it would be better to pass in callbacks rather than relying on global variables. e.g.: typedef void (*fatal_error_callback)(const char *fmt,...) pg_attribute_printf(1, 2) pg_attribute_noreturn(); Then you could have a few helper functions that take an argument of type fatal_error_callback and throw the right fatal error for (a) wrong PQresultStatus() and (b) result is not one row. Do you need any other cases? exiting_handler() seems to think that the caller might want to allow any number of tuples, or any positive number, or any particular cout, but I'm not sure if all of those cases are really needed. This stuff is finnicky and hard to get right. You don't really want to create a situation where the same code keeps getting duplicated, or the behavior's just a little bit inconsistent everywhere, but it also isn't great to build layers upon layers of abstraction around something like ExecuteSqlQuery which is, in the end, a four-line function. I don't think there's any problem with something like pg_dump having it's own function to execute-a-query-or-die. Maybe that function ends up doing something like TheGenericFunctionToExecuteOrDie(my_die_fn, the_query), or maybe pg_dump can just open-code it but have a my_die_fn to pass down to the glob-expansion stuff, or, well, I don't know. -- Robert Haas EDB: http://www.enterprisedb.com
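A minimal sketch of the callback idea above, with hypothetical names and the pg_attribute_printf/pg_attribute_noreturn decorations omitted so it stands alone:

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

typedef void (*fatal_error_callback) (const char *fmt,...);

static void
my_die_fn(const char *fmt,...)
{
    va_list     ap;

    va_start(ap, fmt);
    vfprintf(stderr, fmt, ap);
    va_end(ap);
    exit(1);
}

/* generic execute-or-die helper; assumes the query is a SELECT */
static PGresult *
execute_or_die(PGconn *conn, const char *query, fatal_error_callback die_fn)
{
    PGresult   *res = PQexec(conn, query);

    if (PQresultStatus(res) != PGRES_TUPLES_OK)
        die_fn("query failed: %s", PQerrorMessage(conn));
    return res;
}

A second helper taking the same callback could enforce the one-row expectation; the point is only that the caller, not a global variable, decides how to die.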
> On Jan 14, 2021, at 1:13 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Mon, Jan 11, 2021 at 1:16 PM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >> Added in v32, along with adding pg_amcheck to @contrib_uselibpq, @contrib_uselibpgport, and @contrib_uselibpgcommon > > exit_utils.c fails to achieve the goal of making this code independent > of pg_dump, because of: > > #ifdef WIN32 > if (parallel_init_done && GetCurrentThreadId() != mainThreadId) > _endthreadex(code); > #endif > > parallel_init_done is a pg_dump-ism. Perhaps this chunk of code could > be a handler that gets registered using exit_nicely() rather than > hard-coded like this. Note that the function comments for > exit_nicely() are heavily implicated in this problem, since they also > apply to stuff that only happens in pg_dump and not other utilities. The 0001 patch has been restructured to not have this problem. > I'm skeptical about the idea of putting functions into string_utils.c > with names as generic as include_filter() and exclude_filter(). > Existing cases like fmtId() and fmtQualifiedId() are not great either, > but I think this is worse and that we should do some renaming. On a > related note, it's not clear to me why these should be classified as > string_utils while stuff like expand_schema_name_patterns() gets > classified as option_utils. These are neither generic > string-processing functions nor are they generic options-parsing > functions. They are functions for expanding shell-glob style patterns > for database object names. And they seem like they ought to be > together, because they seem to do closely-related things. I'm open to > an argument that this is wrongheaded on my part, but it looks weird to > me the way it is. The logic to filter which relations are checked is completely restructured and is kept in pg_amcheck.c > I'm pretty unimpressed by query_utils.c. The CurrentResultHandler > stuff looks grotty, and you don't seem to really use it anywhere. And > it seems woefully overambitious to me anyway: this doesn't apply to > every kind of "result" we've got hanging around, absolutely nothing > even close to that, even though a name like CurrentResultHandler > sounds very broad. It also means more global variables, which is a > thing of which the PostgreSQL codebase already has a deplorable > oversupply. quiet_handler() and noop_handler() aren't used anywhere > either, AFAICS. > > I wonder if it would be better to pass in callbacks rather than > relying on global variables. e.g.: > > typedef void (*fatal_error_callback)(const char *fmt,...) > pg_attribute_printf(1, 2) pg_attribute_noreturn(); > > Then you could have a few helper functions that take an argument of > type fatal_error_callback and throw the right fatal error for (a) > wrong PQresultStatus() and (b) result is not one row. Do you need any > other cases? exiting_handler() seems to think that the caller might > want to allow any number of tuples, or any positive number, or any > particular cout, but I'm not sure if all of those cases are really > needed. The error callback stuff has been refactored in this next patch set, and also now includes handlers for parallel slots, asthe src/bin/scripts/scripts_parallel.c stuff has been moved to fe_utils and made more general. As it was, there were hardcodedassumptions that are valid for reindexdb and vacuumdb, but not general enough for pg_amcheck to use. The refactoringin patches 0002 through 0005 make it more generally usable. Patch 0008 uses it in pg_amcheck. 
> This stuff is finnicky and hard to get right. You don't really want to > create a situation where the same code keeps getting duplicated, or > the behavior's just a little bit inconsistent everywhere, but it also > isn't great to build layers upon layers of abstraction around > something like ExecuteSqlQuery which is, in the end, a four-line > function. I don't think there's any problem with something like > pg_dump having it's own function to execute-a-query-or-die. Maybe that > function ends up doing something like > TheGenericFunctionToExecuteOrDie(my_die_fn, the_query), or maybe > pg_dump can just open-code it but have a my_die_fn to pass down to the > glob-expansion stuff, or, well, I don't know. There are some real improvements in this next patch set. The number of queries issued to the database to determine the databases to use is much reduced. I had been following the pattern in pg_dump, but abandoned that for something new. The parallel slots stuff is now used for parallelism, much like what is done in vacuumdb and reindexdb. The pg_amcheck application can now be run over one database, multiple specified databases, or all databases. Relations, schemas, and databases can be included and excluded by pattern, like "(db1|db2|db3).myschema.(mytable|myindex)". The real-world use-cases for this that I have in mind are things like:

pg_amcheck --jobs=12 --all \
    --exclude-relation="db7.schema.known_corrupt_table" \
    --exclude-relation="db*.schema.known_big_table"

and

pg_amcheck --jobs=20 \
    --include-relation="*.compliance.audited"

I might be missing something, but I think the interface is a superset of the interface from reindexdb and vacuumdb. None of the new interface stuff (patterns, allowing multiple databases to be given on the command line, etc) is required. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
- v33-0001-Moving-exit_nicely-and-fatal-into-fe_utils.patch
- v33-0002-Introducing-PGresultHandler-abstraction.patch
- v33-0003-Preparing-for-move-of-parallel-slot-infrastructu.patch
- v33-0004-Moving-and-renaming-scripts_parallel.patch
- v33-0005-Parameterizing-parallel-slot-result-handling.patch
- v33-0006-Moving-handle_help_version_opts.patch
- v33-0007-Refactoring-processSQLNamePattern.patch
- v33-0008-Adding-contrib-module-pg_amcheck.patch
I like 0007 quite a bit and am inclined to commit it soon, as it doesn't depend on the earlier patches. But:

- I think the residual comment in processSQLNamePattern beginning with "Note:" could use some wordsmithing to account for the new structure of things -- maybe just "this pass" -> "this function".
- I suggest changing initializations like maxbuf = buf + 2 to maxbuf = &buf[2] for clarity.

Regarding 0001:

- My preference would be to dump on_exit_nicely_final() and just rely on order of registration.
- I'm not entirely sure it's a good idea to expose something named fatal() like this, because that's a fairly short and general name. On the other hand, it's pretty descriptive and it's not clear why someone including exit_utils.h would want any other definition. I guess we can always change it later if it proves to be problematic; it's got a lot of callers and I guess there's no point in churning the code without a clear reason.
- I don't quite see why we need this at all. Like, exit_nicely() is a pg_dump-ism. It would make sense to centralize it if we were going to use it for pg_amcheck, but you don't. If you were going to, you'd need to adapt 0003 to use exit_nicely() instead of exit(), but you don't, nor do you add any other new calls to exit_nicely() anywhere, except for one in 0002. That makes the PGresultHandler stuff depend on exit_nicely(), which might be important if you were going to refactor pg_dump to use that abstraction, but you don't. I'm not opposed to the idea of centralized exit processing for frontend utilities; indeed, it seems like a good idea. But this doesn't seem to get us there. AFAICS it just entangles pg_dump with pg_amcheck unnecessarily in a way that doesn't really benefit either of them.

Regarding 0002:

- I don't think this is separately committable because it adds an abstraction but not any uses of that abstraction to demonstrate that it's actually any good. Perhaps it should just be merged into 0005, and even into parallel_slot.h vs. having its own header. I'm not really sure about that, though.
- Is this really much of an abstraction layer? Like, how generic can this be when the argument list includes ExecStatusType expected_status and int expected_ntups?
- The logic seems to be very similar to some of the stuff that you move around in 0003, like executeQuery() and executeCommand(), but it doesn't get unified. I'm not necessarily saying it should be, but it's weird to do all this refactoring and end up with something that still looks like this.

0003, 0004, and 0006 look pretty boring; they are just moving code around. Is there any point in splitting the code from 0003 across two files? Maybe it's fine. If I run pg_amcheck --all -j4 do I get a serialization boundary across databases? Like, I have to completely finish db1 before I can go onto db2, even though maybe only one worker is still busy with it? -- Robert Haas EDB: http://www.enterprisedb.com
> On Jan 28, 2021, at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > If I run pg_amcheck --all -j4 do I get a serialization boundary across > databases? Like, I have to completely finish db1 before I can go onto > db2, even though maybe only one worker is still busy with it? Yes, you do. That's patterned on reindexdb and vacuumdb. Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Jan 28, 2021, at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > I like 0007 quite a bit and am inclined to commit it soon, as it > doesn't depend on the earlier patches. But: > > - I think the residual comment in processSQLNamePattern beginning with > "Note:" could use some wordsmithing to account for the new structure > of things -- maybe just "this pass" -> "this function". > - I suggest changing initializations like maxbuf = buf + 2 to maxbuf = > &buf[2] for clarity. Ok, I should be able to get you an updated version of 0007 with those changes here soon for you to commit. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Thu, Jan 28, 2021 at 12:40 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > On Jan 28, 2021, at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > If I run pg_amcheck --all -j4 do I get a serialization boundary across > > databases? Like, I have to completely finish db1 before I can go onto > > db2, even though maybe only one worker is still busy with it? > > Yes, you do. That's patterned on reindexdb and vacuumdb. Sounds lame, but fair enough. We can leave that problem for another day. -- Robert Haas EDB: http://www.enterprisedb.com
> On Jan 28, 2021, at 9:49 AM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Jan 28, 2021 at 12:40 PM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >>> On Jan 28, 2021, at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote: >>> If I run pg_amcheck --all -j4 do I get a serialization boundary across >>> databases? Like, I have to completely finish db1 before I can go onto >>> db2, even though maybe only one worker is still busy with it? >> >> Yes, you do. That's patterned on reindexdb and vacuumdb. > > Sounds lame, but fair enough. We can leave that problem for another day. Yeah, I agree that it's lame, and should eventually be addressed. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On Jan 28, 2021, at 9:41 AM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > > >> On Jan 28, 2021, at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote: >> >> I like 0007 quite a bit and am inclined to commit it soon, as it >> doesn't depend on the earlier patches. But: >> >> - I think the residual comment in processSQLNamePattern beginning with >> "Note:" could use some wordsmithing to account for the new structure >> of things -- maybe just "this pass" -> "this function". >> - I suggest changing initializations like maxbuf = buf + 2 to maxbuf = >> &buf[2] for clarity. > > Ok, I should be able to get you an updated version of 0007 with those changes here soon for you to commit. I made those changes, and fixed a bug that would impact the pg_amcheck callers. I'll have to extend the regression test coverage in 0008 since it obviously wasn't caught, but that's not part of this patch since there are no callers that use the dbname.schema.relname format as yet. This is the only patch for v34, since you want to commit it separately. It's renamed as 0001 here.... — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
> On Jan 28, 2021, at 9:13 AM, Robert Haas <robertmhaas@gmail.com> wrote: Attached is patch set 35. Per your review comments, I have restructured the patches in the following way:

v33's 0007 is now the first patch, v35's 0001

v33's 0001 is no more. The frontend infrastructure for error handling and exiting may be resubmitted someday in another patch, but they aren't necessary for pg_amcheck

v33's 0002 is no more. The PGresultHandler stuff that it defined inspires some of what comes later in v35's 0003, but it isn't sufficiently similar to what v35 does to be thought of as moving from v33-0002 into v35-0003.

v33's 0003, 0004 and 0006 are combined into v35's 0002

v33's 0005 becomes v35's 0003

v33's 0008 becomes v35's 0004

Additionally, pg_amcheck testing is extended beyond what v33 had in v35's new 0005 patch, but pg_amcheck doesn't depend on this new 0005 patch ever being committed, so if you don't like it, just throw it in the bit bucket. > > I like 0007 quite a bit and am inclined to commit it soon, as it > doesn't depend on the earlier patches. But: > > - I think the residual comment in processSQLNamePattern beginning with > "Note:" could use some wordsmithing to account for the new structure > of things -- maybe just "this pass" -> "this function". > - I suggest changing initializations like maxbuf = buf + 2 to maxbuf = > &buf[2] for clarity Already responded to this in the v34 development a few days ago. Nothing meaningfully changes between 34 and 35. > Regarding 0001: > > - My preference would be to dump on_exit_nicely_final() and just rely > on order of registration. > - I'm not entirely sure it's a good idea to expose something named > fatal() like this, because that's a fairly short and general name. On > the other hand, it's pretty descriptive and it's not clear why someone > including exit_utils.h would want any other definition. I guess we > can always change it later if it proves to be problematic; it's got a > lot of callers and I guess there's no point in churning the code > without a clear reason. > - I don't quite see why we need this at all. Like, exit_nicely() is a > pg_dump-ism. It would make sense to centralize it if we were going to > use it for pg_amcheck, but you don't. If you were going to, you'd need > to adapt 0003 to use exit_nicely() instead of exit(), but you don't, > nor do you add any other new calls to exit_nicely() anywhere, except > for one in 0002. That makes the PGresultHandler stuff depend on > exit_nicely(), which might be important if you were going to refactor > pg_dump to use that abstraction, but you don't. I'm not opposed to the > idea of centralized exit processing for frontend utilities; indeed, it > seems like a good idea. But this doesn't seem to get us there. AFAICS > it just entangles pg_dump with pg_amcheck unnecessarily in a way that > doesn't really benefit either of them. Removed from v35. > Regarding 0002: > > - I don't think this is separately committable because it adds an > abstraction but not any uses of that abstraction to demonstrate that > it's actually any good. Perhaps it should just be merged into 0005, > and even into parallel_slot.h vs. having its own header. I'm not > really sure about that, though Yeah, this is gone from v35, with hints of it moved into 0003 as part of the parallel slots refactoring. > - Is this really much of an abstraction layer? Like, how generic can > this be when the argument list includes ExecStatusType expected_status > and int expected_ntups? The new format takes a void *context argument. 
> - The logic seems to be very similar to some of the stuff that you > move around in 0003, like executeQuery() and executeCommand(), but it > doesn't get unified. I'm not necessarily saying it should be, but it's > weird to do all this refactoring and end up with something that still > looks like this. Yeah, I agree with this. The refactoring is a lot less ambitious in v35, to avoid these issues. > 0003, 0004, and 0006 look pretty boring; they are just moving code > around. Is there any point in splitting the code from 0003 across two > files? Maybe it's fine. Combined. > If I run pg_amcheck --all -j4 do I get a serialization boundary across > databases? Like, I have to completely finish db1 before I can go onto > db2, even though maybe only one worker is still busy with it? The command line interface and corresponding semantics for specifying which tables to check, which schemas to check, and which databases to check should be the same as that for reindexdb and vacuumdb, and the behavior for handing off those targets to be checked/reindexed/vacuumed through the parallel slots interface should be the same. It seems a bit much to refactor reindexdb and vacuumdb to match pg_amcheck when pg_amcheck hasn't been accepted for commit as yet. If/when that happens, and if the project generally approves of going in this direction, I think the next step will be to refactor some of this logic out of pg_amcheck into fe_utils and use it from all three utilities. At that time, I'd like to tackle the serialization choke point in all three, and handle it in the same way for them all. For the new v35-0005 patch, I have extended PostgresNode.pm with some new corruption abilities. In short, it can now take a snapshot of the files that back a relation, and can corruptly rollback those files to prior versions, in full or in part. This allows creating kinds of corruption that are hard to create through mere bit twiddling. For example, if the relation backing an index is rolled back to a prior version, amcheck's btree checking sees the index as not corrupt, but when asked to reconcile the entries in the heap with the index, it can see that not all of them are present. This gives test coverage of corruption checking functionality that is otherwise hard to achieve. To check that the PostgresNode.pm changes themselves work, v35-0005 adds src/test/modules/corruption To check pg_amcheck, and by implication amcheck, v35-0005 adds contrib/pg_amcheck/t/006_relfile_damage.pl Once again, v35-0005 does not need to be committed -- pg_amcheck works just fine without it. You and I have discussed this off-list, but for the record, amcheck and pg_amcheck currently only check heaps and btree indexes. Other object types, such as sequences and non-btree indexes, are not checked. Some basic sanity checking of other object types would be a good addition, and pg_amcheck has been structured in a way where it should be fairly straightforward to add support for those. The only such sanity checking that I thought could be done in a short timeframe was to check that the relation files backing the objects were not missing, and we decided off-list such checking wasn't worth much, so I didn't add it. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
> On Jan 31, 2021, at 4:05 PM, Mark Dilger <mark.dilger@enterprisedb.com> wrote: > > Attached is patch set 35. I found some things to improve in the v35 patch set. Please find attached the v36 patch set, which differs from v35 in the following ways:

0001 -- no changes

0002 -- fixing omissions in @pgfeutilsfiles in file src/tools/msvc/Mkvcbuild.pm

0003 -- no changes

0004:
-- Fixes handling of amcheck contrib module installed in non-default schema.
-- Adds database name to corruption messages to make identifying the relation being complained about unambiguous in multi-database checks
-- Fixes an instance where pg_amcheck was querying pg_database without schema-qualifying it
-- Simplifying some functions in pg_amcheck.c
-- Updating a comment to reflect the renaming of a variable that the comment mentioned by name

0005 -- fixes =pod added in PostgresNode.pm. The =pod was grammatically correct so far as I can tell, but rendered strangely in perldoc. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
On Tue, Feb 2, 2021 at 6:10 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > 0001 -- no changes Committed. > 0002 -- fixing omissions in @pgfeutilsfiles in file src/tools/msvc/Mkvcbuild.pm Here are a few minor cosmetic issues with this patch: - connect_utils.c lacks a file header comment. - Some or perhaps all of the other file header comments need an update for 2021. - There's bogus hunks in the diff for string_utils.c. I think the rest of this looks good. I spent a long time puzzling over whether consumeQueryResult() and processQueryResult() needed to be moved, but then I realized that this patch actually makes them into static functions inside parallel_slot.c, rather than public functions as they were before. I like that. The only reason those functions need to be moved at all is so that the scripts_parallel/parallel_slot stuff can continue to do its thing, so this is actually a better way of grouping things together than what we have now. > 0003 -- no changes I think it would be better if there were no handler by default, and failing to set one leads to an assertion failure when we get to the point where one would be called. I don't think I understand the point of renaming processQueryResult and consumeQueryResult. Isn't that just code churn for its own sake? PGresultHandler seems too generic. How about ParallelSlotHandler or ParallelSlotResultHandler? I'm somewhat inclined to propose s/ParallelSlot/ConnectionSlot/g but I guess it's better not to get sucked into renaming things. It's a little strange that we end up with mutators to set the slot's handler and handler context when we elsewhere feel free to monkey with a slot's connection directly, but it's not a perfect world and I can't think of anything I'd like better. -- Robert Haas EDB: http://www.enterprisedb.com
> On Feb 3, 2021, at 2:03 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Tue, Feb 2, 2021 at 6:10 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote: >> 0001 -- no changes > > Committed. Thanks! >> 0002 -- fixing omissions in @pgfeutilsfiles in file src/tools/msvc/Mkvcbuild.pm Numbered 0001 in this next patch set. > Here are a few minor cosmetic issues with this patch: > > - connect_utils.c lacks a file header comment. Fixed > - Some or perhaps all of the other file header comments need an update for 2021. Fixed. > - There's bogus hunks in the diff for string_utils.c. Removed. > I think the rest of this looks good. I spent a long time puzzling over > whether consumeQueryResult() and processQueryResult() needed to be > moved, but then I realized that this patch actually makes them into > static functions inside parallel_slot.c, rather than public functions > as they were before. I like that. The only reason those functions need > to be moved at all is so that the scripts_parallel/parallel_slot stuff > can continue to do its thing, so this is actually a better way of > grouping things together than what we have now. >> 0003 -- no changes Numbered 0002 in this next patch set. > I think it would be better if there were no handler by default, and > failing to set one leads to an assertion failure when we get to the > point where one would be called. Changed to have no default handler, and to use Assert(PointerIsValid(handler)) as you suggest. > I don't think I understand the point of renaming processQueryResult > and consumeQueryResult. Isn't that just code churn for its own sake? I didn't like the names. I had to constantly look back where they were defined to remember which of them processed/consumed all the results and which only processed/consumed one of them. Part of that problem was that their names are both singular. I have restored the names in this next patch set. > PGresultHandler seems too generic. How about ParallelSlotHandler or > ParallelSlotResultHandler? ParallelSlotResultHandler works for me. I'm using that, and renaming s/TableCommandSlotHandler/TableCommandResultHandler/ to be consistent. > I'm somewhat inclined to propose s/ParallelSlot/ConnectionSlot/g but I > guess it's better not to get sucked into renaming things. I admit that I lost a fair amount of time on this project because I thought "scripts_parallel.c" and "parallel_slot" referred to some kind of threading, but only later looked closely enough to see that this is an event loop, not a parallel threading system. I don't think "slot" is terribly informative, and if we rename I don't think it needs to be part of the name we choose. ConnectionEventLoop would be more intuitive to me than either of ParallelSlot/ConnectionSlot, but this seems like bikeshedding so I'm going to ignore it for now. > It's a little strange that we end up with mutators to set the slot's > handler and handler context when we elsewhere feel free to monkey with > a slot's connection directly, but it's not a perfect world and I can't > think of anything I'd like better. I created those mutators in an earlier version of the patch where the slot had a few more fields to set, and it helped to have a single function call set all the fields. I agree it looks less nice now that there are only two fields to set. I also made changes to clean up 0003 (formerly numbered 0004) — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
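For readers following along, the shape being discussed is roughly as follows (a hypothetical sketch only; the structure and signatures in the actual patch differ in detail, and real frontend code would use the PostgreSQL Assert macro rather than <assert.h>):

#include <assert.h>
#include <stdbool.h>
#include "libpq-fe.h"

typedef bool (*ParallelSlotResultHandler) (PGresult *res, PGconn *conn,
                                           void *context);

typedef struct ParallelSlot
{
    PGconn     *connection;
    ParallelSlotResultHandler handler;
    void       *handler_context;
} ParallelSlot;

/* mutator: set the handler and its context together */
static void
ParallelSlotSetHandler(ParallelSlot *slot, ParallelSlotResultHandler handler,
                       void *context)
{
    slot->handler = handler;
    slot->handler_context = context;
}

static bool
dispatch_result(ParallelSlot *slot, PGresult *res)
{
    /* per the discussion, an unset handler is a programming error */
    assert(slot->handler != NULL);
    return slot->handler(res, slot->connection, slot->handler_context);
}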
Attachment
On Thu, Feb 4, 2021 at 11:10 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > I also made changes to clean up 0003 (formerly numbered 0004) "deduplice" is a typo. I'm not sure that I agree with check_each_database()'s commentary about why it doesn't make sense to optimize the resolve-the-databases step. Like, suppose I type 'pg_amcheck sasquatch'. I think the way you have it coded it's going to tell me that there are no databases to check, which might make me think I used the wrong syntax or something. I want it to tell me that sasquatch does not exist. If I happen to be a cryptid believer, I may reject that explanation as inaccurate, but at least there's no question about what pg_amcheck thinks the problem is. Why does check_each_database() go out of its way to run the main query without the always-secure search path? If there's a good reason, I think it deserves a comment saying what the reason is. If there's not a good reason, then I think it should use the always-secure search path for 100% of everything. Same question applies to check_one_database(). ParallelSlotSetHandler(free_slot, VerifyHeapamSlotHandler, sql.data) could stand to be split over two lines, like you do for the nearby run_command() call, so that it doesn't go past 80 columns. I suggest having two variables instead of one for amcheck_schema. Using the same variable to store the unescaped value and then later the escaped value is, IMHO, confusing. Whatever you call the escaped version, I'd rename the function parameters elsewhere to match. "status = PQsendQuery(conn, sql) == 1" seems a bit uptight to me. Why not just make status an int and then just "status = PQsendQuery(conn, sql)" and then test for status != 0? I don't really care if you don't change this, it's not actually important. But personally I'd rather code it as if any non-zero value meant success. I think the pg_log_error() in run_command() could be worded a bit better. I don't think it's a good idea to try to include the type of object in there like this, because of the translatability guidelines around assembling messages from fragments. And I don't think it's good to say that the check failed because the reality is that we weren't able to ask for the check to be run in the first place. I would rather log this as something like "unable to send query: %s". I would also assume we need to bail out entirely if that happens. I'm not totally sure what sorts of things can make PQsendQuery() fail but I bet it boils down to having lost the server connection. Should that occur, trying to send queries for all of the remaining objects is going to result in repeating the same error many times, which isn't going to be what anybody wants. It's unclear to me whether we should give up on the whole operation but I think we have to at least give up on that connection... unless I'm confused about what the failure mode is likely to be here. It looks to me like the user won't be able to tell by the exit code what happened. What I did with pg_verifybackup, and what I suggest we do here, is exit(1) if anything went wrong, either in terms of failing to execute queries or in terms of those queries returning problem reports. With pg_verifybackup, I thought about trying to make it like 0 => backup OK, 1 => backup not OK, 2 => trouble, but I found it too hard to distinguish what should be exit(1) and what should be exit(2) and the coding wasn't trivial either, so I went with the simpler scheme. 
The opening line of appendDatabaseSelect() could be adjusted to put the regexps parameter on the next line, avoiding awkward wrapping. If they are being run with a safe search path, the queries in appendDatabaseSelect(), appendSchemaSelect(), etc. could be run without all the paranoia. If not, maybe they should be. The casts to text don't include the paranoia: with an unsafe search path, we need pg_catalog.text here. Or no cast at all, which seems like it ought to be fine too. Not quite sure why you are doing all that casting to text; the datatype is presumably 'name' and ought to collate like collate "C" which is probably fine. It would probably be a better idea for appendSchemaSelect to declare a PQExpBuffer and call initPQExpBuffer just once, and then resetPQExpBuffer after each use, and finally termPQExpBuffer just once. The way you have it is not expensive enough to really matter, but avoiding repeated allocate/free cycles is probably best. I wonder if a pattern like .foo.bar ends up meaning the same thing as a pattern like foo.bar, with the empty database name being treated the same as if nothing were specified. From the way appendTableCTE() is coded, it seems to me that if I ask for tables named j* excluding tables named jam* I still might get toast tables for my jam, which seems wrong. There does not seem to be any clear benefit to defining CT_TABLE = 0 in this case, so I would let the compiler deal with it. We should not be depending on that to have any particular numeric value. Why does pg_amcheck.c have a header file pg_amcheck.h if there's only one source file? If you had multiple source files then the header would be a reasonable place to put stuff they all need, but you don't. Copying the definitions of HEAP_TABLE_AM_OID and BTREE_AM_OID into pg_amcheck.h or anywhere else seems bad. I think you just be doing #include "catalog/pg_am_d.h". I think I'm out of steam for today but I'll try to look at this more soon. In general I think this patch and the whole series are pretty close to being ready to commit, even though there are still things I think need fixing here and there. Thanks, -- Robert Haas EDB: http://www.enterprisedb.com
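The PQExpBuffer lifecycle being suggested is the usual init-once, reset-per-use, term-once pattern, sketched here under the assumption that the caller is frontend code in the PostgreSQL tree with access to pqexpbuffer.h (the query text is a placeholder, and real code would escape any interpolated names):

#include "pqexpbuffer.h"

static void
build_queries(const char *const *schemas, int nschemas)
{
    PQExpBufferData buf;
    int         i;

    initPQExpBuffer(&buf);
    for (i = 0; i < nschemas; i++)
    {
        resetPQExpBuffer(&buf);     /* reuse the same allocation each time */
        appendPQExpBuffer(&buf,
                          "SELECT ... WHERE nspname = '%s'",    /* escape in real code */
                          schemas[i]);
        /* ... send buf.data somewhere ... */
    }
    termPQExpBuffer(&buf);          /* free the memory once, at the end */
}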
On Thu, Feb 4, 2021 at 11:10 AM Mark Dilger <mark.dilger@enterprisedb.com> wrote: > Numbered 0001 in this next patch set. Hi, I committed 0001 as you had it and 0002 with some more cleanups. Things I did: - Adjusted some comments. - Changed processQueryResult so that it didn't do foo(bar) with foo being a pointer. Generally we prefer (*foo)(bar) when it can be confused with a direct function call, but wunk->foo(bar) is also considered acceptable. - Changed the return type of ParallelSlotResultHandler to be bool, because having it return PGresult * seemed to offer no advantages. -- Robert Haas EDB: http://www.enterprisedb.com
> On Feb 4, 2021, at 1:04 PM, Robert Haas <robertmhaas@gmail.com> wrote: > > On Thu, Feb 4, 2021 at 11:10 AM Mark Dilger > <mark.dilger@enterprisedb.com> wrote: >> I also made changes to clean up 0003 (formerly numbered 0004) > > "deduplice" is a typo. Fixed. > I'm not sure that I agree with check_each_database()'s commentary > about why it doesn't make sense to optimize the resolve-the-databases > step. Like, suppose I type 'pg_amcheck sasquatch'. I think the way you > have it coded it's going to tell me that there are no databases to > check, which might make me think I used the wrong syntax or something. > I want it to tell me that sasquatch does not exist. If I happen to be > a cryptid believer, I may reject that explanation as inaccurate, but > at least there's no question about what pg_amcheck thinks the problem > is. The way v38 is coded, 'pg_amcheck sasquatch' will return a non-zero error code with an error message, database "sasquatch" does not exist. The problem only comes up if you run it like one of the following:

pg_amcheck --maintenance-db postgres sasquatch
pg_amcheck postgres sasquatch
pg_amcheck "sasquatch.myschema.mytable"

In each of those, pg_amcheck first connects to the initial database ("postgres" or whatever) and tries to resolve all databases to check matching patterns like '^(postgres)$' and '^(sasquatch)$' and doesn't find any sasquatch matches, but also doesn't complain. In v39, this is changed to complain when patterns do not match. This can be turned off with --no-strict-names. > Why does check_each_database() go out of its way to run the main query > without the always-secure search path? If there's a good reason, I > think it deserves a comment saying what the reason is. If there's not > a good reason, then I think it should use the always-secure search > path for 100% of everything. Same question applies to > check_one_database(). That bit of code survived some refactoring, but it doesn't make sense to keep it, assuming it ever made sense at all. Removed in v39. The calls to connectDatabase will always secure the search_path, so pg_amcheck need not touch that directly. > ParallelSlotSetHandler(free_slot, VerifyHeapamSlotHandler, sql.data) > could stand to be split over two lines, like you do for the nearby > run_command() call, so that it doesn't go past 80 columns. Fair enough. The code has been treated to a pass through pgindent as well. > I suggest having two variables instead of one for amcheck_schema. > Using the same variable to store the unescaped value and then later > the escaped value is, IMHO, confusing. Whatever you call the escaped > version, I'd rename the function parameters elsewhere to match. The escaped version is now part of a struct, so there shouldn't be any confusion about this. > "status = PQsendQuery(conn, sql) == 1" seems a bit uptight to me. Why > not just make status an int and then just "status = PQsendQuery(conn, > sql)" and then test for status != 0? I don't really care if you don't > change this, it's not actually important. But personally I'd rather > code it as if any non-zero value meant success. I couldn't remember why I coded it like that, since it doesn't look like my style, then noticed I copied that from reindexdb.c, upon which this code is patterned. I agree it looks strange, and I've changed it in v39. Unlike the call site in reindexdb, there isn't any reason for pg_amcheck to store the returned value in a variable, so in v39 it doesn't. > I think the pg_log_error() in run_command() could be worded a bit > better. 
I don't think it's a good idea to try to include the type of > object in there like this, because of the translatability guidelines > around assembling messages from fragments. And I don't think it's good > to say that the check failed because the reality is that we weren't > able to ask for the check to be run in the first place. I would rather > log this as something like "unable to send query: %s". I would also > assume we need to bail out entirely if that happens. I'm not totally > sure what sorts of things can make PQsendQuery() fail but I bet it > boils down to having lost the server connection. Should that occur, > trying to send queries for all of the remaining objects is going to > result in repeating the same error many times, which isn't going to be > what anybody wants. It's unclear to me whether we should give up on > the whole operation but I think we have to at least give up on that > connection... unless I'm confused about what the failure mode is > likely to be here. Changed in v39 to report the error as you suggest. It will reconnect and retry a command one time on error. That should cover the case that the connection to the database was merely lost. If the second attempt also fails, no further retry of the same command is attempted, though commands for remaining relation targets will still be attempted, both for the database that had the error and for other remaining databases in the list. Assuming something is wrong with "db2", the command `pg_amcheck db1 db2 db3` could result in two failures per relation in db2 before finally moving on to db3. That seems pretty awful considering how many relations that could be, but failing to soldier on in the face of errors seems a strange design for a corruption checking tool. > It looks to me like the user won't be able to tell by the exit code > what happened. What I did with pg_verifybackup, and what I suggest we > do here, is exit(1) if anything went wrong, either in terms of failing > to execute queries or in terms of those queries returning problem > reports. With pg_verifybackup, I thought about trying to make it like > 0 => backup OK, 1 => backup not OK, 2 => trouble, but I found it too > hard to distinguish what should be exit(1) and what should be exit(2) > and the coding wasn't trivial either, so I went with the simpler > scheme. In v39, exit(1) is used for all errors which are intended to stop the program. It is important to recognize that finding corruption is not an error in this sense. A query to verify_heapam() can fail if the relation's checksums are bad, and that happens beyond verify_heapam()'s control when the page is not allowed into the buffers. There can be errors if the file backing a relation is missing. There may be other corruption error cases that I have not yet thought about. The connections' errors get reported to the user, but pg_amcheck does not exit as a consequence of them. As discussed above, failing to send the query to the server is not viewed as a reason to exit, either. It would be hard to quantify all the failure modes, but presumably the catalogs for a database could be messed up enough to cause such failures, and I'm not sure that pg_amcheck should just abort. > > The opening line of appendDatabaseSelect() could be adjusted to put > the regexps parameter on the next line, avoiding awkward wrapping. > > If they are being run with a safe search path, the queries in > appendDatabaseSelect(), appendSchemaSelect(), etc. could be run > without all the paranoia. If not, maybe they should be. 
The casts to > text don't include the paranoia: with an unsafe search path, we need > pg_catalog.text here. Or no cast at all, which seems like it ought to > be fine too. Not quite sure why you are doing all that casting to > text; the datatype is presumably 'name' and ought to collate like > collate "C" which is probably fine. In v39, everything is being run with a safe search path, and the paranoia and casts are largely gone. > It would probably be a better idea for appendSchemaSelect to declare a > PQExpBuffer and call initPQExpBuffer just once, and then > resetPQExpBuffer after each use, and finally termPQExpBuffer just > once. The way you have it is not expensive enough to really matter, > but avoiding repeated allocate/free cycles is probably best. I'm not sure what this comment refers to, but this function doesn't exist in v39. > I wonder if a pattern like .foo.bar ends up meaning the same thing as > a pattern like foo.bar, with the empty database name being treated the > same as if nothing were specified. That's really a question of how patternToSQLRegex parses that string. In general, "a.b.c" => ("^(a)$", "^(b)$", "^(c)$"), so I would expect your example to have a database pattern "^()$" which should only match databases with zero length names, presumably none. I've added a regression test for this, and indeed that's what it does. > From the way appendTableCTE() is coded, it seems to me that if I ask > for tables named j* excluding tables named jam* I still might get > toast tables for my jam, which seems wrong. In v39, the query is entirely reworked, so I can't respond directly to this, though I agree that excluding a table should mean the toast table does not automatically get included. There is an interaction, though, if you select both "j*" and "pg_toast.*" and then exclude "jam". > There does not seem to be any clear benefit to defining CT_TABLE = 0 > in this case, so I would let the compiler deal with it. We should not > be depending on that to have any particular numeric value. The enum is removed in v39. > Why does pg_amcheck.c have a header file pg_amcheck.h if there's only > one source file? If you had multiple source files then the header > would be a reasonable place to put stuff they all need, but you don't. Everything is in pg_amcheck.c now. > Copying the definitions of HEAP_TABLE_AM_OID and BTREE_AM_OID into > pg_amcheck.h or anywhere else seems bad. I think you just be doing > #include "catalog/pg_am_d.h". Good point. Done. > I think I'm out of steam for today but I'll try to look at this more > soon. In general I think this patch and the whole series are pretty > close to being ready to commit, even though there are still things I > think need fixing here and there. Reworking the code took a while. Version 39 patches attached. — Mark Dilger EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
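As a worked example of the three-part pattern mapping described above, the following sketch (illustrative only, not the actual patternToSQLRegex, which also handles quoting and converts shell-glob characters) shows how a dotted pattern splits into anchored regular expressions:

#include <stdio.h>
#include <string.h>

int
main(void)
{
    const char *pattern = "a.b.c";      /* also try ".foo.bar" or "foo.bar" */
    char        parts[3][64] = {"", "", ""};
    int         npart = 0;
    int         i;
    const char *p;

    /* naive split on '.'; at most three parts (database.schema.relation) */
    for (p = pattern; *p != '\0'; p++)
    {
        if (*p == '.' && npart < 2)
            npart++;
        else if (strlen(parts[npart]) < sizeof(parts[npart]) - 1)
            strncat(parts[npart], p, 1);
    }

    /*
     * "a.b.c" prints ^(a)$ ^(b)$ ^(c)$; with fewer dots the leading parts
     * (database, then schema) are simply absent.  ".foo.bar" yields a
     * database pattern ^()$ that can only match a zero-length name.
     */
    for (i = 0; i <= npart; i++)
        printf("^(%s)$\n", parts[i]);
    return 0;
}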
Attachment
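
As a toy illustration of the dot-separated pattern splitting discussed above (not the real patternToSQLRegex(), which also translates shell-style wildcards and handles quoting), the following shows why a leading dot, as in ".foo.bar", leaves an empty database pattern "^()$" that can only match a zero-length name:

    #include <stdio.h>
    #include <string.h>

    /* Print each dot-separated field wrapped as an anchored regex. */
    static void
    show_split(const char *pattern)
    {
        const char *start = pattern;

        printf("%-10s =>", pattern);
        for (;;)
        {
            const char *dot = strchr(start, '.');
            int         len = dot ? (int) (dot - start) : (int) strlen(start);

            printf("  \"^(%.*s)$\"", len, start);
            if (dot == NULL)
                break;
            start = dot + 1;
        }
        printf("\n");
    }

    int
    main(void)
    {
        show_split("a.b.c");        /* "^(a)$"  "^(b)$"   "^(c)$" */
        show_split(".foo.bar");     /* "^()$"   "^(foo)$" "^(bar)$" */
        return 0;
    }
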
On Wed, Feb 17, 2021 at 1:46 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> It will reconnect and retry a command one time on error. That should cover the case that the connection to the database was merely lost. If the second attempt also fails, no further retry of the same command is attempted, though commands for remaining relation targets will still be attempted, both for the database that had the error and for other remaining databases in the list.
>
> Assuming something is wrong with "db2", the command `pg_amcheck db1 db2 db3` could result in two failures per relation in db2 before finally moving on to db3. That seems pretty awful considering how many relations that could be, but failing to soldier on in the face of errors seems a strange design for a corruption checking tool.

That doesn't seem right at all. I think a PQsendQuery() failure is so remote that it's probably justification for giving up on the entire operation. If it's caused by a problem with some object, it probably means that accessing that object caused the whole database to go down, and retrying the object will take the database down again. Retrying the object is betting that the user interrupted connectivity between pg_amcheck and the database but the interruption is only momentary and the user actually wants to complete the operation. That seems unlikely to me. I think it's far more probable that the database crashed or got shut down and continuing is futile.

My proposal is: if we get an ERROR trying to *run* a query, give up on that object but still try the other ones after reconnecting. If we get a FATAL or PANIC trying to *run* a query, give up on the entire operation. If even sending a query fails, also give up.

> In v39, exit(1) is used for all errors which are intended to stop the program. It is important to recognize that finding corruption is not an error in this sense. A query to verify_heapam() can fail if the relation's checksums are bad, and that happens beyond verify_heapam()'s control when the page is not allowed into the buffers. There can be errors if the file backing a relation is missing. There may be other corruption error cases that I have not yet thought about. The connections' errors get reported to the user, but pg_amcheck does not exit as a consequence of them. As discussed above, failing to send the query to the server is not viewed as a reason to exit, either. It would be hard to quantify all the failure modes, but presumably the catalogs for a database could be messed up enough to cause such failures, and I'm not sure that pg_amcheck should just abort.

I agree that exit(1) should happen after any error intended to stop the program. But I think it should also happen at the end of the run if we hit any problems for which we did not stop, so that exit(0) means your database is healthy.

--
Robert Haas
EDB: http://www.enterprisedb.com
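
A rough sketch of the policy proposed here, with an invented function name and an assumed found_problems flag, might distinguish a plain ERROR from FATAL or PANIC using the severity field that libpq exposes via PQresultErrorField():

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>
    #include <libpq-fe.h>

    static bool found_problems = false;    /* force exit(1) at end of run */

    /*
     * Decide what to do after a query fails.  Returns true to continue
     * with the next relation, false to abandon the whole run.  libpq
     * reports both ERROR and FATAL as PGRES_FATAL_ERROR, so the severity
     * field is what tells them apart.
     */
    bool
    handle_failed_query(PGconn *conn, PGresult *res)
    {
        const char *sev = res
            ? PQresultErrorField(res, PG_DIAG_SEVERITY_NONLOCALIZED)
            : NULL;

        found_problems = true;

        /* Could not even send the query, lost the connection, or FATAL/PANIC. */
        if (res == NULL ||
            PQstatus(conn) == CONNECTION_BAD ||
            (sev != NULL &&
             (strcmp(sev, "FATAL") == 0 || strcmp(sev, "PANIC") == 0)))
        {
            fprintf(stderr, "giving up: %s", PQerrorMessage(conn));
            return false;
        }

        /* A plain ERROR: report it and move on to the next relation. */
        fprintf(stderr, "skipping object: %s", PQerrorMessage(conn));
        return true;
    }

At the end of the run the program would then call exit(found_problems ? 1 : 0), so that exit(0) really does mean everything checked out clean.
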
On Wed, Feb 17, 2021 at 1:46 PM Mark Dilger <mark.dilger@enterprisedb.com> wrote:
> Reworking the code took a while. Version 39 patches attached.

Regarding the documentation, I think the Usage section at the top is far too extensive and duplicates the option description section to far too great an extent. You have 21 usage examples for a command with 34 options. Even if we think it's a good idea to give a brief summary of usage, it's got to be brief; we certainly don't need examples of obscure special-purpose options like --maintenance-db here.

Looking through the commands in "PostgreSQL Client Applications" and "Additional Supplied Programs," most of them just have a synopsis section and nothing like this Usage section. Those that do have a Usage section typically use it for a narrative description of what to do with the tool (e.g. see pg_test_timing), not a long list of examples. I'm inclined to think you should nuke all the examples and incorporate the descriptive text, to the extent that it's needed, either into the descriptions of the individual options or, if the behavior spans many options, into the Description section. A few of these examples could move down into an Examples section at the bottom, perhaps, but I think 21 is still too many. I'd try to limit it to 5-7. Just hit the highlights.

I also think that perhaps it's not best to break up the list of options into so many different categories the way you have. Notice that for example pg_dump and psql don't do this, instead putting everything into one ordered list, despite also having a lot of options. This is arguably worse if you want to understand which options are related to each other, but it's better if you are just looking for something based on alphabetical order.

--
Robert Haas
EDB: http://www.enterprisedb.com