Thread: More snapshot-management fun

More snapshot-management fun

From
Tom Lane
Date:
While pursuing the unprotected-use-of-CatalogSnapshot problem I noted
earlier, I came across a couple of related issues:

* ProcArrayInstallImportedXmin and ProcArrayInstallRestoredXmin will
forcibly overwrite MyPgXact->xmin with some random xmin or other.
While they take pains to ensure that doesn't cause the global xmin
horizon to go backwards, there's no obvious interlock ensuring that
it doesn't cause MyPgXact->xmin to go *forwards* enough to render
some locally-stored snapshot unprotected.  I think we need to add a
call that will check for that and throw an error.  (In the case of
CatalogSnapshot, it could just invalidate CatalogSnapshot, but there
isn't any way for snapmgr.c to unilaterally invalidate a registered
or stacked-active snapshot.)  I think this may be a live bug with
respect to CatalogSnapshot, even if callers are defending against
there being any other active snapshots.

* I tried addingAssert(MyPgXact->xmin != InvalidTransactionId);
to SnapshotResetXmin() just after it's determined that RegisteredSnapshots
is nonempty, reasoning that the advertised xmin had better be nonzero if
we have any registered snapshots.  It promptly blew up on me.  The reason
is that we clear MyPgXact->xmin during ProcArrayEndTransaction, long
before AtEOXact_Snapshot.  AtEOXact_Snapshot itself didn't have a problem,
but in some cases we may call InvalidateCatalogSnapshot during the cleanup
sequence, and the modified version of that that I'm testing invokes
SnapshotResetXmin which detected the inconsistency.  I find this very
scary: it means that all through transaction cleanup, we have unprotected
snapshots sitting around.  Yeah, they should probably not get used for
anything, but there's no very strong mechanism guaranteeing that.

An easy attempt at a fix for the second problem would be to not clear
MyPgXact->xmin during ProcArrayEndTransaction, but leave it to be done
when we flush our snapshots.  But I seem to recall that it's intentional
and necessary that we clear that synchronously with clearing MyPgXact->xid.

An idea I'm playing with is to add logic so that just before
ProcArrayEndTransaction, we run through all registered and stacked
snapshots and replace their "satisfies" function pointers with pointers
to a function that just throws an error, thus catching any attempt to
use the snapshots for anything during transaction cleanup.  This is sort
of an annoying use of cycles, though I suppose we could do it only in
assert-enabled builds.

Thoughts?
        regards, tom lane



Re: More snapshot-management fun

From
Tom Lane
Date:
I wrote:
> * ProcArrayInstallImportedXmin and ProcArrayInstallRestoredXmin will
> forcibly overwrite MyPgXact->xmin with some random xmin or other.
> While they take pains to ensure that doesn't cause the global xmin
> horizon to go backwards,

and while we're on the subject, I doubt that those pains are sufficient.
Assume that xact A is importing an xmin from xact B, and meanwhile xact
C is attempting to determine the global xmin.  Since the above-mentioned
functions take the ProcArrayLock in shared mode, these things can happen
concurrently.  Assume that the xmin in question is the oldest one in the
system, and consider the following sequence of events:

1. C's scan of the procarray reaches xact A.  Its xmin isn't set yet,
so nothing happens.

2. A imports xmin from B.  Now A's xmin is set, but too late for C
to notice.

3. B resets its MyPgXact->xmin to zero.  Even though it's not allowed to
exit its transaction because C still has the shared ProcArrayLock, that
doesn't stop B from running SnapshotResetXmin intra-transaction.

4. C's scan of the procarray reaches xact B.  B's xmin is now zero,
so that doesn't inform C either.

Upshot: A now has an xmin that is before C's opinion of the global
xmin, allowing C to prune away data that A needs.

In practice this may not be a live bug, because I suspect that we don't
export snapshots that wouldn't be kept till end-of-transaction, so that
B would not actually advance/clear its xmin while C holds the lock.
But there's certainly nothing protecting against it in procarray.c,
and no documentation of the hazard anyplace.
        regards, tom lane