Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE - Mailing list pgsql-hackers

From Andres Freund
Subject Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE
Date
Msg-id 20230109220751.7quf2kfzfdvd76hw@awork3.anarazel.de
Whole thread Raw
In response to Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE  (Mark Dilger <mark.dilger@enterprisedb.com>)
Responses Re: BUG: Postgres 14 + vacuum_defer_cleanup_age + FOR UPDATE + UPDATE
List pgsql-hackers
Hi,

On 2023-01-09 13:55:02 -0800, Mark Dilger wrote:
> > On Jan 9, 2023, at 11:34 AM, Andres Freund <andres@anarazel.de> wrote:
> > 
> > 1) Because ctx->next_xid is set after the XidFromFullTransactionId() call in
> > update_cached_xid_range(), we end up using the xid 0 (or an outdated value in
> > subsequent calls) to determine whether epoch needs to be reduced.
> 
> Can you say a bit more about your analysis here, preferably with pointers to
> the lines of code you are analyzing?  Does the problem exist in amcheck as
> currently committed, or are you thinking about a problem that arises only
> after applying your patch?  I'm a bit fuzzy on where xid 0 gets used.

The problems exist in the code as currently committed. I'm not sure what
exactly the consequences are, the result is that oldest_fxid will be, at least
temporarily, bogus.

Consider the first call to update_cached_xid_range():

/*
 * Update our cached range of valid transaction IDs.
 */
static void
update_cached_xid_range(HeapCheckContext *ctx)
{
    /* Make cached copies */
    LWLockAcquire(XidGenLock, LW_SHARED);
    ctx->next_fxid = ShmemVariableCache->nextXid;
    ctx->oldest_xid = ShmemVariableCache->oldestXid;
    LWLockRelease(XidGenLock);

    /* And compute alternate versions of the same */
    ctx->oldest_fxid = FullTransactionIdFromXidAndCtx(ctx->oldest_xid, ctx);
    ctx->next_xid = XidFromFullTransactionId(ctx->next_fxid);
}

The problem is that the call to FullTransactionIdFromXidAndCtx() happens
before ctx->next_xid is assigned, even though FullTransactionIdFromXidAndCtx()
uses ctx->next_xid.

static FullTransactionId
FullTransactionIdFromXidAndCtx(TransactionId xid, const HeapCheckContext *ctx)
{
    uint32        epoch;

    if (!TransactionIdIsNormal(xid))
        return FullTransactionIdFromEpochAndXid(0, xid);
    epoch = EpochFromFullTransactionId(ctx->next_fxid);
    if (xid > ctx->next_xid)
        epoch--;
    return FullTransactionIdFromEpochAndXid(epoch, xid);
}

Because ctx->next_xid is 0, due to not having been set yet, "xid > ctx->next_xid"
will always be true, leading to epoch being reduced by one.

In the common case of there never having been an xid wraparound, we'll thus
underflow epoch, generating an xid far into the future.


The tests encounter the issue today. If you add
    Assert(TransactionIdIsValid(ctx->next_xid));
    Assert(FullTransactionIdIsValid(ctx->next_fxid));
early in FullTransactionIdFromXidAndCtx() it'll be hit in the
amcheck/pg_amcheck tests.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Show various offset arrays for heap WAL records
Next
From: Peter Geoghegan
Date:
Subject: Re: Reducing the WAL overhead of freezing in VACUUM by deduplicating per-tuple freeze plans