Re: snapshot too old issues, first around wraparound and then more. - Mailing list pgsql-hackers

From Andres Freund
Subject Re: snapshot too old issues, first around wraparound and then more.
Date
Msg-id 20200403001235.e6jfdll3gh2ygbuc@alap3.anarazel.de
Whole thread Raw
In response to Re: snapshot too old issues, first around wraparound and then more.  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Hi,

I just spend a good bit more time improving my snapshot patch, so it
could work well with a fixed version of the old_snapshot_threshold
feature. Mostly so there's no unnecessary dependency on the resolution
of the issues in that patch.


When testing my changes, for quite a while, I could not get
src/test/modules/snapshot_too_old/ to trigger a single too-old error.

It turns out, that's because there's not a single tuple removed due to
old_snapshot_threshold in src/test/modules/snapshot_too_old/. The only
reason the current code triggers any such errors is that

a) TransactionIdLimitedForOldSnapshots() is always called in
   heap_page_prune_opt(), even if the "not limited" horizon
   (i.e. RecentGlobalDataXmin) is more than old enough to allow for
   pruning. That includes pages without a pd_prune_xid.

b) TransactionIdLimitedForOldSnapshots(), in the old_snapshot_threshold
   == 0 branch, always calls
    SetOldSnapshotThresholdTimestamp(ts, xlimit)
   even if the horizon has not changed due to snapshot_too_old (xlimit
   is initially set tot the "input" horizon, and only increased if
   between (recentXmin, MyProc->xmin)).


To benefit from the snapshot scalability improvements in my patchset, it
is important to avoid unnecessarily determining the "accurate" xmin
horizon, if it's clear from the "lower boundary" horizon that pruning
can happen. Therefore I changed heap_page_prune_opt() and
heap_page_prune() to only limit when we couldn't prune.

In the course of that I separated getting the horizon from
TransactionIdLimitedForOldSnapshots() and triggering errors when an
already removed tuple would be needed via
TransactionIdLimitedForOldSnapshots().

Because there are no occasions to actually remove tuples in the entire
test, there now were no TransactionIdLimitedForOldSnapshots() calls. And
thus no errors.  My code turns out to actually work.


Thus, if I change the code in master from:
        TransactionId xlimit = recentXmin;
...
        if (old_snapshot_threshold == 0)
        {
            if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
                && TransactionIdFollows(latest_xmin, xlimit))
                xlimit = latest_xmin;

            ts -= 5 * USECS_PER_SEC;
            SetOldSnapshotThresholdTimestamp(ts, xlimit);

            return xlimit;
        }

to
...
        if (old_snapshot_threshold == 0)
        {
            if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin)
                && TransactionIdFollows(latest_xmin, xlimit))
            {
                xlimit = latest_xmin;
                SetOldSnapshotThresholdTimestamp(ts, xlimit);
            }

            ts -= 5 * USECS_PER_SEC;

            return xlimit;
        }

there's not a single error raised in the existing tests. Not a *single*
tuple removal is caused by old_snapshot_threshold. We just test the
order of SetOldSnapshotThresholdTimestamp() calls. We have code in the
backend to support testing old_snapshot_threshold, but we don't test
anything meaningful in the feature. We basically test a oddly behaving
version version of "transaction_timeout = 5s".  I can't emphasize enough
how baffling I find this.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Berserk Autovacuum (let's save next Mandrill)
Next
From: Andres Freund
Date:
Subject: Re: snapshot too old issues, first around wraparound and then more.