snapshot too old issues, first around wraparound and then more

Hi,

Sorry, this mail is somewhat long. But I think it's important that at
least a few committers read it, since I think we're going to have to
make some sort of call about what to do.


I am trying to change the snapshot too old infrastructure so it
cooperates with my snapshot scalability patch. While trying to
understand the code sufficiently, I think I found a fairly serious
issue:

To map the time-based old_snapshot_threshold to an xid that can be used
as a cutoff for heap_page_prune(), we maintain a ringbuffer of
old_snapshot_threshold + 10 entries in
oldSnapshotControl->xid_by_minute[]. TransactionIdLimitedForOldSnapshots()
uses that to (at least that's the goal) increase the horizon used for
pruning.
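
For reference, the relevant parts of the control structure look roughly
like this (paraphrased from snapmgr.c, not verbatim; locking and other
fields elided):

typedef struct OldSnapshotControlData
{
    ...
    TimestampTz threshold_timestamp;    /* earlier snapshot is old */
    TransactionId threshold_xid;        /* earlier xid may be gone */

    /*
     * One xid per minute, used to map a cutoff timestamp back to an xid
     * usable as a pruning horizon.
     */
    TimestampTz head_timestamp;     /* timestamp of the "head" entry */
    int         head_offset;        /* ring position of the head entry */
    int         count_used;         /* number of entries currently in use */
    TransactionId xid_by_minute[FLEXIBLE_ARRAY_MEMBER];
} OldSnapshotControlData;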

The problem is that there's no protection against the xids in the
ringbuffer getting old enough to wrap around. Given that practical uses
of old_snapshot_threshold are likely to be several hours to several
days, that's not particularly hard to hit (e.g. at ~10000 xid
assignments per second, 2^31 xids are burned through in roughly 2.5
days).

That then has the consequence that we can use an xid that's either
"from the future" (if bigger than the current xid), or more recent than
appropriate (if it wrapped far enough to be below nextXid, but not yet
older than OldestXmin) as the OldestXmin argument to heap_page_prune().

Which in turn means that we can end up pruning rows that were removed
much more recently than intended.

While that'd be bad on its own, the big problem is that we won't detect
that case on access, in contrast to the way old_snapshot_threshold is
intended to work. That's because these errors are detected on the basis
of timestamps - which obviously do not wrap around.
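
To illustrate the wraparound part itself: xid comparisons are circular
modulo 2^32, so once an entry in the ringbuffer falls more than 2^31
xids behind, comparisons against it flip around. A standalone toy
example, using the same comparison rule as TransactionIdPrecedes()
(simplified, ignoring special/frozen xids; the xid values are obviously
made up):

#include <stdint.h>
#include <stdio.h>

/* same rule as TransactionIdPrecedes(): a precedes b iff the signed
 * 32 bit difference is negative */
static int
xid_precedes(uint32_t a, uint32_t b)
{
    return (int32_t) (a - b) < 0;
}

int
main(void)
{
    uint32_t stale = 1000000;               /* xid left behind in the ringbuffer */
    uint32_t cur = stale + (1U << 31) + 5;  /* more than 2^31 xids later */

    /* prints 0: the stale entry no longer "precedes" current xids,
     * i.e. it now looks like it's from the future, despite being
     * ancient */
    printf("%d\n", xid_precedes(stale, cur));
    return 0;
}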


This made me want to try to reproduce the problem to at least some
degree. But I hit another wall: I can't make heads or tails out of the
values in the xid_by_minute[] mapping.


I added some debug output to print the mapping before/after changes by
MaintainOldSnapshotTimeMapping() (note that I used timestamps relative
to the server start in minutes/seconds to make it easier to interpret).

And the output turns out to be something like:

WARNING:  old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
  entry 0 (ring 0): min 7: xid 582921233
  entry 1 (ring 1): min 8: xid 654154155
  entry 2 (ring 2): min 9: xid 661972949
  entry 3 (ring 3): min 10: xid 666899382
  entry 4 (ring 4): min 11: xid 644169619
  entry 5 (ring 5): min 12: xid 644169619
  entry 6 (ring 6): min 13: xid 644169619
  entry 7 (ring 7): min 14: xid 644169619

WARNING:  head 420 s: updating existing bucket 4 for sec 660 with xmin 666899382

WARNING:  old snapshot mapping at "after update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
  entry 0 (ring 0): min 7: xid 582921233
  entry 1 (ring 1): min 8: xid 654154155
  entry 2 (ring 2): min 9: xid 661972949
  entry 3 (ring 3): min 10: xid 666899382
  entry 4 (ring 4): min 11: xid 666899382
  entry 5 (ring 5): min 12: xid 644169619
  entry 6 (ring 6): min 13: xid 644169619
  entry 7 (ring 7): min 14: xid 644169619

It's pretty obvious that the xids don't make a ton of sense, I think:
They're not monotonically ordered. The same values exist multiple times,
despite xids constantly being used. Also, despite the ringbuffer
supposedly having 15 entries (that's old_snapshot_threshold = 5min + the
10 we always add), and the workload having run for 14min, we only have 8
entries.

Then a bit later we get:

WARNING:  old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
  entry 0 (ring 0): min 7: xid 582921233
  entry 1 (ring 1): min 8: xid 654154155
  entry 2 (ring 2): min 9: xid 661972949
  entry 3 (ring 3): min 10: xid 666899382
  entry 4 (ring 4): min 11: xid 666899382
  entry 5 (ring 5): min 12: xid 666899382
  entry 6 (ring 6): min 13: xid 666899382
  entry 7 (ring 7): min 14: xid 666899382

WARNING:  head 420 s: filling 8 buckets starting at 0 for sec 900 with xmin 666899382
WARNING:  old snapshot mapping at "after update" with head ts: 15, current entries: 15 max entries: 15, offset: 1
  entry 0 (ring 1): min 15: xid 654154155
  entry 1 (ring 2): min 16: xid 661972949
  entry 2 (ring 3): min 17: xid 666899382
  entry 3 (ring 4): min 18: xid 666899382
  entry 4 (ring 5): min 19: xid 666899382
  entry 5 (ring 6): min 20: xid 666899382
  entry 6 (ring 7): min 21: xid 666899382
  entry 7 (ring 8): min 22: xid 666899382
  entry 8 (ring 9): min 23: xid 666899382
  entry 9 (ring 10): min 24: xid 666899382
  entry 10 (ring 11): min 25: xid 666899382
  entry 11 (ring 12): min 26: xid 666899382
  entry 12 (ring 13): min 27: xid 666899382
  entry 13 (ring 14): min 28: xid 666899382
  entry 14 (ring 0): min 29: xid 666899382


At a later point we then enter the "Advance is so far that all old data
is junk; start over." branch, and just reset the whole mapping:
  entry 0 (ring 0): min 30: xid 866711525


The problem, as far as I can tell, is that
oldSnapshotControl->head_timestamp appears to be intended to be the
timestamp of the oldest entry in the ring. But we update it
unconditionally in the "need a new bucket, but it might not be the very
next one" branch of MaintainOldSnapshotTimeMapping().
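
For reference, the branch in question looks roughly like this
(paraphrased, not verbatim):

    else
    {
        /* We need a new bucket, but it might not be the very next one. */
        int     advance = ((ts - oldSnapshotControl->head_timestamp)
                           / USECS_PER_MINUTE);

        /*
         * The unconditional update: head_timestamp now points at the
         * newest minute, even though most of the entries left in the
         * ring are still for older minutes.
         */
        oldSnapshotControl->head_timestamp = ts;

        if (advance >= OLD_SNAPSHOT_TIME_MAP_ENTRIES)
        {
            /* Advance is so far that all old data is junk; start over. */
            ...
        }
        else
        {
            /* Store the new value in one or more buckets. */
            ...
        }
    }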

While there's not really a clear-cut comment explaining whether
head_timestamp is intended to be the oldest or the newest timestamp, it
seems to me that the rest of the code treats it as the "oldest"
timestamp.

TransactionId
TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
                                    Relation relation)
...
        ts = AlignTimestampToMinuteBoundary(ts)
            - (old_snapshot_threshold * USECS_PER_MINUTE);
...
                LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);

                if (oldSnapshotControl->count_used > 0
                    && ts >= oldSnapshotControl->head_timestamp)
                {
                    int         offset;

                    offset = ((ts - oldSnapshotControl->head_timestamp)
                              / USECS_PER_MINUTE);
                    if (offset > oldSnapshotControl->count_used - 1)
                        offset = oldSnapshotControl->count_used - 1;
                    offset = (oldSnapshotControl->head_offset + offset)
                        % OLD_SNAPSHOT_TIME_MAP_ENTRIES;
                    xlimit = oldSnapshotControl->xid_by_minute[offset];

                    if (NormalTransactionIdFollows(xlimit, recentXmin))
                        SetOldSnapshotThresholdTimestamp(ts, xlimit);
                }

                LWLockRelease(OldSnapshotTimeMapLock);

So we wind ts back by old_snapshot_threshold minutes, and then check
that that's still newer than oldSnapshotControl->head_timestamp - which
could hardly ever be the case if head_timestamp were the newest ts. And
as far as I can tell the indexing code also only makes sense if
head_timestamp is the oldest timestamp.
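
To make that concrete with the numbers from the first debug dump above
(old_snapshot_threshold = 5min, head ts: 7, head_offset 0, count_used 8,
15 max entries): a lookup at minute 14 computes ts = 14 - 5 = 9, offset
= (9 - 7) = 2, ring index = (0 + 2) % 15 = 2, and reads xid_by_minute[2]
= 661972949, i.e. the xmin recorded at minute 9 - exactly what a 5
minute threshold should yield. But that arithmetic only works out if
head_timestamp is the timestamp of the oldest entry.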


This would mean that in most cases where the old_snapshot_threshold
feature is active, it would cause corruption: We'd not trigger errors on
access, because the timestamp set with SetOldSnapshotThresholdTimestamp()
would not actually match the xids used to limit.  But:

It turns out to be somewhat hard to get
TransactionIdLimitedForOldSnapshots() to actually do something. Because
oldSnapshotControl->head_timestamp is updated much more often than it
should be, the ts >= oldSnapshotControl->head_timestamp condition will
often prevent the limiting code from being hit.

But it's not very unlikely either. Due to the update of head_timestamp
to the current timestamp, we'll enter the "existing mapping; advance xid
if possible" branch for up to OLD_SNAPSHOT_TIME_MAP_ENTRIES times. Which
means we can hit it for
/*
 * The structure used to map times to TransactionId values for the "snapshot
 * too old" feature must have a few entries at the tail to hold old values;
 * otherwise the lookup will often fail and the expected early pruning or
 * vacuum will not usually occur.  It is best if this padding is for a number
 * of minutes greater than a thread would normally be stalled, but it's OK if
 * early vacuum opportunities are occasionally missed, so there's no need to
 * use an extreme value or get too fancy.  10 minutes seems plenty.
 */
#define OLD_SNAPSHOT_PADDING_ENTRIES 10
#define OLD_SNAPSHOT_TIME_MAP_ENTRIES (old_snapshot_threshold + OLD_SNAPSHOT_PADDING_ENTRIES)

10 minutes, I think. There are some other ways too, but they're much
less likely.
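
Roughly, if I'm following the branches correctly: once head_timestamp
has been bumped to the current minute T (with the map full, i.e.
count_used = old_snapshot_threshold + 10), updates keep landing in the
"existing mapping" branch for about the next old_snapshot_threshold + 10
minutes, so head_timestamp stays at T that long. The lookup needs
ts = now - old_snapshot_threshold >= T, i.e. now >= T +
old_snapshot_threshold. That leaves a window from about
T + old_snapshot_threshold to T + old_snapshot_threshold + 10 in which
the limiting code can actually be reached - the ~10 padding minutes.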

Note that once the issue has been hit once, subsequent
TransactionIdLimitedForOldSnapshots() calls that don't fall into those 10
minutes can also return a corrupted horizon, because
oldSnapshotControl->threshold_xid (stored by the earlier
SetOldSnapshotThresholdTimestamp() call) will have the wrong value, which
then will be used:
        /*
         * Failsafe protection against vacuuming work of active transaction.
         *
         * This is not an assertion because we avoid the spinlock for
         * performance, leaving open the possibility that xlimit could advance
         * and be more current; but it seems prudent to apply this limit.  It
         * might make pruning a tiny bit less aggressive than it could be, but
         * protects against data loss bugs.
         */
        if (TransactionIdIsNormal(latest_xmin)
            && TransactionIdPrecedes(latest_xmin, xlimit))
            xlimit = latest_xmin;

        if (NormalTransactionIdFollows(xlimit, recentXmin))
            return xlimit;



As far as I can tell, this code has been wrong since the feature was
committed.  The tests don't show a problem, because none of this code is
reached when old_snapshot_threshold = 0 (which has no real-world use;
it's purely for testing).


I really don't know what to do here. The feature never worked and will
silently cause wrong query results. Fixing it seems like a pretty large
task - there are a lot more bugs. But ripping out a feature in stable
branches is pretty bad too.


Before figuring out the above, I spent the last several days trying to
make this feature work with my snapshot scalability patch, while trying
to avoid regressing old_snapshot_threshold behaviour / performance. But
now it seems to me that there's no actual working feature that can be
preserved.


I am really tired.


Andres.


