snapshot too old issues, first around wraparound and then more. - Mailing list pgsql-hackers
From | Andres Freund
---|---
Subject | snapshot too old issues, first around wraparound and then more.
Date |
Msg-id | 20200401064008.qob7bfnnbu4w5cw4@alap3.anarazel.de
Responses | 4 × Re: snapshot too old issues, first around wraparound and then more.
List | pgsql-hackers
Hi,

Sorry, this mail is somewhat long. But I think it's important that at least a few committers read it, since I think we're going to have to make some sort of call about what to do.

I am trying to change the snapshot too old infrastructure so it cooperates with my snapshot scalability patch. While trying to understand the code sufficiently, I think I found a fairly serious issue:

To map the time-based old_snapshot_threshold to an xid that can be used as a cutoff for heap_page_prune(), we maintain a ringbuffer of old_snapshot_threshold + 10 entries in oldSnapshotControl->xid_by_minute[]. TransactionIdLimitedForOldSnapshots() uses that to (at least that's the goal) increase the horizon used for pruning.

The problem is that there's no protection against the xids in the ringbuffer getting old enough to wrap around. Given that practical uses of old_snapshot_threshold are likely to be several hours to several days, that's not particularly hard to hit.

The consequence is that we can use an xid that's either "from the future" (if bigger than the current xid), or more recent than appropriate (if it wrapped far enough to be below nextXid, but not yet older than OldestXmin), as the OldestXmin argument to heap_page_prune(). Which in turn means that we can end up pruning rows that were removed much more recently than intended.

While that'd be bad on its own, the big problem is that we won't detect that case on access, in contrast to the way old_snapshot_threshold is intended to work. The reason for that is that detecting these errors happens on the basis of timestamps - which obviously do not wrap around.
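To make that failure mode concrete, here is a minimal, self-contained sketch - not PostgreSQL code. It merely replicates the modulo-2^31 comparison that TransactionIdPrecedes() performs for normal xids, ignores the fact that real servers skip the special xids below FirstNormalTransactionId on wraparound, and uses invented xid values:

```c
#include <stdint.h>
#include <stdio.h>

typedef uint32_t TransactionId;

/*
 * Mirrors the circular comparison TransactionIdPrecedes() does for normal
 * xids: id1 precedes id2 iff the 32-bit wrapping difference is negative.
 */
static int
xid_precedes(TransactionId id1, TransactionId id2)
{
	return (int32_t) (id1 - id2) < 0;
}

int
main(void)
{
	TransactionId stale = 1000000;	/* parked in xid_by_minute[] */

	/* a bit more than 2^31 xids get consumed while the entry sits there */
	TransactionId next = stale + (UINT32_C(1) << 31) + 100;

	/* shortly after being stored, the entry still compares sanely */
	printf("%d\n", xid_precedes(stale, stale + 100));	/* 1: in the past */

	/* once the wraparound distance is exceeded, the comparison flips */
	printf("%d\n", xid_precedes(stale, next));	/* 0: "from the future" */
	printf("%d\n", xid_precedes(next, stale));	/* 1 */
	return 0;
}
```

An xid that lingers in the mapping long enough therefore stops looking old and starts looking new - which is exactly what makes it dangerous as a pruning horizon.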
This made me want to try to reproduce the problem to at least some degree. But I hit another wall: I can't make heads or tails out of the values in the xid_by_minute[] mapping.

I added some debug output to print the mapping before/after changes by MaintainOldSnapshotTimeMapping() (note that I used timestamps relative to the server start, in minutes/seconds, to make them easier to interpret). The output turns out to be something like:

```
WARNING: old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
entry 0 (ring 0): min 7: xid 582921233
entry 1 (ring 1): min 8: xid 654154155
entry 2 (ring 2): min 9: xid 661972949
entry 3 (ring 3): min 10: xid 666899382
entry 4 (ring 4): min 11: xid 644169619
entry 5 (ring 5): min 12: xid 644169619
entry 6 (ring 6): min 13: xid 644169619
entry 7 (ring 7): min 14: xid 644169619
WARNING: head 420 s: updating existing bucket 4 for sec 660 with xmin 666899382
WARNING: old snapshot mapping at "after update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
entry 0 (ring 0): min 7: xid 582921233
entry 1 (ring 1): min 8: xid 654154155
entry 2 (ring 2): min 9: xid 661972949
entry 3 (ring 3): min 10: xid 666899382
entry 4 (ring 4): min 11: xid 666899382
entry 5 (ring 5): min 12: xid 644169619
entry 6 (ring 6): min 13: xid 644169619
entry 7 (ring 7): min 14: xid 644169619
```

It's pretty obvious that the xids don't make a ton of sense, I think: they're not monotonically ordered, and the same values exist multiple times, despite xids constantly being used. Also, despite the ringbuffer supposedly having 15 entries (that's old_snapshot_threshold = 5min, plus the 10 we always add), and the workload having run for 14min, we only have 8 entries.

Then a bit later we get:

```
WARNING: old snapshot mapping at "before update" with head ts: 7, current entries: 8 max entries: 15, offset: 0
entry 0 (ring 0): min 7: xid 582921233
entry 1 (ring 1): min 8: xid 654154155
entry 2 (ring 2): min 9: xid 661972949
entry 3 (ring 3): min 10: xid 666899382
entry 4 (ring 4): min 11: xid 666899382
entry 5 (ring 5): min 12: xid 666899382
entry 6 (ring 6): min 13: xid 666899382
entry 7 (ring 7): min 14: xid 666899382
WARNING: head 420 s: filling 8 buckets starting at 0 for sec 900 with xmin 666899382
WARNING: old snapshot mapping at "after update" with head ts: 15, current entries: 15 max entries: 15, offset: 1
entry 0 (ring 1): min 15: xid 654154155
entry 1 (ring 2): min 16: xid 661972949
entry 2 (ring 3): min 17: xid 666899382
entry 3 (ring 4): min 18: xid 666899382
entry 4 (ring 5): min 19: xid 666899382
entry 5 (ring 6): min 20: xid 666899382
entry 6 (ring 7): min 21: xid 666899382
entry 7 (ring 8): min 22: xid 666899382
entry 8 (ring 9): min 23: xid 666899382
entry 9 (ring 10): min 24: xid 666899382
entry 10 (ring 11): min 25: xid 666899382
entry 11 (ring 12): min 26: xid 666899382
entry 12 (ring 13): min 27: xid 666899382
entry 13 (ring 14): min 28: xid 666899382
entry 14 (ring 0): min 29: xid 666899382
```

At a later point we then enter the "Advance is so far that all old data is junk; start over." branch, and just reset the whole mapping:

```
entry 0 (ring 0): min 30: xid 866711525
```

The problem, as far as I can tell, is that oldSnapshotControl->head_timestamp appears to be intended to be the oldest value in the ring, but we update it unconditionally in the "need a new bucket, but it might not be the very next one" branch of MaintainOldSnapshotTimeMapping(). While there's not really a clear-cut comment explaining whether head_timestamp is intended to be the oldest or the newest timestamp, it seems to me that the rest of the code treats it as the oldest:

```c
TransactionId
TransactionIdLimitedForOldSnapshots(TransactionId recentXmin,
									Relation relation)
...
		ts = AlignTimestampToMinuteBoundary(ts)
			- (old_snapshot_threshold * USECS_PER_MINUTE);
...
		LWLockAcquire(OldSnapshotTimeMapLock, LW_SHARED);

		if (oldSnapshotControl->count_used > 0
			&& ts >= oldSnapshotControl->head_timestamp)
		{
			int			offset;

			offset = ((ts - oldSnapshotControl->head_timestamp)
					  / USECS_PER_MINUTE);
			if (offset > oldSnapshotControl->count_used - 1)
				offset = oldSnapshotControl->count_used - 1;
			offset = (oldSnapshotControl->head_offset + offset)
				% OLD_SNAPSHOT_TIME_MAP_ENTRIES;
			xlimit = oldSnapshotControl->xid_by_minute[offset];

			if (NormalTransactionIdFollows(xlimit, recentXmin))
				SetOldSnapshotThresholdTimestamp(ts, xlimit);
		}

		LWLockRelease(OldSnapshotTimeMapLock);
```

So we wind ts back by old_snapshot_threshold minutes, and then check that the result is still newer than oldSnapshotControl->head_timestamp - which clearly couldn't be the case if head_timestamp were the newest timestamp. And as far as I can tell, the indexing code also only makes sense if head_timestamp is the oldest timestamp.
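To spell the indexing out, here's the same lookup arithmetic with the shared-memory structure reduced to local variables - a sketch for illustration only, seeded with the bucket contents from the "before update" dump above (the lock, the relation checks, and everything else are left out):

```c
#include <stdint.h>
#include <stdio.h>

#define USECS_PER_MINUTE ((int64_t) 60 * 1000 * 1000)
#define MAP_ENTRIES 15			/* old_snapshot_threshold = 5, plus 10 padding */

int
main(void)
{
	/* state as in the "before update" dump: the oldest bucket is minute 7 */
	int64_t		head_timestamp = 7 * USECS_PER_MINUTE;
	int			head_offset = 0;
	int			count_used = 8;
	uint32_t	xid_by_minute[MAP_ENTRIES] = {
		582921233, 654154155, 661972949, 666899382,
		666899382, 666899382, 666899382, 666899382
	};

	/* ts = now - old_snapshot_threshold, minute-aligned; say minute 9 */
	int64_t		ts = 9 * USECS_PER_MINUTE;

	if (count_used > 0 && ts >= head_timestamp)
	{
		/* minutes past the OLDEST bucket; this would be negative if the
		 * head were the newest bucket, and the branch never entered */
		int			offset = (int) ((ts - head_timestamp) / USECS_PER_MINUTE);

		if (offset > count_used - 1)
			offset = count_used - 1;
		offset = (head_offset + offset) % MAP_ENTRIES;
		printf("xlimit = %u\n", xid_by_minute[offset]);	/* 661972949 */
	}
	return 0;
}
```

With head_timestamp as the oldest minute, the wound-back ts lands two buckets past the head and picks up that minute's xid; with the newest minute at the head, ts - head_timestamp would be negative for every wound-back lookup and limiting would never happen.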
This would mean that in most cases where the old_snapshot_threshold feature is active, it would cause corruption: we'd not trigger errors on access, because the timestamp set with SetOldSnapshotThresholdTimestamp() would not actually match the xids used to limit.

But: It turns out to be somewhat hard to get TransactionIdLimitedForOldSnapshots() to actually do something. Because oldSnapshotControl->head_timestamp is updated much more often than it should be, the ts >= oldSnapshotControl->head_timestamp condition will often prevent the limiting code from being hit. But it's not very unlikely either: due to the update of head_timestamp to the current timestamp, we'll enter the "existing mapping; advance xid if possible" branch for up to OLD_SNAPSHOT_TIME_MAP_ENTRIES times. Which means we can hit it for 10 minutes (the padding), I think:

```c
/*
 * The structure used to map times to TransactionId values for the "snapshot
 * too old" feature must have a few entries at the tail to hold old values;
 * otherwise the lookup will often fail and the expected early pruning or
 * vacuum will not usually occur.  It is best if this padding is for a number
 * of minutes greater than a thread would normally be stalled, but it's OK if
 * early vacuum opportunities are occasionally missed, so there's no need to
 * use an extreme value or get too fancy.  10 minutes seems plenty.
 */
#define OLD_SNAPSHOT_PADDING_ENTRIES 10
#define OLD_SNAPSHOT_TIME_MAP_ENTRIES (old_snapshot_threshold + OLD_SNAPSHOT_PADDING_ENTRIES)
```

There are some other ways too, but they're much less likely.

Note that once the issue has been hit, later calls to TransactionIdLimitedForOldSnapshots() that don't fall within those 10 minutes will also return a corrupted horizon, because oldSnapshotControl->threshold_xid (set via SetOldSnapshotThresholdTimestamp()) will have the wrong value, which then will be used:

```c
	/*
	 * Failsafe protection against vacuuming work of active transaction.
	 *
	 * This is not an assertion because we avoid the spinlock for
	 * performance, leaving open the possibility that xlimit could advance
	 * and be more current; but it seems prudent to apply this limit.  It
	 * might make pruning a tiny bit less aggressive than it could be, but
	 * protects against data loss bugs.
	 */
	if (TransactionIdIsNormal(latest_xmin)
		&& TransactionIdPrecedes(latest_xmin, xlimit))
		xlimit = latest_xmin;

	if (NormalTransactionIdFollows(xlimit, recentXmin))
		return xlimit;
```

As far as I can tell, this code has been wrong since the feature was committed. The tests don't show a problem, because none of this code is reached when old_snapshot_threshold = 0 (which has no real-world use; it's purely for testing).

I really don't know what to do here. The feature never worked and will silently cause wrong query results. Fixing it seems like a pretty large task - there are a lot more bugs. But ripping out a feature in stable branches is pretty bad too.

Before figuring out the above, I spent the last several days trying to make this feature work with my snapshot scalability patch, trying to avoid regressing old_snapshot_threshold behaviour / performance. But now it seems to me that there's no actual working feature that can be preserved.

I am really tired.

Andres.
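PS: For contrast, here's a sketch of what the ring maintenance would have to look like for head_timestamp to stay the oldest entry, which is what the lookup arithmetic appears to assume. This is a reconstruction of the apparent intent, not the actual MaintainOldSnapshotTimeMapping() code; the "update existing bucket" and "start over" branches are omitted, and the xids are invented:

```c
#include <stdint.h>
#include <stdio.h>

#define USECS_PER_MINUTE ((int64_t) 60 * 1000 * 1000)
#define MAP_ENTRIES 15			/* old_snapshot_threshold = 5, plus 10 padding */

static int64_t head_timestamp;	/* minute of the OLDEST bucket in the map */
static int	head_offset;		/* ring slot holding that oldest bucket */
static int	count_used;
static uint32_t xid_by_minute[MAP_ENTRIES];

/* Extend the map so its newest bucket covers the minute-aligned "ts". */
static void
maintain(int64_t ts, uint32_t xmin)
{
	if (count_used == 0)
	{
		head_timestamp = ts;
		xid_by_minute[0] = xmin;
		count_used = 1;
		return;
	}

	while (head_timestamp + (count_used - 1) * USECS_PER_MINUTE < ts)
	{
		if (count_used < MAP_ENTRIES)
			count_used++;
		else
		{
			/* ring full: recycle the oldest slot... */
			head_offset = (head_offset + 1) % MAP_ENTRIES;
			/* ...and only now does the oldest timestamp advance */
			head_timestamp += USECS_PER_MINUTE;
		}
		xid_by_minute[(head_offset + count_used - 1) % MAP_ENTRIES] = xmin;
	}
}

int
main(void)
{
	for (int m = 7; m <= 30; m++)
		maintain(m * USECS_PER_MINUTE, 582921233 + (uint32_t) m * 1000);

	/* the map covers the last 15 minutes, i.e. minutes 16..30 */
	printf("oldest mapped minute: %lld\n",
		   (long long) (head_timestamp / USECS_PER_MINUTE));
	return 0;
}
```

With that invariant, head_timestamp only moves when a full ring forces the oldest bucket to be recycled, so the ts >= head_timestamp check and the offset computation in TransactionIdLimitedForOldSnapshots() would actually line up.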