Re: snapshot too old issues, first around wraparound and then more. - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: snapshot too old issues, first around wraparound and then more. |
Date | |
Msg-id | 20200403001235.e6jfdll3gh2ygbuc@alap3.anarazel.de Whole thread Raw |
In response to | Re: snapshot too old issues, first around wraparound and then more. (Robert Haas <robertmhaas@gmail.com>) |
List | pgsql-hackers |
Hi, I just spend a good bit more time improving my snapshot patch, so it could work well with a fixed version of the old_snapshot_threshold feature. Mostly so there's no unnecessary dependency on the resolution of the issues in that patch. When testing my changes, for quite a while, I could not get src/test/modules/snapshot_too_old/ to trigger a single too-old error. It turns out, that's because there's not a single tuple removed due to old_snapshot_threshold in src/test/modules/snapshot_too_old/. The only reason the current code triggers any such errors is that a) TransactionIdLimitedForOldSnapshots() is always called in heap_page_prune_opt(), even if the "not limited" horizon (i.e. RecentGlobalDataXmin) is more than old enough to allow for pruning. That includes pages without a pd_prune_xid. b) TransactionIdLimitedForOldSnapshots(), in the old_snapshot_threshold == 0 branch, always calls SetOldSnapshotThresholdTimestamp(ts, xlimit) even if the horizon has not changed due to snapshot_too_old (xlimit is initially set tot the "input" horizon, and only increased if between (recentXmin, MyProc->xmin)). To benefit from the snapshot scalability improvements in my patchset, it is important to avoid unnecessarily determining the "accurate" xmin horizon, if it's clear from the "lower boundary" horizon that pruning can happen. Therefore I changed heap_page_prune_opt() and heap_page_prune() to only limit when we couldn't prune. In the course of that I separated getting the horizon from TransactionIdLimitedForOldSnapshots() and triggering errors when an already removed tuple would be needed via TransactionIdLimitedForOldSnapshots(). Because there are no occasions to actually remove tuples in the entire test, there now were no TransactionIdLimitedForOldSnapshots() calls. And thus no errors. My code turns out to actually work. Thus, if I change the code in master from: TransactionId xlimit = recentXmin; ... if (old_snapshot_threshold == 0) { if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin) && TransactionIdFollows(latest_xmin, xlimit)) xlimit = latest_xmin; ts -= 5 * USECS_PER_SEC; SetOldSnapshotThresholdTimestamp(ts, xlimit); return xlimit; } to ... if (old_snapshot_threshold == 0) { if (TransactionIdPrecedes(latest_xmin, MyPgXact->xmin) && TransactionIdFollows(latest_xmin, xlimit)) { xlimit = latest_xmin; SetOldSnapshotThresholdTimestamp(ts, xlimit); } ts -= 5 * USECS_PER_SEC; return xlimit; } there's not a single error raised in the existing tests. Not a *single* tuple removal is caused by old_snapshot_threshold. We just test the order of SetOldSnapshotThresholdTimestamp() calls. We have code in the backend to support testing old_snapshot_threshold, but we don't test anything meaningful in the feature. We basically test a oddly behaving version version of "transaction_timeout = 5s". I can't emphasize enough how baffling I find this. Greetings, Andres Freund
pgsql-hackers by date: