Re: Snapshot related assert failure on skink - Mailing list pgsql-hackers
From | Tomas Vondra |
---|---|
Subject | Re: Snapshot related assert failure on skink |
Date | |
Msg-id | 33c9a3ed-b850-4b21-ac6c-9d9b3d7fe155@vondra.me Whole thread Raw |
In response to | Re: Snapshot related assert failure on skink (Heikki Linnakangas <hlinnaka@iki.fi>) |
List | pgsql-hackers |
On 3/24/25 16:25, Heikki Linnakangas wrote: > On 24/03/2025 16:56, Tomas Vondra wrote: >> >> >> On 3/23/25 17:43, Heikki Linnakangas wrote: >>> On 21/03/2025 17:16, Andres Freund wrote: >>>> Am I right in understanding that the only scenario (when in >>>> STANDBY_SNAPSHOT_READY), where ExpireOldKnownAssignedTransactionIds() >>>> would >>>> "legally" remove a transaction, rather than the commit / abort records >>>> doing >>>> so, is if the primary crash-restarted while transactions were ongoing? >>>> >>>> Those transactions won't have a commit/abort records and thus won't >>>> trigger >>>> ExpireTreeKnownAssignedTransactionIds(), which otherwise would have >>>> updated >>>> ->xactCompletionCount? >>> >>> Correct. >>> >>>> When writing the snapshot caching patch, I tried to make sure that all >>>> the >>>> places that maintain ->latestCompletedXid also update >>>> ->xactCompletionCount. Afaict that's still the case. Which, I think, >>>> means >>>> that we're also missing calls to MaintainLatestCompletedXidRecovery()? >>> >>> Yep, I was just looking at that too. >>> >>>> If latestCompletedXid is incorrect visibility determinations end up >>>> wrong... >>> >>> I think it happens to work, because the transactions are effectively >>> aborted. latestCompletedXid is used to initialize xmax in >>> GetSnapshotData. If, for example, latestCompletedXid is incorrectly set >>> to 1000 even though XID 1001 already aborted, a snapshot with xmax=1000 >>> still correctly considers XID 1001 as "not visible". As soon as a >>> transaction commits, latestCompletedXid is fixed. >>> >>> AFAICS we could skip updating latestCompletedXid on aborts altogether >>> and rename it to latestCommittedXid. But it's hardly worth optimizing >>> for aborts. >>> >>> For the same reason, I believe the assertion failure we're discussing >>> here is also harmless. Even though the reused snapshot has a smaller >>> xmin than expected, all those transactions aborted and are thus not >>> visible anyway. >>> >>> In any case, let's be tidy and fix both latestCompletedXid and >>> xactCompletionCount. >>> >> >> Thanks for looking into this and pushing the fix. >> >> Would it make sense to add a comment documenting this reasoning about >> not handling aborts? Otherwise someone will get to rediscover this in >> the future ... > > Well, we're currently _not_ relying on the reasoning about aborts not > changing visibility. So it's not like someone will forget the reasoning > and have a bug as a result. I guess someone could rediscover that > reasoning and think that they can skip advancing those counters on > aborts as an optimization, re-introducing the assertion failure. But I > believe that was not why we had this bug, it was just a simple omission. > > So the comment would look like this: > > "In principle, we would not need to advance xactCompletionCount and > latestCompletedXid because aborting transactions don't change > visibility. But that could make xmin within a transaction go backwards, > if a snapshot with an older xmin but same xactCompletionCount is reused, > triggering the assertion in GetSnapshotDataReuse()." > > If we add that, I guess it should go into ProcArrayEndTransaction() > which currently just notes that it's used for commits and aborts > interchangeably. Do you think it's worth it? Want to propose a patch? > I think the wording you suggested above is reasonable - I certainly don't have a better one. I don't know what's the likelihood of someone breaking this, but I think it makes sense to mention this because the assert is non-trivial to hit. regards -- Tomas Vondra
pgsql-hackers by date: